From f65a5b0e2ede17e7c75de1d3492ace6568902e34 Mon Sep 17 00:00:00 2001
From: Andrey Avtomonov
Date: Mon, 29 Jun 2026 18:35:57 +0200
Subject: [PATCH] =?UTF-8?q?feat:=20ktx=20batch=20=E2=80=94=20scan=20resili?=
 =?UTF-8?q?ence,=20analytics=20SQL=20craft,=20connector=20hardening=20(#31?=
 =?UTF-8?q?2)?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

* docs: add spider2-specs handoff directory for benchmark-driven feature specs

* feat(cli): connection-scoped wiki pages

Add an optional `connections` frontmatter field so database-specific wiki knowledge can be scoped to a connection without polluting searches about other databases, while page keys stay a flat, globally-unique namespace.

- connections: single string or list; absent/empty ⇒ unscoped (applies to all)
- wiki_search (MCP) and `ktx wiki --connection` return unscoped ∪ matching pages, filtered at the disk-load seam so all three search lanes draw their candidate pool from the already-scoped set (not a post-filter)
- wiki_write accepts connections with REPLACE semantics and rejects a connection-scoped write whose key collides with a disjoint-connection page (data-loss guard; hard error, no silent clobber)
- explicit connection-id args (wiki_search, memory_ingest, ktx wiki) are validated against ktx.yaml via a shared assertConfiguredConnectionId, which also closes the prior gap where memory_ingest's connectionId was unvalidated; persisted ids absent from config warn (not fail) in `ktx status`
- prompt guidance in the wiki_capture skill and external-ingest prompt; the session connectionId is surfaced to the memory agent and ingest work units

Implements spider2-specs/specs/01-connection-scoped-wiki.md; intake draft moved to spider2-specs/done/.
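
As a rough illustration of the scoping rule in the connection-scoped wiki commit above (unscoped ∪ matching, applied before any search lane runs), a minimal sketch follows. The `WikiPage` shape and the `normalizeConnections` / `scopePages` helpers are hypothetical names chosen for this example and are not taken from the ktx source.

```typescript
// Hypothetical page shape for illustration; not the actual ktx type.
type WikiPage = { key: string; connections?: string | string[] };

// `connections` may be a single string or a list; absent or empty means unscoped.
function normalizeConnections(value: string | string[] | undefined): string[] {
  if (value === undefined) return [];
  const list = typeof value === "string" ? [value] : value;
  return list.map((c) => c.trim()).filter((c) => c.length > 0);
}

// Keep unscoped pages plus pages that list the requested connection.
// Filtering happens up front, so later search stages only see in-scope pages.
function scopePages(pages: WikiPage[], connectionId: string): WikiPage[] {
  return pages.filter((page) => {
    const scoped = normalizeConnections(page.connections);
    return scoped.length === 0 || scoped.indexOf(connectionId) !== -1;
  });
}
```

Filtering the candidate pool first, rather than trimming results afterwards, is what keeps a minority connection from being crowded out by pages that will be discarded anyway.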
* docs(spider2-specs): add specs/ refinement stage and composite-key join spec Describe the todo/ → specs/ → done/ pipeline in the README (refined specs are the durable artifact; intake drafts move to done/ on ship) and add a MEDIUM-priority spec for multi-column composite-key join detection found during the first sqlite smoke test. * feat(cli): add --verbatim ingest mode for authoritative documents Store each --text/--file document body unchanged as a GLOBAL wiki page instead of routing it through the memory agent, which may rewrite, condense, or re-title it. The LLM derives only metadata (summary, tags, sl_refs) and only for frontmatter fields the document does not already set; the stored body is written by code and never edited. - Deterministic page key: files derive it from the filename, inline text from its leading Markdown heading (headless inline text is rejected — pass it as --file instead). - Idempotent: re-running the same body is a no-op; a different body at the same key fails loudly rather than overwriting. - Works with llm.provider.backend: none, deriving a degraded summary from the heading or first sentence. - Existing frontmatter (including unmodeled fields like effective_date) passes through untouched; --connection-id scopes the page. * feat(cli): SQL-authoring craft and per-dialect notes tool for the analytics skill Spec 07: add a dialect-agnostic block to the ktx-analytics skill (schema discovery, composition, window-function correctness, numeric precision, answer completeness) with one worked window-then-filter example. Workflow steps gain pointers into it; existing guidance is unchanged. Spec 08: add a read-only sql_dialect_notes MCP tool returning a connection's engine SQL conventions (FQTN form, identifier quoting/case, date/time, top-N idiom, JSON access), resolved through the existing sqlAnalysisDialectForDriver path. 
Notes are per-dialect markdown files under context/sql-analysis/dialects, served by the tool and copied to dist (package-internal, never installed). Non-SQL connections return a clear KtxExpectedError. The flat skill gains a one-line pointer to the tool. Both spider2-specs intake drafts move to done/ with implementation notes. * feat(cli): tolerate objects that fail introspection during scan Isolate per-object introspection failures so one broken or inaccessible object no longer zeroes out a connection's whole semantic layer: the sqlite and bigquery connectors introspect each object defensively (tryIntrospectObject), the live-database adapter records a scan outcome and fetch report, and enabled_tables accepts catalog.db.name, db.name, or bare names with a clear no-match error. Includes matching ktx-daemon introspection changes, docs, and tests. * docs(spider2-specs): add 06-scan-tolerate-broken-objects spec * feat(cli): generalize analytics fan-out rule to multi-hop join chains The ktx-analytics skill's fan-out rule only reliably caught single-hop inflation; agents still silently fanned out on multi-hop chains where the offending one-to-many join sits several hops below the SUM/COUNT and is easy to miss. Rewrite the Composition rule so the danger reads as cumulative across the whole chain (pre-aggregate per measure-owning table), add an affirmative grain-verification habit (default: pre-aggregate to grain; escape hatch: COUNT(DISTINCT key) for pure counts only; SUM/AVG of a fanned-out measure must pre-aggregate), and add one generic wrong-vs-right worked example. Content-only and dialect-agnostic; no new tool, flag, or config. Implements spider2-specs/specs/09 and annotates spec 07's one-example constraint as superseded. 
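
The cumulative fan-out described in the commit above can be reproduced outside SQL. The sketch below uses invented in-memory tables (`orders`, `items`, `shipments`) to contrast summing an order-level measure after a two-hop join with pre-aggregating to the order grain first; all names and data are illustrative assumptions.

```typescript
type Order = { orderId: number; amount: number };
type Item = { orderId: number; sku: string };
type Shipment = { sku: string };

// Invented sample data: order 1 has two items, and item "a" shipped twice.
const orders: Order[] = [
  { orderId: 1, amount: 100 },
  { orderId: 2, amount: 50 },
];
const items: Item[] = [
  { orderId: 1, sku: "a" },
  { orderId: 1, sku: "b" },
  { orderId: 2, sku: "c" },
];
const shipments: Shipment[] = [{ sku: "a" }, { sku: "a" }, { sku: "b" }, { sku: "c" }];

// Wrong: join orders -> items -> shipments, then sum the order-level amount.
// Each order's amount repeats once per joined row, so inflation compounds per hop.
function fannedOutRevenue(): number {
  let total = 0;
  for (const o of orders)
    for (const i of items)
      if (i.orderId === o.orderId)
        for (const s of shipments) if (s.sku === i.sku) total += o.amount;
  return total;
}

// Right: collapse the child chain to one row per order, then combine one-to-one.
// Assumes each sku belongs to exactly one order, as in the sample data.
function preAggregated(): { revenue: number; shipmentCount: number } {
  const orderOfSku = new Map<string, number>();
  for (const i of items) orderOfSku.set(i.sku, i.orderId);
  const shipmentsPerOrder = new Map<number, number>();
  for (const s of shipments) {
    const orderId = orderOfSku.get(s.sku);
    if (orderId !== undefined) {
      shipmentsPerOrder.set(orderId, (shipmentsPerOrder.get(orderId) ?? 0) + 1);
    }
  }
  let revenue = 0;
  let shipmentCount = 0;
  for (const o of orders) {
    revenue += o.amount;
    shipmentCount += shipmentsPerOrder.get(o.orderId) ?? 0;
  }
  return { revenue, shipmentCount };
}
```

With this data the joined sum is 350 while the true revenue is 150, even though the one-to-many join that causes most of the inflation sits two hops below the measure.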
* feat(cli): add panel-completeness, time-series window, and text-encoded numeric SQL craft

Extend the analytics skill with three correctness habits and route the dialect-specific halves through sql_dialect_notes:

- Panel completeness (spec 10): full-domain spine -> LEFT JOIN -> COALESCE for "each/every/all/per" questions, defaulted by measure additivity.
- Time-series windows (spec 11): explicit cumulative frames, calendar-range rolling windows with minimum-periods guards, and period-over-period via LAG.
- Text-encoded numerics (spec 12): sample distinct values, strip/scale/cast in one early CTE, and confirm coverage with a failure-detecting cast.

Add per-dialect Series, Rolling window, and Safe cast notes to all seven dialect files so the skill stays dialect-agnostic while the engine-specific syntax lives in sql_dialect_notes. Tests updated and passing (19).

* docs(spider2-specs): add specs 10-12 for analytics SQL-craft additions

Refined specs and completion records for the panel-completeness spine (10), time-series window recipes (11), and text-encoded numeric parsing (12) implemented in the preceding commit.

* docs(spider2-specs): add backlog intake drafts 13-14

- 13: canonical authoritative-source measures
- 14: output-completeness final check

* skill(analytics): spec 14 output-completeness + iter1 (active column planning)

Bundles two changes (entangled in SKILL.md; future spider2 iterations land as separate commits):

- spec 14 (output-completeness): multi-part "answer every requested output" rule + a "Final completeness check" in workflow Step 6; analytics skill-content test updated; intake draft -> done/, refined spec added.
- iter1 experiment: spec 14's passive end-check did not change behavior on the benchmark's output-completeness failures, so (a) the Plan step now writes the exact output-column list UP FRONT as a contract the final SELECT must match, and (b) "expose identity" -> "project BOTH the entity id and its name" (covers both omission directions). All generic craft. Driven by the Spider 2.0-Lite failure analysis (incomplete output was the largest failure bucket); benchmark only as motivation. Co-Authored-By: Claude Opus 4.8 * skill(analytics): iter2 — deterministic order in string/array aggregation GROUP_CONCAT/string_agg/array_agg element order is undefined without an explicit ORDER BY; also note SQLite's default text sort is binary/case-sensitive (uppercase before lowercase) vs case-insensitive (COLLATE NOCASE). Generic SQLite craft. Spider 2.0-Lite motivation: an ordered-ingredient-list question failed only on the within-string element order (right elements, wrong order); benchmark as motivation only. Co-Authored-By: Claude Opus 4.8 * feat(mcp): structured, leveled logging for the MCP server Add one synchronous pino logger per MCP server process, written through the io.stderr sink: plain JSON when stderr is not a TTY, colorized pino-pretty (sync, in-process) when it is. Every tool call logs tool.start with its raw params BEFORE the handler runs and tool.end after (info / warn past KTX_MCP_SLOW_TOOL_MS / error), correlated by callId plus sessionId, so a runaway sql_execution leaves a recoverable start line with its exact SQL and no matching end. HTTP logs session.open/close and wires the previously-dead transport.onerror to transport.error; stdio routes its transport error through the logger. Level via KTX_MCP_LOG_LEVEL (default info). Existing mcp_request_completed telemetry and registerParsedTool are unchanged; no worker/async transport and no redaction in v1 (logs are local-only). 
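
To make the logging contract above concrete, here is a small sketch of two pieces: choosing the level for `tool.end`, and finding calls that logged `tool.start` but never `tool.end`. The `ToolLog` shape, the `event` field name, and the precedence of error over warn are assumptions made for illustration; only the event names, `callId`, and the info / warn / error split come from the commit text.

```typescript
type Level = "info" | "warn" | "error";

// Hypothetical parsed log line; the real pino records will carry more fields.
type ToolLog = { event: "tool.start" | "tool.end"; callId: string; tool: string };

// Level for tool.end: error on failure, warn past the slow threshold, else info.
// Treating failure as taking precedence over slowness is an assumption.
function toolEndLevel(durationMs: number, failed: boolean, slowToolMs: number): Level {
  if (failed) return "error";
  return durationMs > slowToolMs ? "warn" : "info";
}

// A call that never finished leaves a start line with no matching end.
function unfinishedCalls(lines: ToolLog[]): ToolLog[] {
  const ended = new Set<string>();
  for (const line of lines) if (line.event === "tool.end") ended.add(line.callId);
  return lines.filter((line) => line.event === "tool.start" && !ended.has(line.callId));
}
```

Because the start line is written before the handler runs, scanning for unmatched `callId` values is enough to recover the parameters of a call that hung or crashed the process.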
Implements spider2-specs/specs/15-mcp-server-structured-logging.md and moves the intake draft to done/.

* feat(mcp): report uptimeMs in MCP server /health

The /health endpoint now includes uptimeMs (monotonic elapsed time since the server started), mirroring the Python daemon's uptime_ms telemetry field.

* feat(cli): bound read-query execution with a per-connection deadline

Enforce one shared query deadline (default 30s, overridable per connection via query_timeout_ms) on every executeReadOnly path, so an accidentally-expensive LLM-authored query returns a fast "query exceeded Ns" KtxQueryError instead of hanging the MCP server.

- New shared contract context/connections/query-deadline.ts (resolveQueryDeadlineMs, queryDeadlineExceededError); query_timeout_ms added to the shared warehouse schema; BigQuery's job_timeout_ms removed.
- SQLite runs the read query in a short-lived forked child process and enforces the deadline with SIGKILL. worker_threads + terminate() was tried first but cannot interrupt a synchronous better-sqlite3 scan (the native loop never yields); SIGKILL reclaims the process in ~2ms and keeps the event loop free.
- Remote connectors apply a real server-side statement timeout and re-wrap their own timeout signal as KtxQueryError: Postgres statement_timeout/57014, MySQL max_execution_time/3024, Snowflake STATEMENT_TIMEOUT_IN_SECONDS/604, ClickHouse max_execution_time + aligned request_timeout/159, SQL Server requestTimeout/ETIMEOUT, BigQuery jobTimeoutMs.
- Relationship validation skips a candidate to review on a deadline timeout instead of aborting the pass; the deadline surfaces through the existing MCP pino logger as a matched tool.start/tool.end(error) pair (no new logging code).

Also fixes a pre-existing, unrelated invalid cast in mcp-server-factory.test.ts that was breaking tsc -p tsconfig.test.json.
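
The deadline resolution described in the bounded-query commit above might look roughly like the following. The function names, the `ConnectionConfig` shape, and the fallback for invalid values are assumptions for illustration; the real `resolveQueryDeadlineMs` and `queryDeadlineExceededError` signatures are not shown in this message.

```typescript
// Default from the commit text: 30 seconds.
const DEFAULT_QUERY_DEADLINE_MS = 30000;

// Hypothetical slice of a connection config; only query_timeout_ms is from the source.
type ConnectionConfig = { query_timeout_ms?: number };

// Use the per-connection override when it is a positive finite number,
// otherwise fall back to the shared default (the fallback rule is an assumption).
function sketchResolveDeadlineMs(config: ConnectionConfig): number {
  const value = config.query_timeout_ms;
  if (typeof value === "number" && isFinite(value) && value > 0) return value;
  return DEFAULT_QUERY_DEADLINE_MS;
}

// Build the user-facing "query exceeded Ns" message.
function sketchDeadlineMessage(deadlineMs: number): string {
  return `query exceeded ${deadlineMs / 1000}s`;
}
```

Resolving the deadline in one shared place lets each connector translate the same number into its own mechanism (a server-side statement timeout, or SIGKILL of the forked SQLite child).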
* docs(spider2-specs): mark spec 16 (bounded query execution) done Append Implementation notes to the refined spec (what shipped, where, and the worker-thread -> child-process+SIGKILL deviation with its evidence) and move the intake draft from todo/ to done/. * skill(analytics): iter3 — measure-as-amount, inter-event gap, top-per-metric career Three generic interpretation rules: a named business measure (sales/revenue/spend) means its amount not a row count; "inter-event duration/gap" is LAG/LEAD time-between events not a magnitude column; "highest across several achievements" aggregates per metric over the whole history. All three demonstrably FIRE (verified on local008/003/152 SQL). local008 flips to correct (mechanism-aligned). 003/152 still fail on a different axis (source-column / grouping). Generic craft; benchmark only as motivation. Co-Authored-By: Claude Opus 4.8 * skill(analytics): spine-for-extreme-selection + aggregate-over-selected-set Two generic answer-completeness refinements: - Selecting the extreme group (lowest/highest count over a period/category domain) must rank over the COMPLETE spine, not only groups with fact rows — an empty period is a genuine 0 and often the true minimum. - An aggregate scoped to a per-entity selected set ('avg revenue per actor in those top-3 films') is computed ACROSS that set, distinct from the per-item value; project both. Co-Authored-By: Claude Opus 4.8 (1M context) * skill(analytics): iter2 — sharpen extreme-selection spine + top-N ranking-measure - spine-for-extreme: concrete cue that a zero-row period never appears in a GROUP BY of the facts; generate the full calendar, LEFT JOIN, COALESCE, then rank. - aggregate-over-selected-set: top-N selection ranks by the named ranking measure (the item's own revenue), independent of the per-item share that feeds the aggregate. 
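
The spine-for-extreme-selection rule above can be illustrated with a small in-memory analogue of "generate the full calendar, LEFT JOIN, COALESCE, then rank". The data and helper names are invented for this sketch.

```typescript
type Fact = { month: string };
type Row = { month: string; count: number };

// Analogue of GROUP BY over the facts only: months with no rows never appear.
function countsFromFactsOnly(facts: Fact[]): Row[] {
  const rows: Row[] = [];
  for (const f of facts) {
    const existing = rows.find((r) => r.month === f.month);
    if (existing) existing.count += 1;
    else rows.push({ month: f.month, count: 1 });
  }
  return rows;
}

// Analogue of spine LEFT JOIN facts with COALESCE(count, 0): every month appears.
function countsOverSpine(spine: string[], facts: Fact[]): Row[] {
  const counts = new Map<string, number>();
  for (const f of facts) counts.set(f.month, (counts.get(f.month) ?? 0) + 1);
  return spine.map((month) => ({ month, count: counts.get(month) ?? 0 }));
}

// Pick the row with the smallest count (first one wins on ties).
function lowest(rows: Row[]): Row | undefined {
  let best: Row | undefined;
  for (const r of rows) if (best === undefined || r.count < best.count) best = r;
  return best;
}
```

With facts in January (3), February (2), and April (1), ranking the facts alone reports April as the lowest month, while ranking over the complete spine correctly reports March with a count of 0.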
Co-Authored-By: Claude Opus 4.8 (1M context)

* skill(analytics): iter3 — comparison-between-two-extremes is one wide row

Distinguishes a cross-item comparison ('the difference between the highest and lowest month' -> single wide row, both extremes side by side + the comparison column) from 'report a metric for each group' (-> stays long). Generic, question-derived; targets the wide-vs-long shape gap without affecting per-group long output.

Co-Authored-By: Claude Opus 4.8 (1M context)

* skill(analytics): iter4 — anchor a period bucket to the named lifecycle event

When a record carries multiple lifecycle timestamps (created/placed, approved, shipped, delivered, completed, settled) and the question counts/measures records in a named *completed state* by period ("delivered orders by month", "shipped items per week"), bucket the period by that named event's own timestamp, not the record-creation timestamp; the state value is the qualifying filter, the matching timestamp is the time anchor. Wording priority is explicit: purchased/placed/created/submitted/ordered keep the start-event timestamp, and a non-temporal state filter (counts by customer/city/seller with no period) introduces no anchor.

Generic analytics craft: counting completed-state records by their creation date silently answers "records that later reached that state, grouped by when they started" instead of the question asked. Surfaced via the spider2-autofix loop; FAIR_PRODUCT (adversary-screened, restatable from question wording + schema/semantic-layer lifecycle descriptions, no gold dependency).
Co-Authored-By: Claude Opus 4.8 (1M context) * skill(analytics): iter5 — canonicalize observed URL-path variants before page-level analysis When a question groups/filters/sequences web pages by a path/url column, sample its distinct values; if the data itself shows /route and /route/ variants for the same page context, canonicalize in an early CTE (preserve / as root, strip trailing slashes from non-root paths, map an observed empty path to / only when the column is a URL path with blank root-page events) and use the canonical path everywhere above. Explicitly forbids inventing aliases the data doesn't show: no merging different route names, no stripping query/fragment/host/scheme, no lowercasing, and no canonicalization when the question asks for raw URL/path or slash-vs-no-slash diffs. Generic web-analytics craft: raw request logs routinely store the same user-visible page with and without a trailing slash, so grouping raw labels silently splits one page into several. Surfaced via the spider2-autofix loop (Codex runner, round r2); FAIR_PRODUCT (adversary-screened, restatable from URL-path semantics + page-grain question wording + solver-observed distinct values, no gold dependency). The rule fired mechanism-aligned on both targets; flipped local330 (landing/exit page counts), local331 residual is a separate sequence-semantics axis beyond canonicalization. 
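
The canonicalization rule above is stated for SQL CTEs, but its logic is small enough to sketch directly. The function name and the `mapEmptyToRoot` flag are hypothetical; the behavior follows the commit text: keep `/` as root, strip trailing slashes from non-root paths, map an empty path to `/` only when explicitly justified, and leave case, query strings, and fragments alone.

```typescript
// Canonicalize an observed URL path without inventing aliases.
// `mapEmptyToRoot` should be true only when the column is a URL path
// and blank values are known to be root-page events.
function canonicalizePath(path: string, mapEmptyToRoot: boolean = false): string {
  if (path === "") return mapEmptyToRoot ? "/" : path;
  if (path === "/") return path; // preserve root
  const stripped = path.replace(/\/+$/, ""); // drop trailing slashes only
  return stripped === "" ? "/" : stripped; // a run of slashes is still root
}
```

Note that the function deliberately does nothing to `/Route` versus `/route` or to `/a?x=1`: merging those would be an alias the data has not been shown to support.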
Co-Authored-By: Claude Opus 4.8 (1M context) * skill(analytics): iter6 — coverage over a selected group is a set-membership aggregate When a question first selects a group of entities ("the top 5 actors", "these products") and then asks what count/share/percentage of a DIFFERENT subject domain relates to *these* selected entities ("what % of customers rented films featuring these actors"), the subject set is the UNION across the whole group: count DISTINCT subject ids once across the selected entities and return one collective value at the subject-domain grain — not one row per selected entity (which double-counts subjects related to more than one entity and answers a different question). Narrowly guarded: emit one row per entity only when the wording says "for each / per / by / list" or asks for each entity's own metric ("top 5 players and their batting averages"). The collective-coverage cousin of the existing per-entity selected-set rule. Generic analytics craft (per-entity metric vs set-level coverage). Surfaced via the spider2-autofix loop (Codex runner, round r3); FAIR_PRODUCT (adversary-screened, restatable from wording alone, no gold dependency). Flipped local195 mechanism-aligned (union COUNT(DISTINCT customer)/total, one scalar); 0 regression across 5 passing per-entity top-N guards (local023/024/029/212/221 stayed long). Co-Authored-By: Claude Opus 4.8 (1M context) * skill(analytics): label-only joins must LEFT JOIN — incomplete dims silently drop fact rows Mirror of the existing fan-out rule for the DROP direction: an inner JOIN to a dimension table used only to attach a display attribute silently discards every fact row whose key has no parent when the dimension is incomplete (trimmed catalogs, late-arriving / SCD-gap rows), shrinking counts/sums and the universe over which shares/averages/medians are computed. 
Guidance: LEFT JOIN pure enrichment; inner-join a dimension only when intended as a filter; key the aggregate/GROUP BY on the fact column, not the dimension column.

Spider2 autofix round 'joindim': flips complex_oracle local050 (FAIL->PASS, official scorer); the solver dropped the gratuitous products inner-join and recovered the exact gold. local060/063 also adopt LEFT JOIN (rule fires) but remain gold-convention-blocked. Guards local061/067 held.

Co-Authored-By: Claude Opus 4.8 (1M context)

* docs(spider2-specs): add todo/17 — lifecycle-event metrics (semantic-layer)

Draft intake spec surfaced by the spider2-autofix loop (round r1): the model-layer form of the shipped iter4 lifecycle-date-anchoring skill rule. It infers per-state lifecycle-event metrics (e.g. delivered_orders with defaultTimeDimension = the delivery timestamp) during enrichment so the correct time anchor is the default for any consumer, not only an agent that loaded the skill. Generic; FAIR_PRODUCT.

Co-Authored-By: Claude Opus 4.8 (1M context)

* fix(connectors): accept leading underscore in connection/identifier ids

The safe-identifier validator regex /^[a-zA-Z0-9][a-zA-Z0-9_-]*$/ allowed an underscore everywhere except the first character, so a connection id / database name that legitimately starts with '_' (valid in Snowflake, e.g. _1000_GENOMES) could never be ingested or queried. Allow a leading underscore across all 16 duplicated validators (connection ids, source ids, page/wiki keys, warehouse-verification tool schemas). Path-safety is unaffected: '.' and '/' remain excluded, and assertSafePathToken still blocks traversal.
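
A before/after comparison of the validator makes the change easy to check. The first pattern is quoted from the commit text; the second is an assumed form of the fix (adding `_` to the first character class) and may differ from the pattern that actually shipped.

```typescript
// Quoted from the commit message: underscore allowed everywhere but the first character.
const LEGACY_SAFE_ID = /^[a-zA-Z0-9][a-zA-Z0-9_-]*$/;

// Assumed shape of the relaxed pattern: a leading underscore is now accepted.
const RELAXED_SAFE_ID = /^[a-zA-Z0-9_][a-zA-Z0-9_-]*$/;

// Hypothetical helper name, for illustration only.
function isSafeIdentifier(id: string): boolean {
  return RELAXED_SAFE_ID.test(id);
}
```

Under this assumed pattern `_1000_GENOMES` passes, while anything containing `.` or `/` is still rejected, so path traversal strings such as `../x` remain invalid.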
Co-Authored-By: Claude Opus 4.8 (1M context)

* feat(analytics): generic geospatial query guidance

Add a Snowflake ST_* dialect note (ST_MAKEPOINT lon-first, ST_DWITHIN/ST_CONTAINS/ST_WITHIN/ST_INTERSECTS, bbox->polygon via ST_MAKEPOLYGON/ST_MAKELINE) and a dialect-agnostic 'Spatial predicates' recipe in the analytics skill (resolve the entity geometry, build an area-of-interest polygon, test with the engine's containment/proximity/overlap predicate; mind lon/lat argument order). Steers the solver off hand-rolled lat/lon BETWEEN boxes toward correct, index-assisted geospatial predicates.

Co-Authored-By: Claude Opus 4.8 (1M context)

* feat(analytics): parse code/dependency text by language grammar

Add two generic rules: (1) parse imported/required/loaded packages by the language or manifest format (Java import keep-package-path allowing underscores/mixed-case; Python import/from + alias stripping; R library/require; .ipynb parse JSON cell source before language rules; JSON manifests flatten the dependency object keys), stripping comments/prose and splitting multi-import lines; (2) on a de-duplicated table with a documented copy/occurrence count, choose COUNT(*) vs the weight column from the population the question names, not silently. Steers off one broad regex that drops valid identifiers and matches prose.

Co-Authored-By: Claude Opus 4.8 (1M context)

* feat(analytics): source filters/dates/measures from the owning fact grain

Add a rule for joined fact tables at different grains (parent order vs child line item): read each predicate, calendar bucket, and measure from the table whose grain the question names, not whichever is in scope post-join. An order-grain filter ("orders that are Complete", "the order's creation date") must come from the parent even though the child carries its own status/created_at; line price/cost come from the child.
Mirror at metric grain: don't combine a parent-grain count with child rows (num_of_item * SUM(line_price) per line) — aggregate each measure at its own grain before combining. Co-Authored-By: Claude Opus 4.8 (1M context) * feat(analytics): collapse multi-valued classes to one representative per entity before counting/concentration When an entity carries a multi-valued classification array (IPC/CPC codes, tags) and the methodology counts entities-per-class or a concentration/diversity metric (HHI, originality, share), pick ONE representative per entity first (the array's main/primary/first flag, else a defined fallback like most-frequent), then aggregate; and use COUNT(DISTINCT entity) when the denominator is defined as a count of entities. Unnesting the array otherwise multiplies an entity's weight by its code count, inflating per-class frequencies and skewing the ranking/score. Co-Authored-By: Claude Opus 4.8 (1M context) * feat(connectors): introspect BigQuery datasets hosted in foreign projects A dataset_ids/dataset_id entry may now be written `project.dataset` to introspect a dataset hosted in another project while query jobs still bill to credentials.project_id. Entries are parsed once at the config boundary into canonical {project, dataset} pairs; introspection, primary-key discovery, testConnection, getTableRowCount, and listTables (grouped per project) all resolve in the dataset's own project, and scanned tables are labeled with that project so sampling, distinct-value, and read queries resolve. Bare entries are unchanged. Implements spider2-specs/specs/18-bigquery-cross-project-datasets.md. * feat(scan): durable, resumable, bounded relationship detection during enrichment Move the enrichment persistence boundary to the cost boundary and bound the open-ended relationship stage (spec 19). 
- Checkpoint descriptions + embeddings into the queryable `_schema` manifest (and the raw enrichment artifacts) before relationship detection runs, via a new `onCheckpoint` hook + `writeLocalScanEnrichmentCheckpoint`. An interrupted, budget-truncated, or failed relationship stage now degrades to "no joins", never "no descriptions". - Resume the enrichment cache by content identity: re-key the SQLite stage store on `(connection_id, stage, input_hash)` so a re-run with a fresh runId resumes finished descriptions/embeddings instead of re-paying for LLM work. The disposable cache recreates its table if the on-disk key shape differs. - Make the relationship stage observable and bounded: a sticky wall-clock budget (`scan.relationships.detectionBudgetMs`, default 600000 ms) + per-unit progress + honored `ctx.signal`, threaded through profiling, validation, and composite detection. On exhaustion/abort it stops scheduling, finalizes, and returns a partial result instead of throwing or hanging. - Mark a budget/abort-truncated result partial (diagnostics `partial`/`partialReason` + recoverable `relationship_detection_partial` warning). A graceful partial saves as a completed stage and resumes cheaply; raising the budget changes inputHash and forces a fresh, fuller run. A process killed mid-stage saves nothing. Document `detectionBudgetMs` in the ktx.yaml reference. Append implementation notes to specs/19 and move the intake draft to done/. Also carries the in-tree per-table enrichment LLM timeout work it builds on (`description-generation.ts` + the `enrichment_timeout` warning code), which is intertwined in `local-enrichment.ts`/`types.ts` and cannot be split into a separately-building commit. * feat(scan): bound + retry the per-table enrichment LLM call The batched table-description call had no retry (sampleTable retried 3x, this did not), so a single transient backend error (e.g. 
an overloaded/burst rejection when many tables enrich concurrently) silently nulled a whole table's descriptions — observed dropping ~70% of a db's tables during a bad window despite ample quota. - Wrap generateObject in retryAsync (3 attempts + backoff; KTX_ENRICH_LLM_ATTEMPTS). - Fresh per-attempt timeout (KTX_ENRICH_LLM_TIMEOUT_MS, default 120s) still bounds a wedged wide table; a timeout is surfaced as KtxAbortedError so it is NOT retried (one wedge stays one timeout, not 3x). - Granular per-table progress + start/done/retry/timeout logging. Composes with spec 19 (its non-goal #1): spec 19 makes completed descriptions durable; this makes more of them complete. * feat(scan): survive a hung LLM enrichment backend and resume descriptions Two compounding failure modes on the per-table description-enrichment path (spec 20): Enforced per-table timeout for subprocess backends. The runtime declares whether it owns an SDK subprocess (subprocessForkSpec on KtxLlmRuntimePort); codex/claude-code calls run behind a ktx-owned detached child that is tree-killed (SIGKILL of the process group on POSIX, taskkill /T on Windows) on the deadline or ctx.signal, reaping the wedged model grandchild. HTTP backends keep native fetch abort. Default stays 120s, one-wedge-one-timeout. Incremental, resumable descriptions persistence. generateDescriptions flushes enriched tables per batch to an inputHash-tagged durable record (at a stable, non-syncId path) plus only the changed manifest shards, skips already-enriched tables on resume, and never lets one table's failure discard the stage (a skipped table costs one missing description, not the whole stage's output). Spec 20 refined + intake draft moved to done/. 
* feat(scan): selective enrichment stages (--stages) + per-stage cache keys Split the single coarse enrichment cache key into per-stage hashes (descriptions <- snapshot + LLM identity; embeddings <- snapshot + embedding identity + description digest; relationships <- snapshot + relationship settings + LLM identity), so changing one stage's inputs invalidates only that stage and never throws away the expensive per-table descriptions on an unrelated edit. Add `ktx ingest --stages ` to force-re-run a chosen subset on an already-ingested connection: a named stage bypasses the completed-stage short-circuit while the per-table descriptions resume record still skips already-enriched tables, and unselected stages are left untouched on disk. Feed embeddings + relationships their description context from the on-disk _schema when descriptions do not run this invocation, and carry descriptions into the llmProposals evidence packet (closing a latent gap on the full-run path too). Surface an enrichment_stage_stale warning when an unselected stage's inputs have drifted, rather than silently cascading the work. Implements spider2-specs/specs/21-selective-enrichment-stages.md. * test(analytics): realign SKILL.md acceptance test with the evolved skill Three assertions in analytics-skill-content.test.ts drifted from the analytics SKILL.md as later iterations edited the skill without updating the test: - the sub-heading was renamed Window functions -> Ordering & aggregation determinism (iter2), so follow the source name; - the rule "Expose identity, not just the label" was renamed to "Project BOTH identity and label" (spec 14), so match the new wording; - the dialect-FQTN guard false-positived on the Java package example com.planet_ink.coffee_mud, whose backticks made a 3-segment package path read as a BigQuery/Snowflake `a.b.c` table reference. Drop the backticks so the guard stays at full strength without weakening it. 
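
The per-stage cache keys from the selective-enrichment commit above can be sketched as follows. The `StageInputs` field names and both helpers are hypothetical, and a real implementation would presumably hash the inputs; a JSON string stands in here to keep the example dependency-free. Only the mapping of inputs to stages is taken from the commit text.

```typescript
// Hypothetical input bundle; field names are illustrative.
type StageInputs = {
  snapshotHash: string;
  llmIdentity: string;
  embeddingIdentity: string;
  descriptionDigest: string;
  relationshipSettings: string;
};

type Stage = "descriptions" | "embeddings" | "relationships";

// Each stage's key covers only the inputs that stage depends on.
function stageKey(stage: Stage, inputs: StageInputs): string {
  switch (stage) {
    case "descriptions":
      return JSON.stringify([stage, inputs.snapshotHash, inputs.llmIdentity]);
    case "embeddings":
      return JSON.stringify([stage, inputs.snapshotHash, inputs.embeddingIdentity, inputs.descriptionDigest]);
    case "relationships":
      return JSON.stringify([stage, inputs.snapshotHash, inputs.relationshipSettings, inputs.llmIdentity]);
  }
}

// List the stages whose keys change between two input bundles.
function staleStages(before: StageInputs, after: StageInputs): Stage[] {
  const stages: Stage[] = ["descriptions", "embeddings", "relationships"];
  return stages.filter((stage) => stageKey(stage, before) !== stageKey(stage, after));
}
```

Under this split, swapping the embedding model invalidates only the embeddings stage, so the expensive per-table descriptions are kept.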
* fix(scan): --stages subset must not delete unselected stages' on-disk artifacts A --stages subset that omitted descriptions wiped all on-disk ai/db descriptions from the written _schema. runLocalScan writes the structural manifest shard from the bare snapshot BEFORE enrichment runs, and the shard merge treats ai/db as scan-managed and overwrites them with whatever the run emits — none, on a subset that skips descriptions. Enrichment then read the already-wiped shard via loadPriorDescriptions and had nothing to restore. runLocalScanEnrichment now returns the best-available descriptions (fresh-this-run if descriptions ran, else loaded from the on-disk _schema) instead of [], and runLocalScan captures the prior descriptions before the structural write and feeds them to both the structural write and enrichment, so an unselected stage's artifacts survive. Joins were already preserved for --stages descriptions via the manual/inferred preservedJoins path. Tests: a full runLocalScan --stages relationships path test (RED without the fix, GREEN with it — the earlier unit test missed the structural-pre-write ordering), plus enrichment-layer contract tests for both directions. Validated live on northwind: --stages relationships keeps all 110 descriptions + 22 joins (was wiping to 0); --stages descriptions restores descriptions from the spec-20 resume record (no LLM calls) while keeping joins. * feat(dialects): bigquery nested-data (ARRAY/STRUCT/UNNEST), geospatial (GEOGRAPHY), SAFE_DIVIDE bigquery.md lacked the two sections that define BigQuery analytics (present in snowflake.md): - Nested & repeated data: UNNEST to flatten arrays of STRUCTs (GA360 hits, GA4 event_params), dot-notation field access, key-value param scalar-subquery extraction, fan-out/COUNT(DISTINCT) guard. - Geospatial (GEOGRAPHY): ST_GEOGPOINT (lon-first), containment/proximity/distance/intersection predicates, areal allocation via ST_AREA(ST_INTERSECTION()). 
- SAFE_DIVIDE for zero-denominator-safe rates; sharded-table shard-presence note. Generic BigQuery craft surfaced by sql_dialect_notes; product-completeness (any BQ analyst benefits). * feat(dialects): sqlite ROUND half-up FP-underflow note (+1e-9 before ROUND) SQLite ROUND(x,n) rounds half-away-from-zero, but binary FP stores an exact half-way value just below it, so ROUND(6.475,2) returns 6.47 not 6.48. Add a dialect note: nudge by a tiny epsilon (1e-9) below display precision before rounding for deterministic half-up, leaving non-boundary values unchanged. Generic SQLite craft surfaced by sql_dialect_notes (any analyst rounding a displayed average/rate/price benefits). Co-Authored-By: Claude Opus 4.8 (1M context) * docs(analytics): list-as-delimited-string, answer-literally, drop free-text columns Add SKILL.md guidance to emit list-valued answer cells as delimited STRING (not ARRAY/repeated column), answer the literal ask without unrequested transformations (HAVING for aggregate bounds), and avoid projecting unrequested free-text columns that corrupt row-delimited output. * fix(scan,mcp): gitignore runtime logs, budget-guard LLM proposal, validate enrich timeout - gitignore `.ktx/logs/` in both scaffold + setup-merge lists: the managed MCP daemon writes raw tool params (SQL, memory_ingest content) to mcp.log under a version-controlled `.ktx/`, and snowflake.log already sat there unprotected. - gate the LLM relationship proposal on the detection budget/abort signal so an exhausted or aborted stage cannot start a fresh LLM call; document the boundary. - validate KTX_ENRICH_LLM_TIMEOUT_MS (NaN/0 → 120s default) like enrichAttempts, so a bad value no longer times out every table immediately. - daemon introspection now warns on malformed column/FK rows instead of dropping them silently, matching the table-row path and the "surface broken objects" goal. 
- docs: document `ktx wiki -c/--connection`; fix the SQLite query-deadline schema doc (forked-subprocess SIGKILL, not worker-thread termination). * fix(scan,wiki,mcp): address PR #312 review findings - scan: key the description pipeline (resume map, enriched-schema and embedding-text lookups, manifest write/read) by full table identity via tableRefKey/buildTableRef, so two same-named tables in different schemas no longer cross-assign descriptions or skip a sibling on resume - scan: re-throw a genuine context cancel during the batched description LLM call so Ctrl-C resumes the stage instead of nulling tables and recording it completed; per-table timeouts still degrade (context.signal not aborted) - scan: report statisticalValidation 'skipped' (not 'completed') when a budget/abort stop leaves relationship profiling partial - wiki: sync the full page corpus into the sqlite index and filter only the candidate/result set, so a connection-scoped search no longer prunes other connections' pages and cached embeddings from the shared index - wiki: route verbatim ingest through the canonical writePageAndSync so contentHash is set and later syncs can short-circuit - mcp: drop the as-unknown-as cast in serializeMcpError - dialects/analytics: document the integer-division trap on postgres/sqlite/tsql Adds regression tests for each behavior change. * fix(wiki): scope connection filter before SQLite lane limit Connection-scoped wiki search applied the connectionId allowlist after the lexical/semantic lanes had already truncated to laneCandidatePoolLimit over the full (connection-agnostic) corpus. When the requested connection was a minority of a large corpus, its pages were crowded out of the candidate pool before filtering, so a semantic-only match could be missed outright and lexical hits under-ranked. 
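The crowding-out described above can be reproduced in a few lines. This is a minimal in-memory sketch, not the real index code: the `Candidate` shape, the `limitThenFilter`/`filterThenLimit` helpers, and the sample paths are hypothetical names chosen for illustration, and the real lanes run as SQLite queries rather than array operations.

```typescript
// Hypothetical candidate shape for illustration; not the real ktx index types.
interface Candidate {
  path: string;
  connectionId: string | null; // null = unscoped page (applies to every connection)
  score: number;
}

// A page is in scope when it is unscoped or tagged with the requested connection.
const inScope = (c: Candidate, connectionId: string): boolean =>
  c.connectionId === null || c.connectionId === connectionId;

const byScoreDesc = (a: Candidate, b: Candidate): number => b.score - a.score;

// Problem ordering: truncate over the whole corpus, then filter by connection.
function limitThenFilter(corpus: Candidate[], connectionId: string, limit: number): Candidate[] {
  return [...corpus]
    .sort(byScoreDesc)
    .slice(0, limit)
    .filter((c) => inScope(c, connectionId));
}

// Fixed ordering: filter first, so the limit only counts in-scope rows.
function filterThenLimit(corpus: Candidate[], connectionId: string, limit: number): Candidate[] {
  return corpus
    .filter((c) => inScope(c, connectionId))
    .sort(byScoreDesc)
    .slice(0, limit);
}

// The "small" connection is a minority whose only page scores below the limit cutoff.
const corpus: Candidate[] = [
  { path: "wiki/a.md", connectionId: "big", score: 0.9 },
  { path: "wiki/b.md", connectionId: "big", score: 0.8 },
  { path: "wiki/c.md", connectionId: "small", score: 0.7 },
];

// Empty: both slots in the pool went to the other connection's pages.
console.log(limitThenFilter(corpus, "small", 2).map((c) => c.path));
// Contains only wiki/c.md: the in-scope page survives the limit.
console.log(filterThenLimit(corpus, "small", 2).map((c) => c.path));
```

With a pool limit of 2, the first ordering returns nothing for `small` even though a matching page exists; the second returns it.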
Push the path allowlist into searchLexicalCandidates/searchSemanticCandidates so LIMIT applies to in-scope rows, matching what the token lane already did, and drop the now-redundant post-limit JS filters. --------- Co-authored-by: Claude Opus 4.8 --- .../content/docs/cli-reference/ktx-ingest.mdx | 76 ++ .../content/docs/cli-reference/ktx-setup.mdx | 7 + .../content/docs/cli-reference/ktx-wiki.mdx | 13 + .../content/docs/configuration/ktx-yaml.mdx | 26 +- .../content/docs/guides/writing-context.mdx | 17 + .../docs/integrations/primary-sources.mdx | 35 +- knip.json | 4 +- packages/cli/package.json | 2 + packages/cli/scripts/copy-runtime-assets.mjs | 7 + packages/cli/src/cli-program.ts | 2 +- packages/cli/src/commands/ingest-commands.ts | 54 +- .../cli/src/commands/knowledge-commands.ts | 4 + packages/cli/src/connection-drivers.ts | 12 +- .../cli/src/connectors/bigquery/connector.ts | 267 ++++-- .../src/connectors/clickhouse/connector.ts | 57 +- .../cli/src/connectors/mysql/connector.ts | 16 + .../cli/src/connectors/postgres/connector.ts | 19 +- .../cli/src/connectors/snowflake/connector.ts | 35 +- .../cli/src/connectors/sqlite/connector.ts | 166 +++- .../src/connectors/sqlite/read-query-child.ts | 40 + .../cli/src/connectors/sqlserver/connector.ts | 21 +- packages/cli/src/context-build-view.ts | 2 + .../connections/bigquery-identifiers.ts | 8 + .../connections/configured-connections.ts | 24 + .../src/context/connections/query-deadline.ts | 45 + .../live-database/daemon-introspection.ts | 26 +- .../adapters/live-database/fetch-report.ts | 48 + .../live-database/live-database.adapter.ts | 14 +- .../ingest/adapters/live-database/manifest.ts | 12 +- .../adapters/live-database/scan-outcome.ts | 55 ++ .../ingest/adapters/live-database/stage.ts | 10 +- .../context/ingest/adapters/metabase/types.ts | 2 +- .../context/ingest/ingest-bundle.runner.ts | 1 + .../context/ingest/local-bundle-runtime.ts | 4 +- .../cli/src/context/ingest/local-ingest.ts | 2 +- 
.../src/context/ingest/local-stage-ingest.ts | 9 +- .../context/ingest/stages/build-wu-context.ts | 4 +- .../discover-data.tool.ts | 2 +- .../entity-details.tool.ts | 2 +- .../sql-execution.tool.ts | 2 +- .../cli/src/context/llm/ai-sdk-runtime.ts | 6 + .../src/context/llm/claude-code-runtime.ts | 62 +- packages/cli/src/context/llm/codex-runtime.ts | 109 ++- packages/cli/src/context/llm/runtime-port.ts | 26 + .../llm/subprocess-generate-object-child.ts | 39 + .../context/llm/subprocess-generate-object.ts | 152 +++ packages/cli/src/context/mcp/context-tools.ts | 163 +++- .../src/context/mcp/local-project-ports.ts | 33 + packages/cli/src/context/mcp/logger.ts | 58 ++ packages/cli/src/context/mcp/server.ts | 2 + packages/cli/src/context/mcp/types.ts | 18 +- .../context/memory/memory-agent.service.ts | 2 +- packages/cli/src/context/project/config.ts | 5 + .../cli/src/context/project/driver-schemas.ts | 10 +- packages/cli/src/context/project/project.ts | 2 +- .../cli/src/context/project/setup-config.ts | 1 + .../context/scan/description-generation.ts | 143 ++- .../cli/src/context/scan/enabled-tables.ts | 15 +- .../cli/src/context/scan/enrichment-state.ts | 85 +- .../scan/local-enrichment-artifacts.ts | 247 ++++- .../cli/src/context/scan/local-enrichment.ts | 518 ++++++++-- packages/cli/src/context/scan/local-scan.ts | 92 +- .../scan/local-structural-artifacts.ts | 6 + .../src/context/scan/object-introspection.ts | 50 + .../scan/relationship-composite-candidates.ts | 19 +- .../scan/relationship-detection-budget.ts | 93 ++ .../context/scan/relationship-diagnostics.ts | 5 + .../context/scan/relationship-discovery.ts | 56 +- .../context/scan/relationship-llm-proposal.ts | 14 +- .../context/scan/relationship-profiling.ts | 22 +- .../context/scan/relationship-validation.ts | 102 +- .../sqlite-local-enrichment-state-store.ts | 75 +- packages/cli/src/context/scan/types.ts | 7 +- .../context/sl/pglite-sl-search-prototype.ts | 2 +- .../src/context/sl/semantic-layer.service.ts | 2 
+- packages/cli/src/context/sl/source-files.ts | 2 +- .../context/sl/tools/connection-id-schema.ts | 2 +- .../src/context/sql-analysis/dialect-notes.ts | 49 + .../context/sql-analysis/dialects/bigquery.md | 13 + .../sql-analysis/dialects/clickhouse.md | 9 + .../context/sql-analysis/dialects/mysql.md | 9 + .../context/sql-analysis/dialects/postgres.md | 10 + .../sql-analysis/dialects/snowflake.md | 10 + .../context/sql-analysis/dialects/sqlite.md | 11 + .../src/context/sql-analysis/dialects/tsql.md | 10 + packages/cli/src/context/wiki/keys.ts | 2 +- .../context/wiki/knowledge-wiki.service.ts | 16 +- .../cli/src/context/wiki/local-knowledge.ts | 91 +- .../context/wiki/sqlite-knowledge-index.ts | 44 +- .../src/context/wiki/tools/wiki-write.tool.ts | 34 + packages/cli/src/context/wiki/types.ts | 6 + packages/cli/src/knowledge.ts | 26 +- packages/cli/src/mcp-http-server.ts | 19 +- packages/cli/src/mcp-server-factory.ts | 21 +- packages/cli/src/mcp-stdio-server.ts | 12 +- packages/cli/src/notion-page-picker.ts | 2 +- .../prompts/memory_agent_external_ingest.md | 2 + packages/cli/src/public-ingest.ts | 6 +- packages/cli/src/scan.ts | 17 +- packages/cli/src/setup-databases.ts | 2 +- packages/cli/src/skills/analytics/SKILL.md | 202 +++- packages/cli/src/skills/wiki_capture/SKILL.md | 24 + packages/cli/src/status-project.ts | 60 +- packages/cli/src/text-ingest.ts | 97 +- packages/cli/src/verbatim-ingest.ts | 308 ++++++ .../cli/test/commands/ingest-commands.test.ts | 117 +++ .../connectors/bigquery/connector.test.ts | 187 +++- .../connectors/clickhouse/connector.test.ts | 38 + .../test/connectors/mysql/connector.test.ts | 50 + .../connectors/postgres/connector.test.ts | 58 +- .../connectors/snowflake/connector.test.ts | 52 + .../test/connectors/sqlite/connector.test.ts | 174 +++- .../connectors/sqlserver/connector.test.ts | 47 + .../configured-connections.test.ts | 26 + .../connections/query-deadline.test.ts | 36 + .../query-history-filter-picker.test.ts | 1 + 
.../daemon-introspection.test.ts | 33 + .../live-database.adapter.test.ts | 112 ++- .../adapters/live-database/manifest.test.ts | 4 +- .../live-database/scan-outcome.test.ts | 65 ++ .../ingest/local-bundle-runtime.test.ts | 1 + .../cli/test/context/llm/local-config.test.ts | 3 + .../llm/subprocess-generate-object.test.ts | 138 +++ .../subprocess-test-children.test-utils.ts | 45 + .../mcp/__snapshots__/mcp-tools-list.json | 56 +- .../test/context/mcp/dialect-notes.test.ts | 111 +++ .../context/mcp/local-project-ports.test.ts | 43 + packages/cli/test/context/mcp/logger.test.ts | 99 ++ packages/cli/test/context/mcp/server.test.ts | 209 ++++ .../cli/test/context/project/config.test.ts | 22 + .../test/context/project/setup-config.test.ts | 4 +- .../scan/description-generation.test.ts | 133 ++- .../context/scan/description-resume.test.ts | 264 +++++ .../test/context/scan/enabled-tables.test.ts | 24 + .../context/scan/enrichment-state.test.ts | 166 +++- .../scan/local-enrichment-artifacts.test.ts | 90 +- .../context/scan/local-enrichment.test.ts | 899 +++++++++++++++++- .../cli/test/context/scan/local-scan.test.ts | 247 +++++ .../context/scan/object-introspection.test.ts | 47 + .../relationship-detection-budget.test.ts | 72 ++ .../scan/relationship-diagnostics.test.ts | 20 + .../scan/relationship-discovery.test.ts | 121 +++ .../scan/relationship-llm-proposal.test.ts | 2 + .../scan/relationship-validation.test.ts | 49 + .../test/context/wiki/local-knowledge.test.ts | 199 ++++ .../wiki/sqlite-knowledge-index.test.ts | 43 + .../wiki/tools/wiki-write.tool.test.ts | 102 ++ packages/cli/test/index.test.ts | 27 + packages/cli/test/knowledge.test.ts | 115 ++- .../cli/test/local-scan-connectors.test.ts | 4 +- packages/cli/test/mcp-http-server.test.ts | 82 +- packages/cli/test/mcp-server-factory.test.ts | 23 +- packages/cli/test/mcp-stdio-server.test.ts | 53 ++ .../skills/analytics-skill-content.test.ts | 146 +++ packages/cli/test/status-project.test.ts | 145 ++- 
.../test/telemetry/project-snapshot.test.ts | 1 + packages/cli/test/text-ingest.test.ts | 118 +++ packages/cli/test/verbatim-ingest.test.ts | 375 ++++++++ pnpm-lock.yaml | 145 +++ .../src/ktx_daemon/database_introspection.py | 129 ++- .../tests/test_database_introspection.py | 118 +++ spider2-specs/README.md | 62 ++ spider2-specs/done/.gitkeep | 0 .../done/01-connection-scoped-wiki.md | 74 ++ spider2-specs/done/02-verbatim-ingest-mode.md | 71 ++ .../done/06-scan-tolerate-broken-objects.md | 63 ++ .../done/07-analytics-skill-sql-craft.md | 112 +++ .../done/08-per-dialect-sql-syntax-notes.md | 83 ++ .../09-fan-out-safe-multi-hop-aggregation.md | 150 +++ .../done/10-panel-completeness-spine.md | 65 ++ .../done/11-time-series-window-recipes.md | 73 ++ .../done/12-parse-text-encoded-numbers.md | 61 ++ .../14-output-completeness-final-check.md | 105 ++ .../done/15-mcp-server-structured-logging.md | 116 +++ .../16-bounded-query-execution-timeout.md | 131 +++ .../18-bigquery-cross-project-datasets.md | 68 ++ ...-durable-bounded-relationship-detection.md | 89 ++ .../20-resilient-enrichment-under-slow-llm.md | 101 ++ .../done/21-selective-enrichment-stages.md | 91 ++ .../specs/01-connection-scoped-wiki.md | 300 ++++++ .../specs/02-verbatim-ingest-mode.md | 327 +++++++ .../specs/06-scan-tolerate-broken-objects.md | 361 +++++++ .../specs/07-analytics-skill-sql-craft.md | 363 +++++++ .../specs/08-per-dialect-sql-syntax-notes.md | 395 ++++++++ .../09-fan-out-safe-multi-hop-aggregation.md | 362 +++++++ .../specs/10-panel-completeness-spine.md | 289 ++++++ .../specs/11-time-series-window-recipes.md | 391 ++++++++ .../specs/12-parse-text-encoded-numbers.md | 405 ++++++++ .../14-output-completeness-final-check.md | 336 +++++++ .../specs/15-mcp-server-structured-logging.md | 405 ++++++++ .../16-bounded-query-execution-timeout.md | 493 ++++++++++ .../18-bigquery-cross-project-datasets.md | 418 ++++++++ ...-durable-bounded-relationship-detection.md | 471 +++++++++ 
.../20-resilient-enrichment-under-slow-llm.md | 533 +++++++++++ .../specs/21-selective-enrichment-stages.md | 567 +++++++++++ ...i-connection-routing-in-analytics-skill.md | 66 ++ .../todo/04-offline-schema-docs-adapter.md | 51 + .../todo/05-composite-key-join-detection.md | 59 ++ ...canonical-authoritative-source-measures.md | 89 ++ .../todo/17-lifecycle-event-metrics.md | 57 ++ 200 files changed, 17780 insertions(+), 672 deletions(-) create mode 100644 packages/cli/src/connectors/sqlite/read-query-child.ts create mode 100644 packages/cli/src/context/connections/configured-connections.ts create mode 100644 packages/cli/src/context/connections/query-deadline.ts create mode 100644 packages/cli/src/context/ingest/adapters/live-database/fetch-report.ts create mode 100644 packages/cli/src/context/ingest/adapters/live-database/scan-outcome.ts create mode 100644 packages/cli/src/context/llm/subprocess-generate-object-child.ts create mode 100644 packages/cli/src/context/llm/subprocess-generate-object.ts create mode 100644 packages/cli/src/context/mcp/logger.ts create mode 100644 packages/cli/src/context/scan/object-introspection.ts create mode 100644 packages/cli/src/context/scan/relationship-detection-budget.ts create mode 100644 packages/cli/src/context/sql-analysis/dialect-notes.ts create mode 100644 packages/cli/src/context/sql-analysis/dialects/bigquery.md create mode 100644 packages/cli/src/context/sql-analysis/dialects/clickhouse.md create mode 100644 packages/cli/src/context/sql-analysis/dialects/mysql.md create mode 100644 packages/cli/src/context/sql-analysis/dialects/postgres.md create mode 100644 packages/cli/src/context/sql-analysis/dialects/snowflake.md create mode 100644 packages/cli/src/context/sql-analysis/dialects/sqlite.md create mode 100644 packages/cli/src/context/sql-analysis/dialects/tsql.md create mode 100644 packages/cli/src/verbatim-ingest.ts create mode 100644 packages/cli/test/commands/ingest-commands.test.ts create mode 100644 
packages/cli/test/context/connections/configured-connections.test.ts create mode 100644 packages/cli/test/context/connections/query-deadline.test.ts create mode 100644 packages/cli/test/context/ingest/adapters/live-database/scan-outcome.test.ts create mode 100644 packages/cli/test/context/llm/subprocess-generate-object.test.ts create mode 100644 packages/cli/test/context/llm/subprocess-test-children.test-utils.ts create mode 100644 packages/cli/test/context/mcp/dialect-notes.test.ts create mode 100644 packages/cli/test/context/mcp/logger.test.ts create mode 100644 packages/cli/test/context/scan/description-resume.test.ts create mode 100644 packages/cli/test/context/scan/enabled-tables.test.ts create mode 100644 packages/cli/test/context/scan/object-introspection.test.ts create mode 100644 packages/cli/test/context/scan/relationship-detection-budget.test.ts create mode 100644 packages/cli/test/mcp-stdio-server.test.ts create mode 100644 packages/cli/test/skills/analytics-skill-content.test.ts create mode 100644 packages/cli/test/verbatim-ingest.test.ts create mode 100644 spider2-specs/README.md create mode 100644 spider2-specs/done/.gitkeep create mode 100644 spider2-specs/done/01-connection-scoped-wiki.md create mode 100644 spider2-specs/done/02-verbatim-ingest-mode.md create mode 100644 spider2-specs/done/06-scan-tolerate-broken-objects.md create mode 100644 spider2-specs/done/07-analytics-skill-sql-craft.md create mode 100644 spider2-specs/done/08-per-dialect-sql-syntax-notes.md create mode 100644 spider2-specs/done/09-fan-out-safe-multi-hop-aggregation.md create mode 100644 spider2-specs/done/10-panel-completeness-spine.md create mode 100644 spider2-specs/done/11-time-series-window-recipes.md create mode 100644 spider2-specs/done/12-parse-text-encoded-numbers.md create mode 100644 spider2-specs/done/14-output-completeness-final-check.md create mode 100644 spider2-specs/done/15-mcp-server-structured-logging.md create mode 100644 
spider2-specs/done/16-bounded-query-execution-timeout.md create mode 100644 spider2-specs/done/18-bigquery-cross-project-datasets.md create mode 100644 spider2-specs/done/19-durable-bounded-relationship-detection.md create mode 100644 spider2-specs/done/20-resilient-enrichment-under-slow-llm.md create mode 100644 spider2-specs/done/21-selective-enrichment-stages.md create mode 100644 spider2-specs/specs/01-connection-scoped-wiki.md create mode 100644 spider2-specs/specs/02-verbatim-ingest-mode.md create mode 100644 spider2-specs/specs/06-scan-tolerate-broken-objects.md create mode 100644 spider2-specs/specs/07-analytics-skill-sql-craft.md create mode 100644 spider2-specs/specs/08-per-dialect-sql-syntax-notes.md create mode 100644 spider2-specs/specs/09-fan-out-safe-multi-hop-aggregation.md create mode 100644 spider2-specs/specs/10-panel-completeness-spine.md create mode 100644 spider2-specs/specs/11-time-series-window-recipes.md create mode 100644 spider2-specs/specs/12-parse-text-encoded-numbers.md create mode 100644 spider2-specs/specs/14-output-completeness-final-check.md create mode 100644 spider2-specs/specs/15-mcp-server-structured-logging.md create mode 100644 spider2-specs/specs/16-bounded-query-execution-timeout.md create mode 100644 spider2-specs/specs/18-bigquery-cross-project-datasets.md create mode 100644 spider2-specs/specs/19-durable-bounded-relationship-detection.md create mode 100644 spider2-specs/specs/20-resilient-enrichment-under-slow-llm.md create mode 100644 spider2-specs/specs/21-selective-enrichment-stages.md create mode 100644 spider2-specs/todo/03-multi-connection-routing-in-analytics-skill.md create mode 100644 spider2-specs/todo/04-offline-schema-docs-adapter.md create mode 100644 spider2-specs/todo/05-composite-key-join-detection.md create mode 100644 spider2-specs/todo/13-canonical-authoritative-source-measures.md create mode 100644 spider2-specs/todo/17-lifecycle-event-metrics.md diff --git 
a/docs-site/content/docs/cli-reference/ktx-ingest.mdx b/docs-site/content/docs/cli-reference/ktx-ingest.mdx index ab3d231d..7c87f14f 100644 --- a/docs-site/content/docs/cli-reference/ktx-ingest.mdx +++ b/docs-site/content/docs/cli-reference/ktx-ingest.mdx @@ -34,8 +34,10 @@ connection is selected. | `--query-history` | Include database query-history usage patterns | Stored connection default | | `--no-query-history` | Skip database query-history usage patterns for this run | Stored connection default | | `--query-history-window-days ` | BigQuery/Snowflake query-history lookback window for this run | Stored connection default | +| `--stages ` | Comma-separated enrichment stages to (re)run: `descriptions`, `embeddings`, `relationships` | All three | | `--text ` | Capture inline text into **ktx** memory; repeatable | `[]` | | `--file ` | Capture a text file into **ktx** memory; use `-` for stdin; repeatable | `[]` | +| `--verbatim` | Store each `--text`/`--file` document body unchanged as a `GLOBAL` wiki page; the LLM derives metadata only | `false` | | `--connection-id ` | **ktx** connection id to tag captured text/file notes | - | | `--user-id ` | Memory user id for text/file capture attribution | `local-cli` | | `--fail-fast` | Stop after the first failed text/file item | `false` | @@ -63,6 +65,65 @@ use `--no-input` to fail fast with install guidance. `--text` and `--file` cannot be combined with a positional `connectionId` or `--all`; pass `--connection-id ` instead to tag captured notes. +### Verbatim ingest + +By default, captured text is routed through the memory agent, which decides what +to persist and may rewrite, condense, split, or re-title it. For *authoritative* +documents — metric definitions, formula specs, runbooks, compliance text — that +paraphrasing is a defect. Add `--verbatim` to store each `--text`/`--file` +document body **unchanged** as a `GLOBAL` wiki page: + +- The stored body is the input document, written by code; the LLM never edits it. 
+ It is used only to derive page metadata (`summary`, `tags`, `sl_refs`), and even + that is skipped for fields the document's own frontmatter already sets. +- The page key is deterministic: a `--file` derives it from the filename, inline + `--text` from the document's leading Markdown heading (inline text without a + heading is rejected — pass it as `--file` instead). +- Ingest is idempotent. Re-running the same document is a safe no-op; a different + body at the same key fails loudly rather than overwriting. +- `--verbatim` works with `llm.provider.backend: none` — the only ingest path that + does. With no backend the `summary` is derived from the heading or first + sentence and `tags`/`sl_refs` are left empty; the full body is still stored. +- Existing frontmatter passes through untouched (including fields **ktx** does not + model, such as `effective_date` or `version`); generated metadata only fills + absent fields. `--connection-id ` scopes the page to that connection by + setting its `connections` frontmatter. + +### Selecting enrichment stages + +Database enrichment runs three stages: `descriptions` (one LLM call per table), +`embeddings` (vectors over the schema and descriptions), and `relationships` +(join detection, optionally LLM-proposed). Each stage is cached on a **per-stage +hash of only its own inputs**, so changing one stage's inputs invalidates only +that stage. Switching the description LLM re-runs only `descriptions`; upgrading +the embeddings model re-runs only `embeddings`; turning on +`scan.relationships.llmProposals` re-runs only `relationships`. The expensive +per-table descriptions are never thrown away because an unrelated setting moved. + +`--stages ` re-runs a chosen subset on an already-ingested connection. 
A +named stage is **force-recomputed** (it bypasses the completed-stage cache), +while unselected stages are left exactly as they are on disk: + +- `ktx ingest warehouse --stages embeddings` — re-embed on a new model, keeping + descriptions and joins. +- `ktx ingest --all --stages relationships --no-query-history` — backfill joins + across every database after enabling `llmProposals`, without re-paying for + descriptions. +- `ktx ingest warehouse --stages descriptions` — re-run thin descriptions (for + example after raising `KTX_ENRICH_LLM_TIMEOUT_MS`). When nothing the + descriptions depend on changed, the per-table resume record means only the + tables that previously failed are re-sent to the LLM. + +Stage names are validated: an unknown or empty name (`--stages foo`, `--stages +descriptions,foo`, `--stages ""`) is a hard parse error. Naming all three +(`--stages descriptions,embeddings,relationships`) forces a full enrichment +recompute, which is **not** the same as omitting the flag (omitting resumes +whatever is already done). After a selective run, **ktx** warns +(`enrichment_stage_stale`) when an unselected stage's inputs no longer match what +it was last built from — for example, re-running `descriptions` flags +`embeddings` as stale until you re-run `--stages embeddings`. The warning is +informational; **ktx** never silently cascades the extra work. 
+ ## Examples ```bash @@ -77,6 +138,11 @@ ktx ingest warehouse --query-history # Set the lookback window for BigQuery or Snowflake query history ktx ingest warehouse --query-history-window-days 30 +# Re-embed one connection on a new embeddings model (descriptions/joins untouched) +ktx ingest warehouse --stages embeddings +# Backfill LLM-proposed joins across every database without re-describing +ktx ingest --all --stages relationships --no-query-history + # Build a context-source connection ktx ingest notion @@ -91,6 +157,12 @@ ktx ingest --file docs/revenue-notes.md --connection-id warehouse # Capture one stdin item printf "Refunds are excluded from net revenue." | ktx ingest --file - + +# Store an authoritative document verbatim (body preserved exactly) +ktx ingest --file docs/rfm-bucket-definitions.md --verbatim + +# Store it verbatim and scope it to one connection +ktx ingest --file docs/haversine-formula.md --verbatim --connection-id warehouse ``` ## Output @@ -191,3 +263,7 @@ according to `ingest.rateLimit`. 
| Python runtime is missing | The selected ingest target needs runtime-backed SQL analysis or source parsing | Accept the interactive prompt, rerun with `--yes`, or run the suggested `ktx admin runtime install` command | | Context-source options were ignored | Query-history flags were supplied for a context-source connection | Omit database-only flags when ingesting context-source connections | | Text ingest stops early | `--fail-fast` was used and one item failed | Fix the failed item or rerun without `--fail-fast` to collect all failures | +| `--verbatim requires --text or --file` | `--verbatim` was passed without a document to store | Add `--text` or `--file`, or drop `--verbatim` | +| Inline verbatim text needs a leading heading | `--text --verbatim` content has no `# Heading` to derive a stable key | Add a leading Markdown heading, or pass the content as `--file ` | +| A different page already exists at key | A verbatim re-run targeted an existing key with a different body | Use a distinct document name/key, or remove the existing page first | +| Connection scope conflict | Frontmatter `connections` disagrees with `--connection-id` | Remove one so the intended scope is unambiguous | diff --git a/docs-site/content/docs/cli-reference/ktx-setup.mdx b/docs-site/content/docs/cli-reference/ktx-setup.mdx index 700de548..10c27c16 100644 --- a/docs-site/content/docs/cli-reference/ktx-setup.mdx +++ b/docs-site/content/docs/cli-reference/ktx-setup.mdx @@ -134,6 +134,13 @@ incomplete. MySQL, and SQL Server; `schema_names` for Snowflake; `dataset_ids` for BigQuery; and `databases` for ClickHouse. +A BigQuery `--database-schema` value may be qualified as `project.dataset` to +scan a dataset hosted in another project (such as +`bigquery-public-data.austin_311`); a bare value stays in the credentials' +project. Setup does not discover foreign-project datasets, so supply qualified +entries explicitly. 
See +[Primary sources → BigQuery](/docs/integrations/primary-sources#cross-project-datasets). + With `--no-input`, scope for a scope-bearing driver (PostgreSQL, MySQL, ClickHouse, SQL Server, BigQuery, Snowflake) must come from `--database-schema` or from existing connection config in `ktx.yaml` (for example diff --git a/docs-site/content/docs/cli-reference/ktx-wiki.mdx b/docs-site/content/docs/cli-reference/ktx-wiki.mdx index 7887a463..15461fed 100644 --- a/docs-site/content/docs/cli-reference/ktx-wiki.mdx +++ b/docs-site/content/docs/cli-reference/ktx-wiki.mdx @@ -28,10 +28,17 @@ Edit the Markdown files under `wiki/` directly, or ingest source content with | Flag | Description | Default | |------|-------------|---------| | `--user-id ` | Local user id | `local` | +| `-c, --connection ` | Scope results to one connection: unscoped pages plus pages tagged with that connection | - | | `--limit ` | Maximum search results (search mode only) | - | | `--output ` | Output mode: `pretty` (default in TTY), `plain` (TSV), or `json` | `pretty` | | `--json` | Shortcut for `--output=json` (overrides `--output`) | `false` | +`-c, --connection ` takes a connection id from the `connections` map in +`ktx.yaml` (an unknown id is rejected). It narrows both list and search to +pages that are not tied to any connection plus pages tagged with that +connection, so an agent working against one database sees only the wiki +knowledge relevant to it. + `ktx wiki ` uses hybrid search when `storage.search` is `sqlite-fts5`. **ktx** combines lexical SQLite FTS5 matches, token matches, and semantic matches from wiki page embeddings stored in `.ktx/db.sqlite`. 
If embeddings are not @@ -50,6 +57,12 @@ ktx wiki --json # Search wiki pages ktx wiki "monthly recurring revenue" +# List pages scoped to one connection (unscoped + connection-tagged) +ktx wiki --connection warehouse + +# Search within one connection's scope +ktx wiki "monthly recurring revenue" -c warehouse + # Search wiki pages as JSON ktx wiki "monthly recurring revenue" --json --limit 10 diff --git a/docs-site/content/docs/configuration/ktx-yaml.mdx b/docs-site/content/docs/configuration/ktx-yaml.mdx index 1c28b50a..24e58e39 100644 --- a/docs-site/content/docs/configuration/ktx-yaml.mdx +++ b/docs-site/content/docs/configuration/ktx-yaml.mdx @@ -124,8 +124,10 @@ context-source drivers share the map. Warehouse connections are open objects: the listed fields are validated, and any other field is preserved and passed through to the connector. Use -`enabled_tables` to scope ingest to a specific list of -`schema.table` names - useful for smoke tests. +`enabled_tables` to scope ingest to a specific list of objects - useful for +smoke tests. Each entry accepts a `catalog.db.name`, `db.name`, or bare `name` +qualifier. ktx restricts the scan to the listed objects and fails with a clear +error (naming the available objects) if none match. ```yaml connections: @@ -137,6 +139,18 @@ connections: - public.customers ``` +For SQLite, which exposes a single `main` schema, the qualified `main.` +and the bare `` forms select the same object: + +```yaml +connections: + local-db: + driver: sqlite + path: ./warehouse.db + enabled_tables: + - customers # equivalent to main.customers +``` + Connector-specific scope fields let setup and scan use the same warehouse boundary: @@ -158,6 +172,12 @@ connections: dataset_ids: [analytics, mart] ``` +A BigQuery `dataset_ids` / `dataset_id` entry may be written `project.dataset` +to introspect a dataset hosted in another project (for example +`bigquery-public-data.austin_311`); jobs still bill to the `project_id` in +`credentials_json`. 
A bare `dataset` keeps using your own project. See +[Primary sources → BigQuery](/docs/integrations/primary-sources#cross-project-datasets). + For Postgres, MySQL, SQL Server, and Snowflake connections, set `maxConnections` when scan or ingest work needs to stay below the target's connection cap. Postgres, MySQL, and SQL Server default to `10`; Snowflake @@ -554,6 +574,7 @@ scan: profileConcurrency: 4 validationConcurrency: 4 validationBudget: all + detectionBudgetMs: 600000 ``` ### Enrichment @@ -582,6 +603,7 @@ the manifest. | `relationships.profileConcurrency` | `int > 0` | `4` | Parallel relationship-profile queries against the database. For pooled connectors, effective database concurrency is also bounded by the connection's `maxConnections`. | | `relationships.validationConcurrency` | `int > 0` | `4` | Parallel relationship validation queries against the database. | | `relationships.validationBudget` | `all` \| `int ≥ 0` | runtime default | Cap on validation queries per scan. `all` means unlimited. | +| `relationships.detectionBudgetMs` | `int > 0` | `600000` | Wall-clock budget (ms) for the whole relationship-detection stage, checked at table-profile, candidate-validation, and composite-probe boundaries. On exhaustion the stage stops scheduling new work and writes the joins found so far, marked partial; descriptions and embeddings are already durable. Sits above the per-query deadline. Raise it to trigger a fresher, fuller run. | ## `agent` diff --git a/docs-site/content/docs/guides/writing-context.mdx b/docs-site/content/docs/guides/writing-context.mdx index 8703030e..33cb90e1 100644 --- a/docs-site/content/docs/guides/writing-context.mdx +++ b/docs-site/content/docs/guides/writing-context.mdx @@ -321,6 +321,23 @@ Useful frontmatter: 5. Add `sl_refs` for relevant semantic sources. 6. Search again with a user-like phrase. 
+### Ingest an authoritative document verbatim + +When the document is already the source of truth — a metric-definition sheet, a +formula spec, a runbook, compliance text — you want **ktx** to index and surface +it, not re-author it. Instead of hand-copying the file into `wiki/global/`, ingest +it verbatim: + +```bash +ktx ingest --file docs/rfm-bucket-definitions.md --verbatim +``` + +The body is stored byte-for-byte (the LLM only derives `summary`, `tags`, and +`sl_refs` for the absent frontmatter fields), the page key is derived from the +filename, and re-running is a safe no-op. Existing frontmatter — including fields +**ktx** does not model, like `effective_date` — passes through unchanged. See +[`ktx ingest`](/docs/cli-reference/ktx-ingest) for the full flag reference. + ## Review context changes Before accepting agent-written context: diff --git a/docs-site/content/docs/integrations/primary-sources.mdx b/docs-site/content/docs/integrations/primary-sources.mdx index bb0ce12a..40af30e0 100644 --- a/docs-site/content/docs/integrations/primary-sources.mdx +++ b/docs-site/content/docs/integrations/primary-sources.mdx @@ -35,7 +35,7 @@ Agents should prefer environment or file references over literal secrets. | `context.queryHistory` | No | PostgreSQL, Snowflake, BigQuery | Enables query-history ingestion when the warehouse supports it | | `path` | Yes for path-style SQLite | SQLite | Local SQLite database path or `env:NAME` reference | | `max_bytes_billed` | No | BigQuery | Maximum bytes billed per query job | -| `job_timeout_ms` | No | BigQuery | BigQuery query job timeout in milliseconds | +| `query_timeout_ms` | No | all warehouses | Maximum execution time for a single read-only query, in milliseconds (default 30000). A query exceeding it is cancelled server-side (or, for SQLite, by terminating the off-process executor) and returns a `query exceeded Ns` error so the agent can revise. 
| | `project_id` | No | BigQuery | Optional local descriptor and mapping metadata; not used for BigQuery authentication | ## PostgreSQL @@ -220,6 +220,37 @@ BigQuery dataset scope is stored in `connections..dataset_ids`. Interactive setup discovers datasets from credentials plus location, then writes the chosen dataset ids as the scan scope. +### Cross-project datasets + +To introspect a dataset hosted in a **different project** than the one your +credentials bill to — for example Google's `bigquery-public-data`, a partner's +shared project, or an organization's central data project — qualify the entry +as `project.dataset`: + +```yaml title="ktx.yaml" +connections: + public-bq: + driver: bigquery + credentials_json: file:~/.config/gcloud/bq-service-account.json + location: US + dataset_ids: + - bigquery-public-data.austin_311 + - bigquery-public-data.census_bureau_usa + - analytics +``` + +**ktx** introspects each dataset in its host project while every query job still +bills to the `project_id` inside your `credentials_json`. A bare `dataset` entry +(no prefix) is scanned in your own project, exactly as before. A single +connection may mix datasets from several projects, and two projects may host +datasets with the same name without colliding. + +Interactive setup does not enumerate datasets in projects your credentials don't +own, so hand-write `project.dataset` entries for foreign datasets. The wizard's +table picker also only lists datasets in your connection's `location` region; +this affects table selection only — ingest and `discover_data` introspect a +cross-project dataset regardless of region. + ### Authentication | Method | Config | @@ -269,7 +300,7 @@ staged artifact shape as Postgres and Snowflake. 
- Parameter binding uses named `@param` syntax - Arrays flattened to comma-separated strings in results - Location specified at query execution time -- Supports `max_bytes_billed` and `job_timeout_ms` limits from `ktx.yaml` +- Supports the `max_bytes_billed` limit from `ktx.yaml`; the shared `query_timeout_ms` field maps to the query job's `jobTimeoutMs` --- diff --git a/knip.json b/knip.json index a2d16ca8..507cef0d 100644 --- a/knip.json +++ b/knip.json @@ -17,7 +17,9 @@ "test/**/*.test-utils.ts", "test/**/acceptance-fixtures.ts", "src/context/scan/relationship-benchmarks.ts!", - "src/context/scan/relationship-benchmark-report.ts!" + "src/context/scan/relationship-benchmark-report.ts!", + "src/connectors/sqlite/read-query-child.ts!", + "src/context/llm/subprocess-generate-object-child.ts!" ] }, "docs-site": { diff --git a/packages/cli/package.json b/packages/cli/package.json index 9aa57d23..2c2cb5ac 100644 --- a/packages/cli/package.json +++ b/packages/cli/package.json @@ -78,6 +78,8 @@ "openai": "^6.38.0", "p-limit": "^7.3.0", "pg": "^8.21.0", + "pino": "^10.3.1", + "pino-pretty": "^13.1.3", "posthog-node": "^5.34.9", "react": "^19.2.6", "semver": "^7.8.1", diff --git a/packages/cli/scripts/copy-runtime-assets.mjs b/packages/cli/scripts/copy-runtime-assets.mjs index 579cb8e8..47502c21 100644 --- a/packages/cli/scripts/copy-runtime-assets.mjs +++ b/packages/cli/scripts/copy-runtime-assets.mjs @@ -7,10 +7,17 @@ const promptsSource = join(packageRoot, 'src', 'prompts'); const promptsTarget = join(packageRoot, 'dist', 'prompts'); const skillsSource = join(packageRoot, 'src', 'skills'); const skillsTarget = join(packageRoot, 'dist', 'skills'); +// Per-dialect SQL notes are markdown served by the sql_dialect_notes MCP tool; +// tsc does not emit non-.ts files, so copy them next to their compiled module. 
+const dialectNotesSource = join(packageRoot, 'src', 'context', 'sql-analysis', 'dialects'); +const dialectNotesTarget = join(packageRoot, 'dist', 'context', 'sql-analysis', 'dialects'); await rm(promptsTarget, { recursive: true, force: true }); await rm(skillsTarget, { recursive: true, force: true }); +await rm(dialectNotesTarget, { recursive: true, force: true }); await mkdir(dirname(promptsTarget), { recursive: true }); await mkdir(dirname(skillsTarget), { recursive: true }); +await mkdir(dirname(dialectNotesTarget), { recursive: true }); await cp(promptsSource, promptsTarget, { recursive: true }); await cp(skillsSource, skillsTarget, { recursive: true }); +await cp(dialectNotesSource, dialectNotesTarget, { recursive: true }); diff --git a/packages/cli/src/cli-program.ts b/packages/cli/src/cli-program.ts index f9da6552..c30711c8 100644 --- a/packages/cli/src/cli-program.ts +++ b/packages/cli/src/cli-program.ts @@ -133,7 +133,7 @@ export function parseBooleanStringOption(value: string): boolean { } export function parseSafeConnectionIdOption(value: string): string { - if (!/^[a-zA-Z0-9][a-zA-Z0-9_-]*$/.test(value)) { + if (!/^[a-zA-Z0-9_][a-zA-Z0-9_-]*$/.test(value)) { throw new InvalidArgumentError(`Unsafe connection id: ${value}`); } return value; diff --git a/packages/cli/src/commands/ingest-commands.ts b/packages/cli/src/commands/ingest-commands.ts index d7e09596..9d1c3af5 100644 --- a/packages/cli/src/commands/ingest-commands.ts +++ b/packages/cli/src/commands/ingest-commands.ts @@ -1,10 +1,12 @@ -import { type Command, Option } from '@commander-js/extra-typings'; +import { type Command, InvalidArgumentError, Option } from '@commander-js/extra-typings'; import { collectOption, type KtxCliCommandContext, parsePositiveIntegerOption, resolveCommandProjectDir, } from '../cli-program.js'; +import { KTX_SCAN_ENRICHMENT_STAGES } from '../context/scan/enrichment-state.js'; +import type { KtxScanEnrichmentStage } from '../context/scan/types.js'; import type { 
KtxCliDeps, KtxCliIo } from '../index.js'; import { runtimeInstallPolicyFromFlags } from '../managed-python-command.js'; import type { KtxPublicIngestArgs } from '../public-ingest.js'; @@ -14,6 +16,36 @@ import { resolveConnectionSelection } from './connection-selection.js'; profileMark('module:commands/ingest-commands'); +/** + * Parses `--stages` into an ordered, de-duplicated subset of the canonical + * enrichment-stage registry. An unknown or empty name is a hard parse error so + * a typo never silently degrades to "run everything." + * + * @internal + */ +export function parseEnrichmentStagesOption(value: string): KtxScanEnrichmentStage[] { + const names = value + .split(',') + .map((name) => name.trim()) + .filter((name) => name.length > 0); + if (names.length === 0) { + throw new InvalidArgumentError( + `must be a non-empty comma-separated list of stages (${KTX_SCAN_ENRICHMENT_STAGES.join(', ')})`, + ); + } + const valid = new Set(KTX_SCAN_ENRICHMENT_STAGES); + const selected = new Set(); + for (const name of names) { + if (!valid.has(name)) { + throw new InvalidArgumentError( + `unknown stage "${name}"; valid stages are ${KTX_SCAN_ENRICHMENT_STAGES.join(', ')}`, + ); + } + selected.add(name as KtxScanEnrichmentStage); + } + return KTX_SCAN_ENRICHMENT_STAGES.filter((stage) => selected.has(stage)); +} + interface IngestCommandOptions { runTextIngest: (args: KtxTextIngestArgs, io: KtxCliIo, deps: KtxCliDeps) => Promise; } @@ -32,8 +64,18 @@ export function registerIngestCommands( .addOption(new Option('--query-history', 'Include database query-history usage patterns').conflicts('noQueryHistory')) .addOption(new Option('--no-query-history', 'Skip database query-history usage patterns')) .option('--query-history-window-days ', 'Query-history lookback window for this run', parsePositiveIntegerOption) + .option( + '--stages ', + 'Comma-separated enrichment stages to (re)run (descriptions,embeddings,relationships); omit to run all', + parseEnrichmentStagesOption, + 
) .option('--text ', 'Capture inline text into ktx memory; repeatable', collectOption, []) .option('--file ', 'Capture a text file into ktx memory; use - for stdin; repeatable', collectOption, []) + .option( + '--verbatim', + 'Store each --text/--file document body unchanged as a GLOBAL wiki page; the LLM derives only metadata', + false, + ) .option('--connection-id ', 'ktx connection id to tag captured text/file notes') .option('--user-id ', 'Memory user id for text/file capture attribution', 'local-cli') .option('--fail-fast', 'Stop after the first failed text/file item', false) @@ -47,6 +89,14 @@ export function registerIngestCommands( const projectDir = resolveCommandProjectDir(command); const hasTextCapture = options.text.length > 0 || options.file.length > 0; + if (options.verbatim === true && !hasTextCapture) { + command.error('error: --verbatim requires --text or --file'); + } + + if (options.stages !== undefined && hasTextCapture) { + command.error('error: --stages applies to database ingest only; it cannot be combined with --text or --file'); + } + if (hasTextCapture) { if (connectionId !== undefined) { command.error( @@ -66,6 +116,7 @@ export function registerIngestCommands( userId: options.userId, json: options.json === true, failFast: options.failFast === true, + ...(options.verbatim === true ? { verbatim: true } : {}), }, context.io, context.deps, @@ -87,6 +138,7 @@ export function registerIngestCommands( inputMode: options.input === false ? 'disabled' : 'auto', queryHistory, ...(options.queryHistoryWindowDays !== undefined ? { queryHistoryWindowDays: options.queryHistoryWindowDays } : {}), + ...(options.stages ? 
{ stages: options.stages } : {}), cliVersion: context.packageInfo.version, runtimeInstallPolicy: runtimeInstallPolicyFromFlags(options), }; diff --git a/packages/cli/src/commands/knowledge-commands.ts b/packages/cli/src/commands/knowledge-commands.ts index b601b688..ee537458 100644 --- a/packages/cli/src/commands/knowledge-commands.ts +++ b/packages/cli/src/commands/knowledge-commands.ts @@ -27,6 +27,7 @@ export function registerWikiCommands(program: Command, context: KtxCliCommandCon .usage('[options] [query...]') .argument('[query...]', 'Search query; omit to list all pages') .option('--user-id ', 'Local user id', 'local') + .option('-c, --connection ', 'Scope results to one connection (unscoped pages plus pages tagged with it)') .option('--limit ', 'Maximum search results (search mode only)', parsePositiveIntegerOption) .addOption( new Option('--output ', 'Output mode: pretty (default in TTY), plain (TSV), or json').choices([ @@ -46,6 +47,7 @@ export function registerWikiCommands(program: Command, context: KtxCliCommandCon query: string[], options: { userId: string; + connection?: string; limit?: number; output?: 'pretty' | 'plain' | 'json'; json?: boolean; @@ -57,6 +59,7 @@ export function registerWikiCommands(program: Command, context: KtxCliCommandCon command: 'list', projectDir: resolveCommandProjectDir(command), userId: options.userId, + ...(options.connection !== undefined ? { connectionId: options.connection } : {}), output: options.output, json: options.json, cliVersion: context.packageInfo.version, @@ -68,6 +71,7 @@ export function registerWikiCommands(program: Command, context: KtxCliCommandCon projectDir: resolveCommandProjectDir(command), query: query.join(' '), userId: options.userId, + ...(options.connection !== undefined ? { connectionId: options.connection } : {}), output: options.output, json: options.json, ...(isDebugEnabled(command) ? 
{ debug: true } : {}), diff --git a/packages/cli/src/connection-drivers.ts b/packages/cli/src/connection-drivers.ts index be98746b..511637f5 100644 --- a/packages/cli/src/connection-drivers.ts +++ b/packages/cli/src/connection-drivers.ts @@ -1,6 +1,7 @@ import type { KtxProjectConnectionConfig } from './context/project/config.js'; -const KTX_DATABASE_DRIVER_IDS = new Set([ +/** @internal Canonical SQL-warehouse driver ids; the dialect-notes coverage test derives its required coverage from this set. */ +export const KTX_DATABASE_DRIVER_IDS = [ 'sqlite', 'postgres', 'mysql', @@ -8,8 +9,11 @@ const KTX_DATABASE_DRIVER_IDS = new Set([ 'sqlserver', 'bigquery', 'snowflake', - 'mongodb', -]); +] as const; + +// mongodb is a database driver but has no SQL dialect, so it sits outside the +// dialect-notes coverage set above. +const databaseDriverIds = new Set([...KTX_DATABASE_DRIVER_IDS, 'mongodb']); export function normalizeConnectionDriver(connection: KtxProjectConnectionConfig): string { return String(connection.driver ?? 
'') @@ -18,5 +22,5 @@ export function normalizeConnectionDriver(connection: KtxProjectConnectionConfig } export function isDatabaseDriver(driver: string): boolean { - return KTX_DATABASE_DRIVER_IDS.has(driver.trim().toLowerCase()); + return databaseDriverIds.has(driver.trim().toLowerCase()); } diff --git a/packages/cli/src/connectors/bigquery/connector.ts b/packages/cli/src/connectors/bigquery/connector.ts index e4e284b6..69a166d5 100644 --- a/packages/cli/src/connectors/bigquery/connector.ts +++ b/packages/cli/src/connectors/bigquery/connector.ts @@ -1,8 +1,14 @@ import { BigQuery, type TableField } from '@google-cloud/bigquery'; -import { normalizeBigQueryProjectId, normalizeBigQueryRegion } from '../../context/connections/bigquery-identifiers.js'; +import { + normalizeBigQueryDatasetId, + normalizeBigQueryProjectId, + normalizeBigQueryRegion, +} from '../../context/connections/bigquery-identifiers.js'; import { getSqlDialectForDriver } from '../../context/connections/dialects.js'; +import { resolveQueryDeadlineMs, queryDeadlineExceededError } from '../../context/connections/query-deadline.js'; import { assertReadOnlySql, limitSqlForExecution } from '../../context/connections/read-only-sql.js'; import { tryConstraintQuery } from '../../context/scan/constraint-discovery.js'; +import { tryIntrospectObject } from '../../context/scan/object-introspection.js'; import { scopedTableNames } from '../../context/scan/table-ref.js'; import { connectorTestFailure, @@ -35,14 +41,25 @@ export interface KtxBigQueryConnectionConfig { credentials_json?: string; location?: string; max_bytes_billed?: number | string; - job_timeout_ms?: number; + query_timeout_ms?: number; [key: string]: unknown; } +/** + * A dataset to introspect, paired with the project that hosts it. `project` + * defaults to the billing project (`credentials.project_id`) when an entry has + * no `project.` prefix; a fully-qualified `project.dataset` entry resolves to + * its own host project. 
Jobs always bill in `credentials.project_id`. + */ +export interface BigQueryDatasetRef { + project: string; + dataset: string; +} + export interface KtxBigQueryResolvedConnectionConfig { projectId: string; credentials: Record; - datasetIds: string[]; + datasetIds: BigQueryDatasetRef[]; location?: string; } @@ -95,7 +112,7 @@ export interface KtxBigQueryDataset { export interface KtxBigQueryClient { getDatasets(input?: { maxResults?: number }): Promise<[Array<{ id?: string }>, ...unknown[]]>; - dataset(datasetId: string): KtxBigQueryDataset; + dataset(datasetId: string, projectId: string): KtxBigQueryDataset; createQueryJob(input: { query: string; location?: string; @@ -116,7 +133,6 @@ export interface KtxBigQueryScanConnectorOptions { env?: NodeJS.ProcessEnv; now?: () => Date; maxBytesBilled?: number | string; - queryTimeoutMs?: number; } class DefaultBigQueryClientFactory implements KtxBigQueryClientFactory { @@ -124,8 +140,8 @@ class DefaultBigQueryClientFactory implements KtxBigQueryClientFactory { const client = new BigQuery(input); return { getDatasets: (options) => client.getDatasets(options) as Promise<[Array<{ id?: string }>, ...unknown[]]>, - dataset: (datasetId) => { - const dataset = client.dataset(datasetId); + dataset: (datasetId, projectId) => { + const dataset = client.dataset(datasetId, { projectId }); return { get: () => dataset.get() as Promise, getTables: () => dataset.getTables() as Promise<[KtxBigQueryTableRef[], ...unknown[]]>, @@ -145,14 +161,48 @@ function stringConfigValue( return typeof value === 'string' && value.trim().length > 0 ? 
resolveStringReference(value.trim(), env) : undefined; } -function datasetIds(connection: KtxBigQueryConnectionConfig, env: NodeJS.ProcessEnv): string[] { - if (Array.isArray(connection.dataset_ids) && connection.dataset_ids.length > 0) { - return connection.dataset_ids - .filter((dataset) => dataset.trim().length > 0) - .map((dataset) => resolveStringReference(dataset, env)); +/** + * Parse one `dataset_ids` / `dataset_id` entry into a canonical + * {@link BigQueryDatasetRef}. A `project.dataset` prefix selects the host + * project; a bare entry defaults to `defaultProject` (the billing project). + * More than one dot, or an empty segment, is a config error naming the + * connection — never a silent mis-introspection at scan time. + */ +function parseBigQueryDatasetEntry(entry: string, defaultProject: string, connectionId: string): BigQueryDatasetRef { + const context = `connections.${connectionId}.dataset_ids entry "${entry}"`; + const parts = entry.split('.'); + if (parts.length === 1) { + return { project: defaultProject, dataset: normalizeBigQueryDatasetId(parts[0]!, context) }; } - const datasetId = stringConfigValue(connection, 'dataset_id', env); - return datasetId ? [datasetId] : []; + if (parts.length === 2) { + const [project, dataset] = parts; + if (!project || !dataset) { + throw new Error(`Invalid BigQuery dataset entry for ${context}: empty project or dataset segment`); + } + return { + project: normalizeBigQueryProjectId(project, context), + dataset: normalizeBigQueryDatasetId(dataset, context), + }; + } + throw new Error( + `Invalid BigQuery dataset entry for ${context}: expected "dataset" or "project.dataset", got more than one "."`, + ); +} + +function resolveDatasetRefs( + connection: KtxBigQueryConnectionConfig, + env: NodeJS.ProcessEnv, + defaultProject: string, + connectionId: string, +): BigQueryDatasetRef[] { + const rawEntries = + Array.isArray(connection.dataset_ids) && connection.dataset_ids.length > 0 + ? 
connection.dataset_ids.map((dataset) => resolveStringReference(dataset, env)) + : [stringConfigValue(connection, 'dataset_id', env)].filter((value): value is string => Boolean(value)); + return rawEntries + .map((entry) => entry.trim()) + .filter((entry) => entry.length > 0) + .map((entry) => parseBigQueryDatasetEntry(entry, defaultProject, connectionId)); } function bigQueryMaxBytesBilledFromConnection( @@ -169,12 +219,25 @@ function bigQueryMaxBytesBilledFromConnection( return undefined; } -function bigQueryJobTimeoutMsFromConnection(connection: KtxBigQueryConnectionConfig | undefined): number | undefined { - const value = connection?.job_timeout_ms; - if (typeof value !== 'number') { - return undefined; +// jobTimeoutMs cancels the job with a "Job timed out" message (or a timeout +// reason in the errors array) once the deadline elapses. +function isBigQueryTimeoutError(error: unknown): boolean { + if (!error || typeof error !== 'object') { + return false; } - return Number.isInteger(value) && value > 0 ? 
value : undefined; + const topMessage = (error as { message?: unknown }).message; + if (typeof topMessage === 'string' && /timed out|timeout/i.test(topMessage)) { + return true; + } + const errors = (error as { errors?: unknown }).errors; + return ( + Array.isArray(errors) && + errors.some((entry) => { + const reason = (entry as { reason?: unknown })?.reason; + const message = (entry as { message?: unknown })?.message; + return reason === 'timeout' || (typeof message === 'string' && /timed out|timeout/i.test(message)); + }) + ); } function tableKind(metadataType: string | undefined): KtxSchemaTable['kind'] { @@ -267,7 +330,7 @@ export function bigQueryConnectionConfigFromConfig(input: { if (!projectId) { throw new Error(`Native BigQuery connector requires credentials_json.project_id for connections.${input.connectionId}`); } - const resolvedDatasetIds = datasetIds(input.connection, env); + const resolvedDatasetIds = resolveDatasetRefs(input.connection, env, projectId, input.connectionId); const location = stringConfigValue(input.connection, 'location', env); return { projectId, credentials, datasetIds: resolvedDatasetIds, ...(location ? { location } : {}) }; } @@ -290,7 +353,7 @@ export class KtxBigQueryScanConnector implements KtxScanConnector { private readonly clientFactory: KtxBigQueryClientFactory; private readonly now: () => Date; private readonly maxBytesBilled?: number | string; - private readonly queryTimeoutMs?: number; + private readonly deadlineMs: number; private readonly dialect = getSqlDialectForDriver('bigquery'); private client: KtxBigQueryClient | null = null; @@ -304,7 +367,7 @@ export class KtxBigQueryScanConnector implements KtxScanConnector { this.clientFactory = options.clientFactory ?? new DefaultBigQueryClientFactory(); this.now = options.now ?? (() => new Date()); this.maxBytesBilled = options.maxBytesBilled ?? bigQueryMaxBytesBilledFromConnection(options.connection); - this.queryTimeoutMs = options.queryTimeoutMs ?? 
bigQueryJobTimeoutMsFromConnection(options.connection); + this.deadlineMs = resolveQueryDeadlineMs(options.connection); this.id = `bigquery:${options.connectionId}`; } @@ -312,8 +375,8 @@ export class KtxBigQueryScanConnector implements KtxScanConnector { try { const client = this.getClient(); await client.getDatasets({ maxResults: 1 }); - for (const datasetId of this.resolved.datasetIds) { - await client.dataset(datasetId).get(); + for (const ref of this.resolved.datasetIds) { + await client.dataset(ref.dataset, ref.project).get(); } return { success: true }; } catch (error) { @@ -324,22 +387,23 @@ export class KtxBigQueryScanConnector implements KtxScanConnector { async introspect(input: KtxScanInput, _ctx: KtxScanContext): Promise { this.assertConnection(input.connectionId); const tables: KtxSchemaTable[] = []; - const datasetIds = this.requireDatasetIdsForScan(); + const datasetRefs = this.requireDatasetIdsForScan(); const snapshotWarnings: KtxScanWarning[] = []; - for (const datasetId of datasetIds) { + for (const ref of datasetRefs) { const scopedNames = input.tableScope - ? scopedTableNames(input.tableScope, { catalog: this.resolved.projectId, db: datasetId }) + ? 
scopedTableNames(input.tableScope, { catalog: ref.project, db: ref.dataset }) : null; - tables.push(...(await this.introspectDataset(datasetId, scopedNames, snapshotWarnings))); + tables.push(...(await this.introspectDataset(ref, scopedNames, snapshotWarnings))); } + const datasetLabels = datasetRefs.map((ref) => this.qualifiedDatasetLabel(ref)); return { connectionId: this.connectionId, driver: 'bigquery', extractedAt: this.now().toISOString(), - scope: { catalogs: [this.resolved.projectId], datasets: datasetIds }, + scope: { catalogs: [...new Set(datasetRefs.map((ref) => ref.project))], datasets: datasetLabels }, metadata: { project_id: this.resolved.projectId, - datasets: datasetIds, + datasets: datasetLabels, table_count: tables.length, total_columns: tables.reduce((sum, table) => sum + table.columns.length, 0), }, @@ -400,11 +464,14 @@ export class KtxBigQueryScanConnector implements KtxScanConnector { return { values: valueRows.filter((row) => row.val !== null).map((row) => String(row.val)), cardinality }; } - async getTableRowCount(tableName: string, datasetId = this.resolved.datasetIds[0]): Promise { - if (!datasetId) { + async getTableRowCount( + tableName: string, + ref: BigQueryDatasetRef | undefined = this.resolved.datasetIds[0], + ): Promise { + if (!ref) { return 0; } - const tables = await this.introspectDataset(datasetId, null, []); + const tables = await this.introspectDataset(ref, null, []); return tables.find((table) => table.name === tableName)?.estimatedRows ?? 0; } @@ -422,12 +489,28 @@ export class KtxBigQueryScanConnector implements KtxScanConnector { } async listTables(datasetIds?: string[]): Promise { - const projectId = normalizeBigQueryProjectId(this.resolved.projectId, 'table discovery'); const region = normalizeBigQueryRegion(this.resolved.location ?? 
'US', 'table discovery'); + if (!datasetIds || datasetIds.length === 0) { + return this.listTablesInProject(this.resolved.projectId, region); + } + const datasetsByProject = new Map(); + for (const entry of datasetIds) { + const ref = parseBigQueryDatasetEntry(entry.trim(), this.resolved.projectId, this.connectionId); + datasetsByProject.set(ref.project, [...(datasetsByProject.get(ref.project) ?? []), ref.dataset]); + } + const entries: KtxTableListEntry[] = []; + for (const [project, datasets] of datasetsByProject) { + entries.push(...(await this.listTablesInProject(project, region, datasets))); + } + return entries; + } + + private async listTablesInProject(project: string, region: string, datasets?: string[]): Promise { + const projectId = normalizeBigQueryProjectId(project, 'table discovery'); const params: Record = {}; - const filter = datasetIds && datasetIds.length > 0 ? 'AND table_schema IN UNNEST(@dataset_ids)' : ''; - if (datasetIds && datasetIds.length > 0) { - params.dataset_ids = datasetIds; + const filter = datasets && datasets.length > 0 ? 
'AND table_schema IN UNNEST(@dataset_ids)' : ''; + if (datasets && datasets.length > 0) { + params.dataset_ids = datasets; } const rows = await this.queryRaw<{ table_schema: string; table_name: string; table_type: string }>( ` @@ -442,7 +525,7 @@ export class KtxBigQueryScanConnector implements KtxScanConnector { params, ); return rows.map((row) => ({ - catalog: this.resolved.projectId, + catalog: project, schema: row.table_schema, name: row.table_name, kind: @@ -466,34 +549,48 @@ export class KtxBigQueryScanConnector implements KtxScanConnector { return this.client; } - private requireDatasetIdsForScan(): string[] { + private requireDatasetIdsForScan(): BigQueryDatasetRef[] { if (this.resolved.datasetIds.length === 0) { throw new Error(`Native BigQuery scan requires connections.${this.connectionId}.dataset_ids or dataset_id`); } return this.resolved.datasetIds; } + // Bare in the billing project, qualified `project.dataset` otherwise, so the + // snapshot's scope/metadata stay unambiguous when two projects host the same + // dataset name. The dotless form is the unchanged single-project label. + private qualifiedDatasetLabel(ref: BigQueryDatasetRef): string { + return ref.project === this.resolved.projectId ? ref.dataset : `${ref.project}.${ref.dataset}`; + } + private async query(sql: string, params?: Record): Promise { - const [job] = await this.getClient().createQueryJob({ - query: sql, - ...(this.resolved.location ? { location: this.resolved.location } : {}), - ...(params && Object.keys(params).length > 0 ? { params } : {}), - ...(this.maxBytesBilled ? { maximumBytesBilled: String(this.maxBytesBilled) } : {}), - ...(this.queryTimeoutMs ? { jobTimeoutMs: this.queryTimeoutMs } : {}), - }); - const [rows, , response] = await job.getQueryResults(); - let headers = response?.schema?.fields?.map((field) => field.name || '') ?? []; - const headerTypes = response?.schema?.fields?.map((field) => String(field.type || 'STRING')) ?? 
[]; - if (headers.length === 0 && rows.length > 0) { - headers = Object.keys(rows[0]!); + try { + const [job] = await this.getClient().createQueryJob({ + query: sql, + ...(this.resolved.location ? { location: this.resolved.location } : {}), + ...(params && Object.keys(params).length > 0 ? { params } : {}), + ...(this.maxBytesBilled ? { maximumBytesBilled: String(this.maxBytesBilled) } : {}), + jobTimeoutMs: this.deadlineMs, + }); + const [rows, , response] = await job.getQueryResults(); + let headers = response?.schema?.fields?.map((field) => field.name || '') ?? []; + const headerTypes = response?.schema?.fields?.map((field) => String(field.type || 'STRING')) ?? []; + if (headers.length === 0 && rows.length > 0) { + headers = Object.keys(rows[0]!); + } + return { + headers, + headerTypes: headerTypes.length > 0 ? headerTypes : undefined, + rows: rows.map((row) => headers.map((header) => normalizeValue(row[header]))), + totalRows: rows.length, + rowCount: rows.length, + }; + } catch (error) { + if (isBigQueryTimeoutError(error)) { + throw queryDeadlineExceededError(this.deadlineMs, { cause: error }); + } + throw error; } - return { - headers, - headerTypes: headerTypes.length > 0 ? headerTypes : undefined, - rows: rows.map((row) => headers.map((header) => normalizeValue(row[header]))), - totalRows: rows.length, - rowCount: rows.length, - }; } private async queryRaw>(sql: string, params?: Record): Promise { @@ -507,18 +604,18 @@ export class KtxBigQueryScanConnector implements KtxScanConnector { } private async introspectDataset( - datasetId: string, + ref: BigQueryDatasetRef, scopedNames: readonly string[] | null, snapshotWarnings: KtxScanWarning[], ): Promise { if (scopedNames && scopedNames.length === 0) return []; - const dataset = this.getClient().dataset(datasetId); + const dataset = this.getClient().dataset(ref.dataset, ref.project); const [tableRefs] = await dataset.getTables(); const scopeSet = scopedNames ? 
new Set(scopedNames) : null; const filteredTableRefs = scopeSet ? tableRefs.filter((tableRef) => scopeSet.has(tableRef.id ?? '')) : tableRefs; const primaryKeysResult = await tryConstraintQuery( - { schema: datasetId, kind: 'primary_key', isDeniedError }, - () => this.primaryKeys(datasetId), + { schema: ref.dataset, kind: 'primary_key', isDeniedError }, + () => this.primaryKeys(ref), ); const primaryKeys = primaryKeysResult.ok ? primaryKeysResult.value : new Map>(); if (!primaryKeysResult.ok) { @@ -527,41 +624,51 @@ export class KtxBigQueryScanConnector implements KtxScanConnector { const tables: KtxSchemaTable[] = []; for (const tableRef of filteredTableRefs) { const tableName = tableRef.id || ''; - const [table] = await tableRef.get(); - const fields = table.metadata.schema?.fields ?? []; - tables.push({ - catalog: this.resolved.projectId, - db: datasetId, - name: tableName, - kind: tableKind(table.metadata.type), - comment: table.metadata.description || null, - estimatedRows: firstNumber(table.metadata.numRows) ?? 0, - columns: fields.map((field) => this.toSchemaColumn(tableName, field, primaryKeys)), - foreignKeys: [], - }); + const outcome = await tryIntrospectObject( + { object: tableName, catalog: ref.project, db: ref.dataset }, + async () => { + const [table] = await tableRef.get(); + const fields = table.metadata.schema?.fields ?? []; + return { + catalog: ref.project, + db: ref.dataset, + name: tableName, + kind: tableKind(table.metadata.type), + comment: table.metadata.description || null, + estimatedRows: firstNumber(table.metadata.numRows) ?? 
0, + columns: fields.map((field) => this.toSchemaColumn(tableName, field, primaryKeys)), + foreignKeys: [], + }; + }, + ); + if (outcome.ok) { + tables.push(outcome.table); + } else { + snapshotWarnings.push(outcome.warning); + } } return tables; } - private async primaryKeys(datasetId: string): Promise>> { + private async primaryKeys(ref: BigQueryDatasetRef): Promise>> { const rows = await this.queryRaw<{ table_name: string; column_name: string }>( 'SELECT tc.table_name, kcu.column_name ' + 'FROM `' + - this.resolved.projectId + + ref.project + '.' + - datasetId + + ref.dataset + '.INFORMATION_SCHEMA.TABLE_CONSTRAINTS` tc ' + 'JOIN `' + - this.resolved.projectId + + ref.project + '.' + - datasetId + + ref.dataset + '.INFORMATION_SCHEMA.KEY_COLUMN_USAGE` kcu ' + 'ON tc.constraint_name = kcu.constraint_name ' + 'AND tc.table_schema = kcu.table_schema ' + 'AND tc.table_name = kcu.table_name ' + "WHERE tc.constraint_type = 'PRIMARY KEY' " + "AND tc.table_schema = '" + - datasetId + + ref.dataset + "' " + "AND NOT REGEXP_CONTAINS(kcu.column_name, r'^(stacksync_record_id|sync_primary_key)_') " + 'ORDER BY tc.table_name, kcu.ordinal_position', diff --git a/packages/cli/src/connectors/clickhouse/connector.ts b/packages/cli/src/connectors/clickhouse/connector.ts index e08cc732..2f95ea4d 100644 --- a/packages/cli/src/connectors/clickhouse/connector.ts +++ b/packages/cli/src/connectors/clickhouse/connector.ts @@ -1,5 +1,6 @@ import { createClient } from '@clickhouse/client'; import { getSqlDialectForDriver } from '../../context/connections/dialects.js'; +import { resolveQueryDeadlineMs, queryDeadlineExceededError } from '../../context/connections/query-deadline.js'; import { assertReadOnlySql, limitSqlForExecution } from '../../context/connections/read-only-sql.js'; import { connectorTestFailure, createKtxConnectorCapabilities, type KtxConnectorTestResult, type KtxColumnSampleInput, type KtxColumnSampleResult, type KtxColumnStatsInput, type KtxColumnStatsResult, type 
KtxQueryResult, type KtxReadOnlyQueryInput, type KtxScanConnector, type KtxScanContext, type KtxScanInput, type KtxSchemaColumn, type KtxSchemaSnapshot, type KtxSchemaTable, type KtxTableRef, type KtxTableSampleInput, type KtxTableListEntry, type KtxTableSampleResult } from '../../context/scan/types.js'; import { scopedTableNames } from '../../context/scan/table-ref.js'; @@ -144,6 +145,21 @@ function maybeNumber(value: unknown): number | undefined { return typeof value === 'number' && Number.isFinite(value) ? value : undefined; } +// ClickHouse error code 159 = TIMEOUT_EXCEEDED, raised when max_execution_time +// is hit. The client surfaces it via a numeric/string `code` or a "Code: 159" +// message prefix depending on transport. +function isClickHouseTimeoutError(error: unknown): boolean { + if (!error || typeof error !== 'object') { + return false; + } + const code = (error as { code?: unknown }).code; + if (code === 159 || code === '159') { + return true; + } + const message = (error as { message?: unknown }).message; + return typeof message === 'string' && (/\bCode:\s*159\b/.test(message) || message.includes('TIMEOUT_EXCEEDED')); +} + function parseClickHouseUrl(url: string): Partial { const parsed = new URL(url); return { @@ -284,6 +300,7 @@ export class KtxClickHouseScanConnector implements KtxScanConnector { private readonly clientFactory: KtxClickHouseClientFactory; private readonly endpointResolver?: KtxClickHouseEndpointResolver; private readonly now: () => Date; + private readonly deadlineMs: number; private readonly dialect = getSqlDialectForDriver('clickhouse'); private client: KtxClickHouseClient | null = null; private resolvedEndpoint: KtxClickHouseResolvedEndpoint | null = null; @@ -299,6 +316,7 @@ export class KtxClickHouseScanConnector implements KtxScanConnector { this.clientFactory = options.clientFactory ?? new DefaultClickHouseClientFactory(); this.endpointResolver = options.endpointResolver; this.now = options.now ?? 
(() => new Date()); + this.deadlineMs = resolveQueryDeadlineMs(this.connection); this.id = `clickhouse:${options.connectionId}`; } @@ -584,9 +602,13 @@ export class KtxClickHouseScanConnector implements KtxScanConnector { username: config.username, password: config.password ?? '', database: config.database, - request_timeout: 30_000, + // The server aborts at max_execution_time (seconds); request_timeout must + // outlast it so the HTTP client receives the code-159 error instead of + // giving up first and leaving the query running. + request_timeout: this.deadlineMs + 5_000, clickhouse_settings: { output_format_json_quote_64bit_integers: 1, + max_execution_time: Math.ceil(this.deadlineMs / 1000), }, ...(isProxied && config.ssl ? { @@ -613,19 +635,26 @@ export class KtxClickHouseScanConnector implements KtxScanConnector { private async query(sql: string, params?: Record): Promise> { const client = await this.clientForQuery(); - const resultSet = await client.query({ - query: assertReadOnlySql(sql), - format: 'JSONCompact', - ...(params ? { query_params: params } : {}), - }); - const response = (await resultSet.json()) as ClickHouseCompactResponse; - const meta = response.meta ?? []; - return { - headers: meta.map((field) => field.name), - headerTypes: meta.map((field) => field.type), - rows: response.data ?? [], - totalRows: response.rows ?? response.data?.length ?? 0, - }; + try { + const resultSet = await client.query({ + query: assertReadOnlySql(sql), + format: 'JSONCompact', + ...(params ? { query_params: params } : {}), + }); + const response = (await resultSet.json()) as ClickHouseCompactResponse; + const meta = response.meta ?? []; + return { + headers: meta.map((field) => field.name), + headerTypes: meta.map((field) => field.type), + rows: response.data ?? [], + totalRows: response.rows ?? response.data?.length ?? 
0, + }; + } catch (error) { + if (isClickHouseTimeoutError(error)) { + throw queryDeadlineExceededError(this.deadlineMs, { cause: error }); + } + throw error; + } } private assertConnection(connectionId: string): void { diff --git a/packages/cli/src/connectors/mysql/connector.ts b/packages/cli/src/connectors/mysql/connector.ts index f3631d5b..62bdbb00 100644 --- a/packages/cli/src/connectors/mysql/connector.ts +++ b/packages/cli/src/connectors/mysql/connector.ts @@ -1,5 +1,6 @@ import mysql, { type FieldPacket, type Pool, type RowDataPacket } from 'mysql2/promise'; import { getSqlDialectForDriver } from '../../context/connections/dialects.js'; +import { resolveQueryDeadlineMs, queryDeadlineExceededError } from '../../context/connections/query-deadline.js'; import { resolveStringReference } from '../shared/string-reference.js'; import { assertReadOnlySql, limitSqlForExecution } from '../../context/connections/read-only-sql.js'; import { @@ -282,6 +283,11 @@ function isDeniedError(error: unknown): boolean { ); } +// errno 3024 = ER_QUERY_TIMEOUT, raised when max_execution_time is exceeded. +function isMysqlTimeoutError(error: unknown): boolean { + return Boolean(error) && typeof error === 'object' && (error as { errno?: unknown }).errno === 3024; +} + function pushConstraintWarnings( warnings: KtxScanWarning[], schemas: readonly string[], @@ -391,6 +397,7 @@ export class KtxMysqlScanConnector implements KtxScanConnector { private readonly poolFactory: KtxMysqlPoolFactory; private readonly endpointResolver?: KtxMysqlEndpointResolver; private readonly now: () => Date; + private readonly deadlineMs: number; private readonly dialect = getSqlDialectForDriver('mysql'); private pool: KtxMysqlPool | null = null; private resolvedEndpoint: KtxMysqlResolvedEndpoint | null = null; @@ -406,6 +413,7 @@ export class KtxMysqlScanConnector implements KtxScanConnector { this.poolFactory = options.poolFactory ?? 
new DefaultMysqlPoolFactory(); this.endpointResolver = options.endpointResolver; this.now = options.now ?? (() => new Date()); + this.deadlineMs = resolveQueryDeadlineMs(this.connection); this.id = `mysql:${options.connectionId}`; } @@ -763,6 +771,9 @@ export class KtxMysqlScanConnector implements KtxScanConnector { const pool = await this.poolForQuery(); const connection = await pool.getConnection(); try { + // max_execution_time (ms) bounds read-only SELECTs server-side; our path + // only runs SELECT/WITH, so the session setting always applies. + await connection.query('SET SESSION max_execution_time = ?', [this.deadlineMs]); const [rows, fields] = await connection.query(assertReadOnlySql(sql), queryParams(params)); const headers = fields.map((field) => field.name); const headerTypes = fields.map((field) => String(field.type ?? 'unknown')); @@ -772,6 +783,11 @@ export class KtxMysqlScanConnector implements KtxScanConnector { rows: rows.map((row) => headers.map((header) => row[header])), totalRows: rows.length, }; + } catch (error) { + if (isMysqlTimeoutError(error)) { + throw queryDeadlineExceededError(this.deadlineMs, { cause: error }); + } + throw error; } finally { connection.release(); } diff --git a/packages/cli/src/connectors/postgres/connector.ts b/packages/cli/src/connectors/postgres/connector.ts index f863ed94..c2c0f6db 100644 --- a/packages/cli/src/connectors/postgres/connector.ts +++ b/packages/cli/src/connectors/postgres/connector.ts @@ -1,5 +1,6 @@ import { resolveStringReference } from '../shared/string-reference.js'; import { getSqlDialectForDriver } from '../../context/connections/dialects.js'; +import { resolveQueryDeadlineMs, queryDeadlineExceededError } from '../../context/connections/query-deadline.js'; import { assertReadOnlySql, limitSqlForExecution } from '../../context/connections/read-only-sql.js'; import { tryConstraintQuery } from '../../context/scan/constraint-discovery.js'; import { scopedTableNames } from 
'../../context/scan/table-ref.js'; @@ -260,6 +261,11 @@ function isDeniedError(error: unknown): boolean { return code === '42501' || code === '42P01'; } +// 57014 = query_canceled, which is how statement_timeout surfaces. +function isPostgresTimeoutError(error: unknown): boolean { + return Boolean(error) && typeof error === 'object' && (error as { code?: unknown }).code === '57014'; +} + function queryRows(result: KtxPostgresQueryResult): unknown[][] { const headers = (result.fields ?? []).map((field) => field.name); return result.rows.map((row) => headers.map((header) => row[header])); @@ -384,9 +390,13 @@ export function postgresPoolConfigFromConfig(input: { : { host, port: numberValue(merged.port) ?? 5432, database, user, password }), }; const searchPathSchemas = searchPathSchemasFromConnection(merged); + // statement_timeout (ms) bounds every query on connections from this pool, so + // the server itself aborts a runaway query and frees the connection cleanly. + const serverOptions = [`-c statement_timeout=${resolveQueryDeadlineMs(merged)}`]; if (searchPathSchemas.length > 0) { - config.options = `-c search_path=${searchPathSchemas.join(',')}`; + serverOptions.unshift(`-c search_path=${searchPathSchemas.join(',')}`); } + config.options = serverOptions.join(' '); if (merged.ssl && sslmode !== 'prefer' && sslmode !== 'disable') { config.ssl = { rejectUnauthorized: merged.rejectUnauthorized ?? true }; } @@ -412,6 +422,7 @@ export class KtxPostgresScanConnector implements KtxScanConnector { private readonly poolFactory: KtxPostgresPoolFactory; private readonly endpointResolver?: KtxPostgresEndpointResolver; private readonly now: () => Date; + private readonly deadlineMs: number; private readonly dialect = getSqlDialectForDriver('postgres'); private pool: KtxPostgresPool | null = null; private lastIdlePoolError: Error | null = null; @@ -428,6 +439,7 @@ export class KtxPostgresScanConnector implements KtxScanConnector { this.poolFactory = options.poolFactory ?? 
new DefaultPostgresPoolFactory(); this.endpointResolver = options.endpointResolver; this.now = options.now ?? (() => new Date()); + this.deadlineMs = resolveQueryDeadlineMs(this.connection); this.id = `postgres:${options.connectionId}`; } @@ -819,6 +831,11 @@ export class KtxPostgresScanConnector implements KtxScanConnector { totalRows: result.rows.length, rowCount: result.rows.length, }; + } catch (error) { + if (isPostgresTimeoutError(error)) { + throw queryDeadlineExceededError(this.deadlineMs, { cause: error }); + } + throw error; } finally { client.release(); } diff --git a/packages/cli/src/connectors/snowflake/connector.ts b/packages/cli/src/connectors/snowflake/connector.ts index 5b1c5bfa..fdb8dba3 100644 --- a/packages/cli/src/connectors/snowflake/connector.ts +++ b/packages/cli/src/connectors/snowflake/connector.ts @@ -1,5 +1,6 @@ import { createPrivateKey } from 'node:crypto'; import { getSqlDialectForDriver } from '../../context/connections/dialects.js'; +import { resolveQueryDeadlineMs, queryDeadlineExceededError } from '../../context/connections/query-deadline.js'; import { resolveStringReference } from '../shared/string-reference.js'; import { assertReadOnlySql, limitSqlForExecution } from '../../context/connections/read-only-sql.js'; import { tryConstraintQuery } from '../../context/scan/constraint-discovery.js'; @@ -60,6 +61,7 @@ export interface KtxSnowflakeResolvedConnectionConfig { passphrase?: string; role?: string; maxConnections: number; + deadlineMs: number; } export interface KtxSnowflakeRawColumnMetadata { @@ -181,6 +183,22 @@ function isDeniedError(error: unknown): boolean { return false; } +// Snowflake cancels with code 604 and a "reached its statement ... timeout" +// message once STATEMENT_TIMEOUT_IN_SECONDS elapses. 
+function isSnowflakeTimeoutError(error: unknown): boolean { + if (!error || typeof error !== 'object') { + return false; + } + const code = (error as { code?: unknown }).code; + const message = (error as { message?: unknown }).message; + return ( + code === 604 || + code === '604' || + code === '000604' || + (typeof message === 'string' && /reached its (statement|warehouse) .*timeout/i.test(message)) + ); +} + function normalizeSnowflakeValue(value: unknown, columnType?: string): unknown { if (columnType && DATE_TYPES.some((type) => columnType.toUpperCase().includes(type))) { if (typeof value === 'number') { @@ -282,6 +300,7 @@ export function snowflakeConnectionConfigFromConfig(input: { connectionId: input.connectionId, defaultValue: 4, }), + deadlineMs: resolveQueryDeadlineMs(input.connection), }; const role = stringConfigValue(input.connection, 'role', env); if (role) { @@ -339,13 +358,23 @@ class SnowflakeSdkDriver implements KtxSnowflakeDriver { async query(sql: string, params?: unknown): Promise { const binds = Array.isArray(params) ? toSnowflakeBinds(params) : undefined; + const statementTimeoutSeconds = Math.ceil(this.resolved.deadlineMs / 1000); try { const pool = await this.getPool(); - const result = await pool.use(async (connection: snowflake.Connection) => - this.executeSnowflakeQuery(connection, sql, binds), - ); + const result = await pool.use(async (connection: snowflake.Connection) => { + // Bound the statement server-side; Snowflake cancels and frees the + // warehouse slot when STATEMENT_TIMEOUT_IN_SECONDS is reached. 
+ await this.executeSnowflakeQuery( + connection, + `ALTER SESSION SET STATEMENT_TIMEOUT_IN_SECONDS = ${statementTimeoutSeconds}`, + ); + return this.executeSnowflakeQuery(connection, sql, binds); + }); return { ...result, totalRows: result.rows.length, rowCount: result.rows.length }; } catch (error) { + if (isSnowflakeTimeoutError(error)) { + throw queryDeadlineExceededError(this.resolved.deadlineMs, { cause: error }); + } const message = error instanceof Error ? error.message : String(error); if (/timeout/i.test(message) && /pool|acquire/i.test(message)) { throw new Error( diff --git a/packages/cli/src/connectors/sqlite/connector.ts b/packages/cli/src/connectors/sqlite/connector.ts index c46bc2dd..41970221 100644 --- a/packages/cli/src/connectors/sqlite/connector.ts +++ b/packages/cli/src/connectors/sqlite/connector.ts @@ -3,19 +3,44 @@ import { existsSync, readFileSync, statSync } from 'node:fs'; import { homedir } from 'node:os'; import { isAbsolute, resolve } from 'node:path'; import { fileURLToPath } from 'node:url'; +import { fork, type ChildProcess } from 'node:child_process'; import { getSqlDialectForDriver } from '../../context/connections/dialects.js'; +import { resolveQueryDeadlineMs, queryDeadlineExceededError } from '../../context/connections/query-deadline.js'; import { assertReadOnlySql, limitSqlForExecution } from '../../context/connections/read-only-sql.js'; import { normalizeQueryRows } from '../../context/connections/query-executor.js'; -import { connectorTestFailure, createKtxConnectorCapabilities, type KtxConnectorTestResult, type KtxColumnSampleInput, type KtxColumnSampleResult, type KtxColumnStatsInput, type KtxColumnStatsResult, type KtxQueryResult, type KtxReadOnlyQueryInput, type KtxScanConnector, type KtxScanContext, type KtxScanInput, type KtxSchemaForeignKey, type KtxSchemaSnapshot, type KtxSchemaTable, type KtxTableListEntry, type KtxTableRef, type KtxTableSampleInput, type KtxTableSampleResult } from '../../context/scan/types.js'; 
+import { connectorTestFailure, createKtxConnectorCapabilities, type KtxConnectorTestResult, type KtxColumnSampleInput, type KtxColumnSampleResult, type KtxColumnStatsInput, type KtxColumnStatsResult, type KtxQueryResult, type KtxReadOnlyQueryInput, type KtxScanConnector, type KtxScanContext, type KtxScanInput, type KtxScanWarning, type KtxSchemaForeignKey, type KtxSchemaSnapshot, type KtxSchemaTable, type KtxTableListEntry, type KtxTableRef, type KtxTableSampleInput, type KtxTableSampleResult } from '../../context/scan/types.js'; import { scopedTableNames } from '../../context/scan/table-ref.js'; +import { tryIntrospectObject } from '../../context/scan/object-introspection.js'; export interface KtxSqliteConnectionConfig { driver?: string; path?: string; url?: string; + query_timeout_ms?: number; [key: string]: unknown; } +// In dist, connector.js and read-query-child.js are siblings; under vitest the +// compiled .js is absent and Node strips types from the .ts when forking it. +const readQueryChildUrl = existsSync(fileURLToPath(new URL('./read-query-child.js', import.meta.url))) + ? new URL('./read-query-child.js', import.meta.url) + : new URL('./read-query-child.ts', import.meta.url); + +/** @internal */ +export function forkReadQueryChild(): ChildProcess { + // Empty execArgv so the child is a clean Node process (no inherited vitest / + // inspector flags); advanced serialization preserves BigInt/Buffer in rows. 
+ return fork(readQueryChildUrl, { + execArgv: [], + serialization: 'advanced', + stdio: ['ignore', 'ignore', 'inherit', 'ipc'], + }); +} + +type ReadQueryChildMessage = + | { ok: true; headers: string[]; rows: unknown[]; totalRows: number } + | { ok: false; message: string }; + /** @internal */ export interface SqliteDatabasePathInput { connectionId: string; @@ -25,6 +50,8 @@ export interface SqliteDatabasePathInput { export interface KtxSqliteScanConnectorOptions extends SqliteDatabasePathInput { now?: () => Date; + /** @internal Test seam: spawn the read-query child so tests can observe its lifecycle. */ + spawnReadQueryChild?: () => ChildProcess; } export interface KtxSqliteReadOnlyQueryInput extends KtxReadOnlyQueryInput { @@ -133,6 +160,8 @@ export class KtxSqliteScanConnector implements KtxScanConnector { private readonly connectionId: string; private readonly dbPath: string; private readonly now: () => Date; + private readonly deadlineMs: number; + private readonly spawnReadQueryChild: () => ChildProcess; private readonly dialect = getSqlDialectForDriver('sqlite'); private db: Database.Database | null = null; @@ -140,6 +169,8 @@ export class KtxSqliteScanConnector implements KtxScanConnector { this.connectionId = options.connectionId; this.dbPath = sqliteDatabasePathFromConfig(options); this.now = options.now ?? (() => new Date()); + this.deadlineMs = resolveQueryDeadlineMs(options.connection); + this.spawnReadQueryChild = options.spawnReadQueryChild ?? forkReadQueryChild; this.id = `sqlite:${options.connectionId}`; } @@ -158,17 +189,27 @@ export class KtxSqliteScanConnector implements KtxScanConnector { async introspect(input: KtxScanInput, _ctx: KtxScanContext): Promise { this.assertConnection(input.connectionId); const database = this.database(); - const scopedNames = input.tableScope ? scopedTableNames(input.tableScope, { catalog: null, db: null }) : null; - const scopeClause = scopedNames ? 
`AND name IN (${scopedNames.map(() => '?').join(', ')})` : ''; - const rawTables = - scopedNames && scopedNames.length === 0 - ? [] - : (database - .prepare( - `SELECT name, type FROM sqlite_master WHERE type IN ('table', 'view') AND name NOT LIKE 'sqlite_%' ${scopeClause} ORDER BY name`, - ) - .all(...(scopedNames ?? [])) as SqliteMasterRow[]); - const tables = rawTables.map((table) => this.readTable(database, table)); + const allObjects = database + .prepare( + `SELECT name, type FROM sqlite_master WHERE type IN ('table', 'view') AND name NOT LIKE 'sqlite_%' ORDER BY name`, + ) + .all() as SqliteMasterRow[]; + const scopedNames = input.tableScope + ? new Set(scopedTableNames(input.tableScope, { catalog: null, db: null })) + : null; + const selectedObjects = scopedNames ? allObjects.filter((object) => scopedNames.has(object.name)) : allObjects; + + const tables: KtxSchemaTable[] = []; + const warnings: KtxScanWarning[] = []; + for (const object of selectedObjects) { + const outcome = await tryIntrospectObject({ object: object.name }, () => this.readTable(database, object)); + if (outcome.ok) { + tables.push(outcome.table); + } else { + warnings.push(outcome.warning); + } + } + const fileStats = existsSync(this.dbPath) ? statSync(this.dbPath) : null; return { connectionId: this.connectionId, @@ -180,8 +221,12 @@ export class KtxSqliteScanConnector implements KtxScanConnector { file_size: fileStats ? fileStats.size : 0, table_count: tables.length, total_columns: tables.reduce((sum, table) => sum + table.columns.length, 0), + // Carries the full object inventory so a zero-match enabled_tables scope + // can report which objects were actually available. + ...(scopedNames ? { discovered_object_names: allObjects.map((object) => object.name) } : {}), }, tables, + ...(warnings.length > 0 ? 
{ warnings } : {}), }; } @@ -229,12 +274,81 @@ export class KtxSqliteScanConnector implements KtxScanConnector { return null; } - async executeReadOnly(input: KtxSqliteReadOnlyQueryInput, _ctx: KtxScanContext): Promise { + async executeReadOnly(input: KtxSqliteReadOnlyQueryInput, ctx: KtxScanContext): Promise { this.assertConnection(input.connectionId); - const result = this.query(limitSqlForExecution(input.sql, input.maxRows), input.params); + // Validate and row-limit on the main thread so invalid SQL fails instantly + // without spawning a process and read-only enforcement stays at the boundary. + const sql = limitSqlForExecution(input.sql, input.maxRows); + const result = await this.runReadQueryOffProcess(sql, input.params, ctx.signal); return { ...result, rowCount: result.rows.length }; } + // The LLM-SQL path runs off the event loop in a short-lived child process so a + // pathological scan cannot freeze the MCP server, and the deadline is enforced + // by SIGKILL-ing that process. A synchronous better-sqlite3 scan never yields, + // so a worker-thread terminate cannot interrupt it — only the OS reclaiming the + // whole process frees the CPU. One short-lived process per call; killed on + // completion, deadline, or external abort. 
+ private runReadQueryOffProcess( + sql: string, + params: Record | unknown[] | undefined, + signal: AbortSignal | undefined, + ): Promise> { + const deadlineMs = this.deadlineMs; + const dbPath = this.dbPath; + return new Promise((resolvePromise, rejectPromise) => { + const child = this.spawnReadQueryChild(); + let settled = false; + const onDeadline = () => settle(() => rejectPromise(queryDeadlineExceededError(deadlineMs))); + const timer = setTimeout(onDeadline, deadlineMs); + function settle(finish: () => void): void { + if (settled) { + return; + } + settled = true; + clearTimeout(timer); + signal?.removeEventListener('abort', onDeadline); + if (child.exitCode === null && child.signalCode === null) { + child.kill('SIGKILL'); + } + finish(); + } + child.on('message', (message: ReadQueryChildMessage) => { + if (message.ok) { + settle(() => + resolvePromise({ + headers: message.headers, + rows: normalizeQueryRows(message.rows), + totalRows: message.totalRows, + }), + ); + } else { + settle(() => rejectPromise(new Error(message.message))); + } + }); + child.on('error', (error) => settle(() => rejectPromise(error))); + child.on('exit', (code, processSignal) => { + if (!settled) { + settle(() => + rejectPromise( + new Error(`SQLite read process exited before returning a result (code ${code}, signal ${processSignal}).`), + ), + ); + } + }); + if (signal?.aborted) { + onDeadline(); + return; + } + signal?.addEventListener('abort', onDeadline, { once: true }); + try { + child.send({ dbPath, sql, params }); + } catch (error) { + settle(() => rejectPromise(error instanceof Error ? 
error : new Error(String(error)))); + } + }); + } + async getColumnDistinctValues( table: KtxTableRef, columnName: string, @@ -310,16 +424,7 @@ export class KtxSqliteScanConnector implements KtxScanConnector { const foreignKeys = database .prepare(`PRAGMA foreign_key_list(${this.dialect.quoteIdentifier(table.name)})`) .all() as SqliteForeignKeyRow[]; - const estimatedRows = - table.type === 'table' - ? Number( - ( - database - .prepare(`SELECT COUNT(*) AS count FROM ${this.dialect.quoteIdentifier(table.name)}`) - .get() as { count: unknown } - ).count, - ) - : null; + const estimatedRows = table.type === 'table' ? this.readRowCount(database, table.name) : null; return { catalog: null, db: null, @@ -340,6 +445,19 @@ export class KtxSqliteScanConnector implements KtxScanConnector { }; } + // A row-count read is profiling, not structure: a failure here leaves the + // object's structure intact rather than skipping the whole object. + private readRowCount(database: Database.Database, name: string): number | null { + try { + const row = database.prepare(`SELECT COUNT(*) AS count FROM ${this.dialect.quoteIdentifier(name)}`).get() as { + count: unknown; + }; + return Number(row.count); + } catch { + return null; + } + } + private mapForeignKeys(rows: SqliteForeignKeyRow[]): KtxSchemaForeignKey[] { return rows .sort((a, b) => a.id - b.id || a.seq - b.seq) diff --git a/packages/cli/src/connectors/sqlite/read-query-child.ts b/packages/cli/src/connectors/sqlite/read-query-child.ts new file mode 100644 index 00000000..ae876e38 --- /dev/null +++ b/packages/cli/src/connectors/sqlite/read-query-child.ts @@ -0,0 +1,40 @@ +import Database from 'better-sqlite3'; + +// Runs on a forked child process (no bundler, no test transform), so it imports +// only better-sqlite3 and node builtins. The SQL is already read-only-validated +// and row-limited by the parent; this process just executes it and posts the +// structured-cloneable raw rows back over IPC. 
Its only cancellation mechanism +// is the parent sending SIGKILL: a synchronous better-sqlite3 scan never yields, +// so neither a worker-thread terminate nor any in-process timer can interrupt +// it — only the OS reclaiming the whole process can. + +interface ReadQueryRequest { + dbPath: string; + sql: string; + params?: Record | unknown[]; +} + +type ReadQueryResponse = + | { ok: true; headers: string[]; rows: unknown[]; totalRows: number } + | { ok: false; message: string }; + +process.once('message', (request: ReadQueryRequest) => { + let db: Database.Database | undefined; + let response: ReadQueryResponse; + try { + db = new Database(request.dbPath, { readonly: true, fileMustExist: true }); + const statement = db.prepare(request.sql); + const rows = (request.params ? statement.all(request.params) : statement.all()) as unknown[]; + response = { + ok: true, + headers: statement.columns().map((column) => column.name), + rows, + totalRows: rows.length, + }; + } catch (error) { + response = { ok: false, message: error instanceof Error ? 
error.message : String(error) }; + } finally { + db?.close(); + } + process.send?.(response, () => process.exit(0)); +}); diff --git a/packages/cli/src/connectors/sqlserver/connector.ts b/packages/cli/src/connectors/sqlserver/connector.ts index 9f101578..7976ecb0 100644 --- a/packages/cli/src/connectors/sqlserver/connector.ts +++ b/packages/cli/src/connectors/sqlserver/connector.ts @@ -1,5 +1,6 @@ import { assertReadOnlySql, hoistLeadingCte, stripTrailingSqlNoise } from '../../context/connections/read-only-sql.js'; import { getSqlDialectForDriver } from '../../context/connections/dialects.js'; +import { resolveQueryDeadlineMs, queryDeadlineExceededError } from '../../context/connections/query-deadline.js'; import { tryConstraintQuery } from '../../context/scan/constraint-discovery.js'; import { scopedTableNames } from '../../context/scan/table-ref.js'; import { @@ -50,6 +51,8 @@ export interface KtxSqlServerPoolConfig { database: string; user: string; password?: string; + // ms; on expiry mssql sends a TDS attention that cancels the query server-side. + requestTimeout: number; options: { encrypt: true; trustServerCertificate: boolean }; pool: { max: number; min: number; idleTimeoutMillis: number }; } @@ -269,6 +272,11 @@ function isDeniedError(error: unknown): boolean { return number === 229 || number === 230 || number === 297; } +// mssql raises a RequestError with code 'ETIMEOUT' once requestTimeout elapses. 
+function isSqlServerTimeoutError(error: unknown): boolean { + return Boolean(error) && typeof error === 'object' && (error as { code?: unknown }).code === 'ETIMEOUT'; +} + function limitSqlForSqlServerExecution(sqlText: string, maxRows: number | undefined): string { const trimmed = stripTrailingSqlNoise(assertReadOnlySql(sqlText)); if (!maxRows) { @@ -328,6 +336,7 @@ export function sqlServerConnectionPoolConfigFromConfig(input: { database, user, password: stringConfigValue(merged, 'password', env), + requestTimeout: resolveQueryDeadlineMs(merged), options: { encrypt: true, trustServerCertificate: merged.trustServerCertificate ?? true }, pool: { max: maxConnections, min: 0, idleTimeoutMillis: 30000 }, }; @@ -353,6 +362,7 @@ export class KtxSqlServerScanConnector implements KtxScanConnector { private readonly poolFactory: KtxSqlServerPoolFactory; private readonly endpointResolver?: KtxSqlServerEndpointResolver; private readonly now: () => Date; + private readonly deadlineMs: number; private readonly dialect = getSqlDialectForDriver('sqlserver'); private pool: KtxSqlServerPool | null = null; private resolvedEndpoint: KtxSqlServerResolvedEndpoint | null = null; @@ -370,6 +380,7 @@ export class KtxSqlServerScanConnector implements KtxScanConnector { this.poolFactory = options.poolFactory ?? new DefaultSqlServerPoolFactory(); this.endpointResolver = options.endpointResolver; this.now = options.now ?? 
(() => new Date()); + this.deadlineMs = resolveQueryDeadlineMs(this.connection); this.id = `sqlserver:${options.connectionId}`; } @@ -804,7 +815,15 @@ export class KtxSqlServerScanConnector implements KtxScanConnector { request.input(key, value); } } - const result = await request.query(assertReadOnlySql(query)); + let result: KtxSqlServerQueryResult; + try { + result = await request.query(assertReadOnlySql(query)); + } catch (error) { + if (isSqlServerTimeoutError(error)) { + throw queryDeadlineExceededError(this.deadlineMs, { cause: error }); + } + throw error; + } const recordset = result.recordset ?? []; const columnMetadata = recordset.columns ?? {}; const metadataHeaders = Object.keys(columnMetadata); diff --git a/packages/cli/src/context-build-view.ts b/packages/cli/src/context-build-view.ts index 042a517a..a8def9c2 100644 --- a/packages/cli/src/context-build-view.ts +++ b/packages/cli/src/context-build-view.ts @@ -98,6 +98,7 @@ export interface ContextBuildArgs { queryHistory?: Extract['queryHistory']; queryHistoryWindowDays?: number; scanMode?: Extract['scanMode']; + stages?: Extract['stages']; detectRelationships?: boolean; cliVersion?: string; runtimeInstallPolicy?: KtxManagedPythonInstallPolicy; @@ -990,6 +991,7 @@ export async function runContextBuild( ...(args.queryHistory ? { queryHistory: args.queryHistory } : {}), ...(args.queryHistoryWindowDays !== undefined ? { queryHistoryWindowDays: args.queryHistoryWindowDays } : {}), ...(args.scanMode ? { scanMode: args.scanMode } : {}), + ...(args.stages ? { stages: args.stages } : {}), ...(args.detectRelationships !== undefined ? { detectRelationships: args.detectRelationships } : {}), ...(args.cliVersion ? { cliVersion: args.cliVersion } : {}), ...(args.runtimeInstallPolicy ? 
{ runtimeInstallPolicy: args.runtimeInstallPolicy } : {}), diff --git a/packages/cli/src/context/connections/bigquery-identifiers.ts b/packages/cli/src/context/connections/bigquery-identifiers.ts index f2aa29f9..0abc904a 100644 --- a/packages/cli/src/context/connections/bigquery-identifiers.ts +++ b/packages/cli/src/context/connections/bigquery-identifiers.ts @@ -1,4 +1,5 @@ const BIGQUERY_PROJECT_ID_PATTERN = /^[A-Za-z0-9_-]+$/; +const BIGQUERY_DATASET_ID_PATTERN = /^[A-Za-z0-9_]+$/; const BIGQUERY_REGION_PATTERN = /^[a-z0-9-]+$/; export function normalizeBigQueryProjectId(value: string, context = 'historic-SQL ingest'): string { @@ -8,6 +9,13 @@ export function normalizeBigQueryProjectId(value: string, context = 'historic-SQ return value; } +export function normalizeBigQueryDatasetId(value: string, context = 'historic-SQL ingest'): string { + if (!BIGQUERY_DATASET_ID_PATTERN.test(value)) { + throw new Error(`Invalid BigQuery dataset id for ${context}: ${value}`); + } + return value; +} + export function normalizeBigQueryRegion(value: string, context = 'historic-SQL ingest'): string { const normalized = value.trim().toLowerCase().replace(/^region-/, ''); if (!BIGQUERY_REGION_PATTERN.test(normalized)) { diff --git a/packages/cli/src/context/connections/configured-connections.ts b/packages/cli/src/context/connections/configured-connections.ts new file mode 100644 index 00000000..96d6087d --- /dev/null +++ b/packages/cli/src/context/connections/configured-connections.ts @@ -0,0 +1,24 @@ +import type { KtxProjectConnectionConfig } from '../project/config.js'; + +function listConfiguredConnectionIds(connections: Record): string[] { + return Object.keys(connections).sort(); +} + +/** + * Validate a connection id supplied as an explicit command/tool argument against + * the canonical `ktx.yaml` connections map. Returns the id when configured; + * otherwise throws an error that lists the configured ids so the caller can fix + * the typo. 
Use for explicit arguments only — persisted page frontmatter that + * references a since-removed connection must warn, not fail. + */ +export function assertConfiguredConnectionId( + connections: Record, + connectionId: string, +): string { + if (Object.hasOwn(connections, connectionId)) { + return connectionId; + } + const ids = listConfiguredConnectionIds(connections); + const configured = ids.length > 0 ? ids.join(', ') : '(none configured)'; + throw new Error(`Unknown connection "${connectionId}". Configured connections: ${configured}.`); +} diff --git a/packages/cli/src/context/connections/query-deadline.ts b/packages/cli/src/context/connections/query-deadline.ts new file mode 100644 index 00000000..610fba53 --- /dev/null +++ b/packages/cli/src/context/connections/query-deadline.ts @@ -0,0 +1,45 @@ +import { KtxQueryError } from '../../errors.js'; + +/** + * Canonical default bound on read-query execution time. Generous headroom over + * any indexed aggregate or normal profiling probe; a pathological nested-loop + * scan blows past it immediately. Overridable per-connection via + * `query_timeout_ms`. Production reads it through {@link resolveQueryDeadlineMs}; + * exported for the resolver's own unit tests. + * @internal + */ +export const DEFAULT_QUERY_TIMEOUT_MS = 30_000; + +interface QueryTimeoutConnectionConfig { + query_timeout_ms?: unknown; + [key: string]: unknown; +} + +/** + * Single source of truth for the read-query deadline: the per-connection + * `query_timeout_ms` override (milliseconds) when present, else the default. + * Every connector resolves through here so the default and override precedence + * live in exactly one place. A malformed override (zero, negative, non-integer, + * non-number) is a config error — surfaced here even though `ktx.yaml` + * validation also rejects it, so programmatically-built connectors cannot + * silently run unbounded. 
+ */ +export function resolveQueryDeadlineMs(connection: QueryTimeoutConnectionConfig | undefined): number { + const raw = connection?.query_timeout_ms; + if (raw === undefined || raw === null) { + return DEFAULT_QUERY_TIMEOUT_MS; + } + if (typeof raw !== 'number' || !Number.isInteger(raw) || raw <= 0) { + throw new Error(`query_timeout_ms must be a positive integer in milliseconds, received ${JSON.stringify(raw)}.`); + } + return raw; +} + +/** + * The canonical, driver-independent timeout error an agent sees regardless of + * which connector enforced the deadline. Reads in whole seconds. Remote + * connectors pass the driver's own timeout error as `cause`. + */ +export function queryDeadlineExceededError(deadlineMs: number, options?: ErrorOptions): KtxQueryError { + return new KtxQueryError(`query exceeded ${Math.round(deadlineMs / 1000)}s`, options); +} diff --git a/packages/cli/src/context/ingest/adapters/live-database/daemon-introspection.ts b/packages/cli/src/context/ingest/adapters/live-database/daemon-introspection.ts index 03e5953d..1d52a664 100644 --- a/packages/cli/src/context/ingest/adapters/live-database/daemon-introspection.ts +++ b/packages/cli/src/context/ingest/adapters/live-database/daemon-introspection.ts @@ -3,8 +3,9 @@ import { request as httpRequest } from 'node:http'; import { request as httpsRequest } from 'node:https'; import { URL } from 'node:url'; import type { KtxProjectConnectionConfig } from '../../../project/config.js'; +import { isKtxScanWarningCode } from '../../../scan/local-structural-artifacts.js'; import { tableRefFromKey } from '../../../scan/table-ref.js'; -import type { KtxSchemaColumn, KtxSchemaForeignKey, KtxSchemaSnapshot, KtxSchemaTable } from '../../../scan/types.js'; +import type { KtxScanWarning, KtxSchemaColumn, KtxSchemaForeignKey, KtxSchemaSnapshot, KtxSchemaTable } from '../../../scan/types.js'; import { inferKtxDimensionType, normalizeKtxNativeType } from '../../../scan/type-normalization.js'; import type { 
LiveDatabaseIntrospectionOptions, LiveDatabaseIntrospectionPort } from './types.js'; @@ -206,10 +207,32 @@ function mapTable(raw: Record): KtxSchemaTable { }; } +function mapWarning(raw: Record): KtxScanWarning | null { + const code = optionalString(raw.code); + // Drop codes Node cannot render, keeping the daemon and Node warning catalogs + // in parity rather than surfacing an unknown code downstream. + if (!code || !isKtxScanWarningCode(code)) return null; + const table = optionalString(raw.table); + const column = optionalString(raw.column); + return { + code, + message: requiredString(raw.message, 'warnings[].message'), + recoverable: raw.recoverable !== false, + ...(table ? { table } : {}), + ...(column ? { column } : {}), + ...(raw.metadata && typeof raw.metadata === 'object' && !Array.isArray(raw.metadata) + ? { metadata: recordValue(raw.metadata) } + : {}), + }; +} + function mapDaemonSnapshot( raw: Record, input: { connectionId: string; extractedAt: string; schemas: string[] }, ): KtxSchemaSnapshot { + const warnings = recordArray(raw.warnings) + .map(mapWarning) + .filter((warning): warning is KtxScanWarning => warning !== null); return { connectionId: requiredString(raw.connection_id, 'connection_id') || input.connectionId, driver: 'postgres', @@ -217,6 +240,7 @@ function mapDaemonSnapshot( scope: { schemas: input.schemas }, metadata: recordValue(raw.metadata), tables: recordArray(raw.tables).map(mapTable), + ...(warnings.length > 0 ? 
{ warnings } : {}), }; } diff --git a/packages/cli/src/context/ingest/adapters/live-database/fetch-report.ts b/packages/cli/src/context/ingest/adapters/live-database/fetch-report.ts new file mode 100644 index 00000000..88c9b28b --- /dev/null +++ b/packages/cli/src/context/ingest/adapters/live-database/fetch-report.ts @@ -0,0 +1,48 @@ +import { readFile } from 'node:fs/promises'; +import { join } from 'node:path'; +import type { SourceFetchReport } from '../../types.js'; +import { LIVE_DATABASE_WARNINGS_FILE } from './stage.js'; + +const OBJECT_SKIP_CODE = 'object_introspection_failed'; + +interface RawWarning { + code?: unknown; + message?: unknown; + table?: unknown; +} + +/** + * Derives the fetch report from the staged `warnings.json`: objects that failed + * introspection become `skipped` entries so the run report, ingest summary, and + * `ktx status` can surface them. Returns null when nothing was skipped, keeping + * clean ingests free of an empty report. + */ +export async function readLiveDatabaseFetchReport(stagedDir: string): Promise { + let parsed: unknown; + try { + parsed = JSON.parse(await readFile(join(stagedDir, LIVE_DATABASE_WARNINGS_FILE), 'utf8')); + } catch { + return null; + } + const warnings = + parsed && typeof parsed === 'object' && Array.isArray((parsed as { warnings?: unknown }).warnings) + ? ((parsed as { warnings: RawWarning[] }).warnings) + : []; + + const skipped = warnings + .filter((warning) => warning.code === OBJECT_SKIP_CODE) + .map((warning) => ({ + rawPath: '', + entityType: 'database_object', + entityId: typeof warning.table === 'string' ? warning.table : null, + severity: 'warning' as const, + statusCode: null, + message: typeof warning.message === 'string' ? 
warning.message : 'introspection failed', + retryRecommended: false, + })); + + if (skipped.length === 0) { + return null; + } + return { status: 'partial', retryRecommended: false, skipped, warnings: [] }; +} diff --git a/packages/cli/src/context/ingest/adapters/live-database/live-database.adapter.ts b/packages/cli/src/context/ingest/adapters/live-database/live-database.adapter.ts index 68087bc0..89aa81bf 100644 --- a/packages/cli/src/context/ingest/adapters/live-database/live-database.adapter.ts +++ b/packages/cli/src/context/ingest/adapters/live-database/live-database.adapter.ts @@ -1,5 +1,7 @@ -import type { ChunkResult, DiffSet, FetchContext, SourceAdapter } from '../../types.js'; +import type { ChunkResult, DiffSet, FetchContext, SourceAdapter, SourceFetchReport } from '../../types.js'; import { chunkLiveDatabaseStagedDir } from './chunk.js'; +import { readLiveDatabaseFetchReport } from './fetch-report.js'; +import { assertLiveDatabaseScanOutcome } from './scan-outcome.js'; import { detectLiveDatabaseStagedDir, writeLiveDatabaseSnapshot } from './stage.js'; import type { LiveDatabaseSourceAdapterDeps } from './types.js'; @@ -13,14 +15,20 @@ export class LiveDatabaseSourceAdapter implements SourceAdapter { return detectLiveDatabaseStagedDir(stagedDir); } + readFetchReport(stagedDir: string): Promise { + return readLiveDatabaseFetchReport(stagedDir); + } + async fetch(_pullConfig: unknown, stagedDir: string, ctx: FetchContext): Promise { const tableScope = ctx.tableScope; const snapshot = await this.deps.introspection.extractSchema(ctx.connectionId, { tableScope }); - await writeLiveDatabaseSnapshot(stagedDir, { + const finalized = { ...snapshot, connectionId: ctx.connectionId, extractedAt: snapshot.extractedAt ?? (this.deps.now ?? 
(() => new Date()))().toISOString(), - }); + }; + assertLiveDatabaseScanOutcome({ connectionId: ctx.connectionId, scope: tableScope, snapshot: finalized }); + await writeLiveDatabaseSnapshot(stagedDir, finalized); } chunk(stagedDir: string, diffSet?: DiffSet): Promise { diff --git a/packages/cli/src/context/ingest/adapters/live-database/manifest.ts b/packages/cli/src/context/ingest/adapters/live-database/manifest.ts index 2e864528..c5ae33fb 100644 --- a/packages/cli/src/context/ingest/adapters/live-database/manifest.ts +++ b/packages/cli/src/context/ingest/adapters/live-database/manifest.ts @@ -162,7 +162,8 @@ function getShardKey(connectionType: string, catalog: string | null, db: string } } -function buildTableRef(name: string, catalog: string | null, db: string | null): string { +/** @internal */ +export function buildTableRef(name: string, catalog: string | null, db: string | null): string { const parts: string[] = []; if (catalog) { parts.push(catalog); @@ -273,7 +274,10 @@ export function buildLiveDatabaseManifestShards( for (const table of input.tables) { const shardKey = getShardKey(input.connectionType, table.catalog, table.db); const shard = shards.get(shardKey) ?? { tables: {} }; - const existingDescriptions = input.existingDescriptions?.get(table.name); + // Existing descriptions/usage are keyed by the fully-qualified ref so two + // same-named tables in different schemas never share an entry. 
+ const fullRef = buildTableRef(table.name, table.catalog, table.db); + const existingDescriptions = input.existingDescriptions?.get(fullRef); const columns: LiveDatabaseManifestColumn[] = table.columns.map((column) => { const manifestColumn: LiveDatabaseManifestColumn = { @@ -297,7 +301,7 @@ export function buildLiveDatabaseManifestShards( }); const entry: LiveDatabaseManifestTableEntry = { - table: buildTableRef(table.name, table.catalog, table.db), + table: fullRef, columns, }; @@ -306,7 +310,7 @@ export function buildLiveDatabaseManifestShards( entry.descriptions = tableDescriptions; } - const usage = mergeUsagePreservingExternal(input.existingUsage?.get(table.name), table.usage); + const usage = mergeUsagePreservingExternal(input.existingUsage?.get(fullRef), table.usage); if (usage) { entry.usage = usage; } diff --git a/packages/cli/src/context/ingest/adapters/live-database/scan-outcome.ts b/packages/cli/src/context/ingest/adapters/live-database/scan-outcome.ts new file mode 100644 index 00000000..33e488f4 --- /dev/null +++ b/packages/cli/src/context/ingest/adapters/live-database/scan-outcome.ts @@ -0,0 +1,55 @@ +import { KtxExpectedError } from '../../../../errors.js'; +import { tableRefFromKey, type KtxTableRefKey } from '../../../scan/table-ref.js'; +import type { KtxSchemaSnapshot } from '../../../scan/types.js'; + +const OBJECT_SKIP_CODE = 'object_introspection_failed'; + +function formatScopeEntry(key: KtxTableRefKey): string { + const ref = tableRefFromKey(key); + return [ref.catalog, ref.db, ref.name].filter((part): part is string => Boolean(part)).join('.'); +} + +function discoveredObjectNames(snapshot: KtxSchemaSnapshot): string[] { + const raw = (snapshot.metadata as Record).discovered_object_names; + return Array.isArray(raw) ? raw.filter((value): value is string => typeof value === 'string') : []; +} + +/** + * Enforces the partial-vs-total outcome rules for a live-database snapshot, + * uniformly for every connector. 
Outcomes follow from object counts, not a + * mode: a connection with at least one ingested object succeeds (any broken + * objects ride along as warnings); a connection where every introspected object + * failed, or a non-empty enabled_tables scope that matched nothing, raises a + * clear connection error instead of staging an empty layer that would later + * surface as the generic "did not recognize" message. A legitimately empty + * database (no scope, no objects) succeeds with an empty layer. + */ +export function assertLiveDatabaseScanOutcome(input: { + connectionId: string; + scope: ReadonlySet | undefined; + snapshot: KtxSchemaSnapshot; +}): void { + const { connectionId, scope, snapshot } = input; + if (snapshot.tables.length > 0) { + return; + } + + const skipped = (snapshot.warnings ?? []).filter((warning) => warning.code === OBJECT_SKIP_CODE); + if (skipped.length > 0) { + const detail = skipped.map((warning) => `${warning.table ?? 'object'}: ${warning.message}`).join('; '); + throw new KtxExpectedError( + `Connection "${connectionId}" produced no semantic layer: all ${skipped.length} introspected ` + + `${skipped.length === 1 ? 'object' : 'objects'} failed (${detail}).`, + ); + } + + if (scope && scope.size > 0) { + const requested = [...scope].map(formatScopeEntry).sort(); + const available = discoveredObjectNames(snapshot); + const availableClause = available.length > 0 ? 
` Available objects: ${available.join(', ')}.` : ''; + throw new KtxExpectedError( + `enabled_tables for connection "${connectionId}" matched no objects ` + + `(looked for: ${requested.join(', ')}).${availableClause}`, + ); + } +} diff --git a/packages/cli/src/context/ingest/adapters/live-database/stage.ts b/packages/cli/src/context/ingest/adapters/live-database/stage.ts index 5dd21afd..c970cdc5 100644 --- a/packages/cli/src/context/ingest/adapters/live-database/stage.ts +++ b/packages/cli/src/context/ingest/adapters/live-database/stage.ts @@ -136,13 +136,13 @@ export async function readLiveDatabaseTableFiles(stagedDir: string): Promise { + // A valid live-database staging is identified by its connection.json marker. + // An empty table set is a legitimate outcome (an empty database), so the + // presence of table files is not required — the total-vs-partial decision is + // made earlier by assertLiveDatabaseScanOutcome, before staging. try { const meta = JSON.parse(await readFile(join(stagedDir, LIVE_DATABASE_META_FILE), 'utf8')) as unknown; - if (!meta || typeof meta !== 'object' || Array.isArray(meta)) { - return false; - } - const files = await readLiveDatabaseTableFiles(stagedDir); - return files.length > 0; + return Boolean(meta) && typeof meta === 'object' && !Array.isArray(meta); } catch { return false; } diff --git a/packages/cli/src/context/ingest/adapters/metabase/types.ts b/packages/cli/src/context/ingest/adapters/metabase/types.ts index da84de0c..eea85654 100644 --- a/packages/cli/src/context/ingest/adapters/metabase/types.ts +++ b/packages/cli/src/context/ingest/adapters/metabase/types.ts @@ -3,7 +3,7 @@ import { z } from 'zod'; const metabaseSyncModeSchema = z.enum(['ALL', 'ONLY', 'EXCEPT']); export type MetabaseSyncMode = z.infer; -const metabaseLocalConnectionIdSchema = z.string().regex(/^[a-zA-Z0-9][a-zA-Z0-9_-]*$/); +const metabaseLocalConnectionIdSchema = z.string().regex(/^[a-zA-Z0-9_][a-zA-Z0-9_-]*$/); /** * The lean config the adapter needs 
at `fetch()` time. Lives in the BullMQ payload's diff --git a/packages/cli/src/context/ingest/ingest-bundle.runner.ts b/packages/cli/src/context/ingest/ingest-bundle.runner.ts index 45953adf..e054fce8 100644 --- a/packages/cli/src/context/ingest/ingest-bundle.runner.ts +++ b/packages/cli/src/context/ingest/ingest-bundle.runner.ts @@ -1081,6 +1081,7 @@ export class IngestBundleRunner { skillsPrompt: input.skillsPrompt, syncId: input.syncId, sourceKey: input.job.sourceKey, + connectionId: input.job.connectionId, canonicalPins: input.canonicalPins, }); diff --git a/packages/cli/src/context/ingest/local-bundle-runtime.ts b/packages/cli/src/context/ingest/local-bundle-runtime.ts index 46847646..69f0baa5 100644 --- a/packages/cli/src/context/ingest/local-bundle-runtime.ts +++ b/packages/cli/src/context/ingest/local-bundle-runtime.ts @@ -478,11 +478,11 @@ function parseKnowledgeIndexPath(file: string): { scope: 'GLOBAL' | 'USER'; page const segments = file.split('/'); if (segments.length === 2 && segments[0] === 'global') { const pageKey = segments[1].replace(/\.md$/, ''); - return /^[a-zA-Z0-9][a-zA-Z0-9_-]*$/.test(pageKey) ? { scope: 'GLOBAL', pageKey } : null; + return /^[a-zA-Z0-9_][a-zA-Z0-9_-]*$/.test(pageKey) ? { scope: 'GLOBAL', pageKey } : null; } if (segments.length === 3 && segments[0] === 'user') { const pageKey = segments[2].replace(/\.md$/, ''); - return /^[a-zA-Z0-9][a-zA-Z0-9_-]*$/.test(pageKey) ? { scope: 'USER', pageKey } : null; + return /^[a-zA-Z0-9_][a-zA-Z0-9_-]*$/.test(pageKey) ? 
{ scope: 'USER', pageKey } : null; } return null; } diff --git a/packages/cli/src/context/ingest/local-ingest.ts b/packages/cli/src/context/ingest/local-ingest.ts index 5f7f8c5a..2cf76353 100644 --- a/packages/cli/src/context/ingest/local-ingest.ts +++ b/packages/cli/src/context/ingest/local-ingest.ts @@ -104,7 +104,7 @@ class LocalIngestPhase implements IngestJobPhase { } function safeSegment(kind: string, value: string): string { - if (!/^[a-zA-Z0-9][a-zA-Z0-9_-]*$/.test(value)) { + if (!/^[a-zA-Z0-9_][a-zA-Z0-9_-]*$/.test(value)) { throw new Error(`Unsafe ${kind}: ${value}`); } return value; diff --git a/packages/cli/src/context/ingest/local-stage-ingest.ts b/packages/cli/src/context/ingest/local-stage-ingest.ts index f10a4a78..e6a24786 100644 --- a/packages/cli/src/context/ingest/local-stage-ingest.ts +++ b/packages/cli/src/context/ingest/local-stage-ingest.ts @@ -10,7 +10,7 @@ import type { MemoryFlowEventSink, MemoryFlowPlannedWorkUnit } from './memory-fl import { buildSyncId } from './raw-sources-paths.js'; import { SqliteLocalIngestStore } from './sqlite-local-ingest-store.js'; import type { KtxTableRefKey } from '../scan/table-ref.js'; -import type { IngestTrigger, SourceAdapter, WorkUnit } from './types.js'; +import type { IngestTrigger, SourceAdapter, SourceFetchReport, WorkUnit } from './types.js'; type LocalIngestStatus = 'running' | 'done' | 'error'; @@ -46,6 +46,8 @@ export interface LocalIngestRunRecord { workUnits: Array>; evictionDeletedRawPaths: string[]; errors: string[]; + /** Fetch-phase outcome (e.g. objects skipped during introspection). 
*/ + fetch?: SourceFetchReport; } export type LocalIngestReport = LocalIngestRunRecord & { @@ -70,7 +72,7 @@ const LOCAL_AUTHOR = 'ktx'; const LOCAL_AUTHOR_EMAIL = 'ktx@example.com'; function safeSegment(kind: string, value: string): string { - if (!/^[a-zA-Z0-9][a-zA-Z0-9_-]*$/.test(value)) { + if (!/^[a-zA-Z0-9_][a-zA-Z0-9_-]*$/.test(value)) { throw new Error(`Unsafe ${kind}: ${value}`); } return value; @@ -291,6 +293,8 @@ async function runLocalStageOnlyIngestInner(options: RunLocalStageOnlyIngestOpti throw new Error(`Adapter "${adapter.source}" did not recognize ${sourceDir ?? 'fetched source output'}`); } + const fetchReport = adapter.readFetchReport ? await adapter.readFetchReport(stagedDir) : null; + const relativeFiles = await walkFiles(stagedDir); options.memoryFlow?.update({ sourceDir }); options.memoryFlow?.emit({ @@ -405,6 +409,7 @@ async function runLocalStageOnlyIngestInner(options: RunLocalStageOnlyIngestOpti })), evictionDeletedRawPaths: chunkResult.eviction?.deletedRawPaths ?? [], errors: [], + ...(fetchReport ? { fetch: fetchReport } : {}), }; if (!options.dryRun) { diff --git a/packages/cli/src/context/ingest/stages/build-wu-context.ts b/packages/cli/src/context/ingest/stages/build-wu-context.ts index f8fb4af4..81c21b63 100644 --- a/packages/cli/src/context/ingest/stages/build-wu-context.ts +++ b/packages/cli/src/context/ingest/stages/build-wu-context.ts @@ -26,14 +26,16 @@ export function buildWuSystemPrompt(params: { skillsPrompt: string; syncId: string; sourceKey: string; + connectionId?: string; canonicalPins?: CanonicalPin[]; }): string { + const connectionLine = params.connectionId ? `\nconnectionId: ${params.connectionId}` : ''; const parts = [ params.baseFraming.trimEnd(), VERIFICATION_LEDGER_PROMPT, params.skillsPrompt.trimEnd(), buildCanonicalPinsPromptBlock(params.canonicalPins ?? 
[]), - `\n\nsyncId: ${params.syncId}\nsource: ${params.sourceKey}\n`, + `\n\nsyncId: ${params.syncId}\nsource: ${params.sourceKey}${connectionLine}\n`, ]; return parts.filter(Boolean).join('\n'); } diff --git a/packages/cli/src/context/ingest/tools/warehouse-verification/discover-data.tool.ts b/packages/cli/src/context/ingest/tools/warehouse-verification/discover-data.tool.ts index 6c3380bc..7c671f19 100644 --- a/packages/cli/src/context/ingest/tools/warehouse-verification/discover-data.tool.ts +++ b/packages/cli/src/context/ingest/tools/warehouse-verification/discover-data.tool.ts @@ -4,7 +4,7 @@ import { BaseTool, type ToolContext, type ToolOutput } from '../../../../context const discoverDataInputSchema = z.object({ query: z.string().optional(), - connectionId: z.string().regex(/^[a-zA-Z0-9][a-zA-Z0-9_-]*$/).optional(), + connectionId: z.string().regex(/^[a-zA-Z0-9_][a-zA-Z0-9_-]*$/).optional(), limit: z.number().int().positive().max(50).optional().default(10), sourceName: z.string().optional(), }).strict(); diff --git a/packages/cli/src/context/ingest/tools/warehouse-verification/entity-details.tool.ts b/packages/cli/src/context/ingest/tools/warehouse-verification/entity-details.tool.ts index 45ecba2b..60037fcd 100644 --- a/packages/cli/src/context/ingest/tools/warehouse-verification/entity-details.tool.ts +++ b/packages/cli/src/context/ingest/tools/warehouse-verification/entity-details.tool.ts @@ -14,7 +14,7 @@ const targetSchema = z.union([ ]); const entityDetailsInputSchema = z.object({ - connectionId: z.string().regex(/^[a-zA-Z0-9][a-zA-Z0-9_-]*$/), + connectionId: z.string().regex(/^[a-zA-Z0-9_][a-zA-Z0-9_-]*$/), targets: z.array(targetSchema).min(1).max(50), }).strict(); diff --git a/packages/cli/src/context/ingest/tools/warehouse-verification/sql-execution.tool.ts b/packages/cli/src/context/ingest/tools/warehouse-verification/sql-execution.tool.ts index 9122d1e6..96948406 100644 --- 
a/packages/cli/src/context/ingest/tools/warehouse-verification/sql-execution.tool.ts +++ b/packages/cli/src/context/ingest/tools/warehouse-verification/sql-execution.tool.ts @@ -6,7 +6,7 @@ import type { SqlAnalysisPort } from '../../../../context/sql-analysis/ports.js' import { BaseTool, type ToolContext, type ToolOutput } from '../../../../context/tools/base-tool.js'; const sqlExecutionInputSchema = z.object({ - connectionId: z.string().regex(/^[a-zA-Z0-9][a-zA-Z0-9_-]*$/), + connectionId: z.string().regex(/^[a-zA-Z0-9_][a-zA-Z0-9_-]*$/), sql: z.string().min(1), rowLimit: z.number().int().positive().max(1000).optional().default(100), }).strict(); diff --git a/packages/cli/src/context/llm/ai-sdk-runtime.ts b/packages/cli/src/context/llm/ai-sdk-runtime.ts index a6776f49..7787279d 100644 --- a/packages/cli/src/context/llm/ai-sdk-runtime.ts +++ b/packages/cli/src/context/llm/ai-sdk-runtime.ts @@ -172,6 +172,12 @@ export class AiSdkKtxLlmRuntime implements KtxLlmRuntimePort { this.logger = deps.logger ?? noopLogger; } + // HTTP backend: abortSignal cancels the underlying fetch natively, so there is + // no SDK-owned child to tree-kill. 
+ subprocessForkSpec(): null { + return null; + } + private async generateTextWithRateLimitRetry( provider: RateLimitProvider, abortSignal: AbortSignal | undefined, diff --git a/packages/cli/src/context/llm/claude-code-runtime.ts b/packages/cli/src/context/llm/claude-code-runtime.ts index 185fd5b6..fc80c1c7 100644 --- a/packages/cli/src/context/llm/claude-code-runtime.ts +++ b/packages/cli/src/context/llm/claude-code-runtime.ts @@ -6,6 +6,7 @@ import { type SDKResultMessage, } from '@anthropic-ai/claude-agent-sdk'; import { z } from 'zod'; +import type { KtxModelRole } from '../../llm/types.js'; import { createAbortError, isAbortError, throwIfAborted } from '../core/abort.js'; import { createKtxClaudeCodeEnv } from './claude-code-env.js'; import { resolveClaudeCodeModel } from './claude-code-models.js'; @@ -13,6 +14,7 @@ import type { RateLimitGovernor, RateLimitSignal } from './rate-limit-governor.j import { createClaudeSdkTools, mcpToolIds } from './runtime-tools.js'; import type { KtxGenerateObjectInput, + KtxGenerateStructuredJsonInput, KtxGenerateTextInput, KtxLlmRuntimePort, KtxRuntimeToolSet, @@ -20,6 +22,7 @@ import type { RunLoopParams, RunLoopResult, RunLoopStopReason, + SubprocessRuntimeForkSpec, } from './runtime-port.js'; type QueryResult = AsyncIterable & { @@ -389,9 +392,15 @@ export class ClaudeCodeKtxLlmRuntime implements KtxLlmRuntimePort { return result.result; } - async generateObject>( - input: KtxGenerateObjectInput, - ): Promise { + // Structured generation has no tools, so generateObject and + // generateStructuredJson (the kill-boundary child path) share this one query. 
+ private async runStructuredQuery(input: { + role: KtxModelRole; + prompt: string; + system?: string; + jsonSchema: Record; + abortSignal?: AbortSignal; + }): Promise { const options = { ...baseOptions({ projectDir: this.deps.projectDir, @@ -403,19 +412,30 @@ export class ClaudeCodeKtxLlmRuntime implements KtxLlmRuntimePort { // 5 leaves headroom without enabling unbounded loops; the json_schema // constraint still forces the final answer to be the schema. maxTurns: 5, - tools: input.tools, }), - outputFormat: { type: 'json_schema' as const, schema: jsonSchema(input.schema as z.ZodType) }, + outputFormat: { type: 'json_schema' as const, schema: input.jsonSchema }, }; - const startedAt = Date.now(); - const result = await collectResultWithRateLimitRetry({ + return collectResultWithRateLimitRetry({ query: this.runQuery, prompt: [input.system, input.prompt].filter(Boolean).join('\n\n'), options, - allowedToolIds: new Set([...mcpToolIds(input.tools ?? {}), STRUCTURED_OUTPUT_TOOL_NAME]), - expectedMcpServerNames: expectedMcpServerNames(input.tools), + allowedToolIds: new Set([STRUCTURED_OUTPUT_TOOL_NAME]), + expectedMcpServerNames: new Set(), rateLimitGovernor: this.deps.rateLimitGovernor, - abortSignal: input.abortSignal, + ...(input.abortSignal ? { abortSignal: input.abortSignal } : {}), + }); + } + + async generateObject>( + input: KtxGenerateObjectInput, + ): Promise { + const startedAt = Date.now(); + const result = await this.runStructuredQuery({ + role: input.role, + prompt: input.prompt, + ...(input.system !== undefined ? { system: input.system } : {}), + jsonSchema: jsonSchema(input.schema as z.ZodType), + ...(input.abortSignal ? 
{ abortSignal: input.abortSignal } : {}), }); input.onMetrics?.({ totalMs: Date.now() - startedAt, usage: claudeTokenUsage(result) }); const error = resultError(result); @@ -428,6 +448,28 @@ export class ClaudeCodeKtxLlmRuntime implements KtxLlmRuntimePort { return (input.schema as z.ZodType).parse(result.structured_output); } + async generateStructuredJson(input: KtxGenerateStructuredJsonInput): Promise { + const result = await this.runStructuredQuery({ + role: input.role, + prompt: input.prompt, + ...(input.system !== undefined ? { system: input.system } : {}), + jsonSchema: input.jsonSchema, + ...(input.abortSignal ? { abortSignal: input.abortSignal } : {}), + }); + const error = resultError(result); + if (error) { + throw error; + } + if (result.subtype !== 'success') { + throw new Error(`Claude Code query failed (${result.subtype})`); + } + return result.structured_output; + } + + subprocessForkSpec(): SubprocessRuntimeForkSpec { + return { backend: 'claude-code', projectDir: this.deps.projectDir, modelSlots: this.deps.modelSlots }; + } + async runAgentLoop(params: RunLoopParams): Promise { const startedAt = Date.now(); try { diff --git a/packages/cli/src/context/llm/codex-runtime.ts b/packages/cli/src/context/llm/codex-runtime.ts index ce6f609c..a31a188e 100644 --- a/packages/cli/src/context/llm/codex-runtime.ts +++ b/packages/cli/src/context/llm/codex-runtime.ts @@ -9,14 +9,17 @@ import { resolveCodexModel } from './codex-models.js'; import { buildCodexRuntimeConfig } from './codex-runtime-config.js'; import { CodexSdkCliRunner, type CodexSdkRunner } from './codex-sdk-runner.js'; import type { RateLimitGovernor } from './rate-limit-governor.js'; +import type { KtxModelRole } from '../../llm/types.js'; import type { KtxGenerateObjectInput, + KtxGenerateStructuredJsonInput, KtxGenerateTextInput, KtxLlmRuntimePort, KtxRuntimeToolSet, LlmTokenUsage, RunLoopParams, RunLoopResult, + SubprocessRuntimeForkSpec, } from './runtime-port.js'; export interface 
CodexKtxLlmRuntimeDeps { @@ -249,56 +252,78 @@ export class CodexKtxLlmRuntime implements KtxLlmRuntimePort { } } + // Structured generation has no tools, so it skips the MCP server that + // generateText/runAgentLoop need; generateObject and generateStructuredJson + // (the kill-boundary child path) share this one streaming implementation. + private async streamStructuredText(input: { + role: KtxModelRole; + prompt: string; + system?: string; + jsonSchema: Record; + abortSignal?: AbortSignal; + }): Promise<{ text: string; summary: CodexExecEventSummary; startedAt: number }> { + const startedAt = Date.now(); + const model = modelForRole(this.deps.modelSlots, input.role); + const config = buildCodexRuntimeConfig({ model }); + const result = await this.runWithRateLimitRetry( + input.abortSignal, + async () => { + const collected = await collectEvents( + await this.runner.runStreamed({ + projectDir: this.deps.projectDir, + model, + prompt: promptWithSystem(input.system, input.prompt), + configOverrides: config.configOverrides, + env: config.env, + outputSchema: input.jsonSchema, + ...(input.abortSignal ? { signal: input.abortSignal } : {}), + }), + ); + const summary = summarizeCodexExecEvents(collected.events, { startedAt }); + return { collected, summary }; + }, + ({ collected, summary }) => summaryError(summary, collected.streamError), + ); + return { + text: assertSuccessfulText(result.summary, result.collected.streamError), + summary: result.summary, + startedAt, + }; + } + async generateObject>( input: KtxGenerateObjectInput, ): Promise { - const startedAt = Date.now(); - const model = modelForRole(this.deps.modelSlots, input.role); - const mcp = await mcpForTools({ - projectDir: this.deps.projectDir, - toolSet: input.tools, - startMcpServer: this.deps.startMcpServer, + const { text, summary, startedAt } = await this.streamStructuredText({ + role: input.role, + prompt: input.prompt, + ...(input.system !== undefined ? 
{ system: input.system } : {}), + jsonSchema: z.toJSONSchema(input.schema, { target: 'draft-7' }) as Record, + ...(input.abortSignal ? { abortSignal: input.abortSignal } : {}), + }); + input.onMetrics?.(metrics(summary, startedAt)); + return parseStructuredOutput(input.schema, text); + } + + async generateStructuredJson(input: KtxGenerateStructuredJsonInput): Promise { + const { text } = await this.streamStructuredText({ + role: input.role, + prompt: input.prompt, + ...(input.system !== undefined ? { system: input.system } : {}), + jsonSchema: input.jsonSchema, + ...(input.abortSignal ? { abortSignal: input.abortSignal } : {}), }); try { - const config = buildCodexRuntimeConfig({ - model, - ...(mcp - ? { - mcp: { - url: mcp.url, - bearerTokenEnvVar: mcp.bearerTokenEnvVar, - bearerToken: mcp.bearerToken, - toolNames: runtimeToolNames(input.tools), - }, - } - : {}), - }); - const result = await this.runWithRateLimitRetry( - input.abortSignal, - async () => { - const collected = await collectEvents( - await this.runner.runStreamed({ - projectDir: this.deps.projectDir, - model, - prompt: promptWithSystem(input.system, input.prompt), - configOverrides: config.configOverrides, - env: config.env, - outputSchema: z.toJSONSchema(input.schema, { target: 'draft-7' }) as Record, - ...(input.abortSignal ? { signal: input.abortSignal } : {}), - }), - ); - const summary = summarizeCodexExecEvents(collected.events, { startedAt }); - return { collected, summary }; - }, - ({ collected, summary }) => summaryError(summary, collected.streamError), - ); - input.onMetrics?.(metrics(result.summary, startedAt)); - return parseStructuredOutput(input.schema, assertSuccessfulText(result.summary, result.collected.streamError)); - } finally { - await mcp?.close(); + return JSON.parse(text); + } catch (error) { + throw new Error(`Codex structured output is not valid JSON: ${error instanceof Error ? 
error.message : String(error)}`); } } + subprocessForkSpec(): SubprocessRuntimeForkSpec { + return { backend: 'codex', projectDir: this.deps.projectDir, modelSlots: this.deps.modelSlots }; + } + async runAgentLoop(params: RunLoopParams): Promise { const startedAt = Date.now(); const model = modelForRole(this.deps.modelSlots, params.modelRole); diff --git a/packages/cli/src/context/llm/runtime-port.ts b/packages/cli/src/context/llm/runtime-port.ts index c55e3c7a..1f45bbdd 100644 --- a/packages/cli/src/context/llm/runtime-port.ts +++ b/packages/cli/src/context/llm/runtime-port.ts @@ -72,12 +72,38 @@ export interface KtxGenerateObjectInput; + abortSignal?: AbortSignal; +} + +/** Serializable recipe to rebuild a subprocess-backed runtime inside a ktx-owned + * child the parent can tree-kill. Returned by {@link KtxLlmRuntimePort.subprocessForkSpec}. */ +export interface SubprocessRuntimeForkSpec { + backend: 'codex' | 'claude-code'; + projectDir: string; + modelSlots: { default: string } & Partial>; +} + export interface KtxLlmRuntimePort { generateText(input: KtxGenerateTextInput): Promise; generateObject>( input: KtxGenerateObjectInput, ): Promise; runAgentLoop(params: RunLoopParams): Promise; + /** + * Non-null when this runtime drives an SDK-owned child process that ktx cannot + * cancel by abort alone (codex/claude-code spawn a binary the SDK owns and only + * SIGTERM on abort). ktx routes such calls through a tree-killable boundary. + * Null for HTTP backends, whose native fetch abort already settles promptly. 
+ */ + subprocessForkSpec(): SubprocessRuntimeForkSpec | null; } export interface AgentRunnerPort { diff --git a/packages/cli/src/context/llm/subprocess-generate-object-child.ts b/packages/cli/src/context/llm/subprocess-generate-object-child.ts new file mode 100644 index 00000000..dfdea0ba --- /dev/null +++ b/packages/cli/src/context/llm/subprocess-generate-object-child.ts @@ -0,0 +1,39 @@ +import { ClaudeCodeKtxLlmRuntime } from './claude-code-runtime.js'; +import { CodexKtxLlmRuntime } from './codex-runtime.js'; +import type { SubprocessRuntimeForkSpec } from './runtime-port.js'; +import type { SubprocessGenerateObjectRequest, SubprocessGenerateObjectResponse } from './subprocess-generate-object.js'; + +// Forked by the parent as a process-group leader it can SIGKILL as a tree. Hosts +// one structured LLM call for a subprocess-backed runtime (codex/claude-code); +// the SDK spawns the model binary as this process's own child, so a parent +// tree-kill reaps the wedged model too. Credentials flow via inherited env — the +// runtimes re-derive their allowlisted env from process.env — never over IPC. + +function buildRuntime(forkSpec: SubprocessRuntimeForkSpec): CodexKtxLlmRuntime | ClaudeCodeKtxLlmRuntime { + if (forkSpec.backend === 'codex') { + return new CodexKtxLlmRuntime({ projectDir: forkSpec.projectDir, modelSlots: forkSpec.modelSlots }); + } + return new ClaudeCodeKtxLlmRuntime({ projectDir: forkSpec.projectDir, modelSlots: forkSpec.modelSlots }); +} + +// The parent owns this process's lifecycle. If the parent dies its IPC channel +// drops; exit rather than linger as an orphan holding a provider connection. 
+process.once('disconnect', () => process.exit(0)); + +process.once('message', (request: SubprocessGenerateObjectRequest) => { + void (async () => { + let response: SubprocessGenerateObjectResponse; + try { + const output = await buildRuntime(request.forkSpec).generateStructuredJson({ + role: request.role, + prompt: request.prompt, + ...(request.system !== undefined ? { system: request.system } : {}), + jsonSchema: request.jsonSchema, + }); + response = { ok: true, output }; + } catch (error) { + response = { ok: false, message: error instanceof Error ? error.message : String(error) }; + } + process.send?.(response, () => process.exit(0)); + })(); +}); diff --git a/packages/cli/src/context/llm/subprocess-generate-object.ts b/packages/cli/src/context/llm/subprocess-generate-object.ts new file mode 100644 index 00000000..9bf0a3a3 --- /dev/null +++ b/packages/cli/src/context/llm/subprocess-generate-object.ts @@ -0,0 +1,152 @@ +import { fork, spawn, type ChildProcess } from 'node:child_process'; +import { existsSync } from 'node:fs'; +import { fileURLToPath } from 'node:url'; +import type { z } from 'zod'; +import type { KtxModelRole } from '../../llm/types.js'; +import { createAbortError } from '../core/abort.js'; +import type { SubprocessRuntimeForkSpec } from './runtime-port.js'; + +export interface SubprocessGenerateObjectRequest { + forkSpec: SubprocessRuntimeForkSpec; + role: KtxModelRole; + prompt: string; + system?: string; + jsonSchema: Record; +} + +export type SubprocessGenerateObjectResponse = { ok: true; output: unknown } | { ok: false; message: string }; + +// In dist, this file and the child are siblings; under vitest the compiled .js is +// absent and Node strips types from the .ts. The real child imports the codex / +// claude SDKs (which use constructor parameter properties), so it only runs as +// built .js — tests inject a fake child via the spawnChild seam. 
+function childUrl(): URL { + const builtChild = new URL('./subprocess-generate-object-child.js', import.meta.url); + return existsSync(fileURLToPath(builtChild)) ? builtChild : new URL('./subprocess-generate-object-child.ts', import.meta.url); +} + +function forkSubprocessGenerateObjectChild(): ChildProcess { + // detached: the child becomes a process-group leader so the SDK's grandchild + // (the codex/claude binary) inherits its group and a negative-pid SIGKILL reaps + // the whole tree. Empty execArgv keeps it a clean Node process. + return fork(childUrl(), { + execArgv: [], + serialization: 'advanced', + detached: true, + stdio: ['ignore', 'ignore', 'inherit', 'ipc'], + }); +} + +/** A per-table enrichment subprocess that did not return before its deadline. */ +export class KtxSubprocessDeadlineError extends Error { + constructor(public readonly deadlineMs: number) { + super(`enrichment subprocess exceeded ${Math.round(deadlineMs / 1000)}s`); + this.name = 'KtxSubprocessDeadlineError'; + } +} + +// SIGTERM is too gentle for a child wedged on a hung provider socket; the SDK +// grandchild ignores it and survives. Kill the whole tree: the detached process +// group on POSIX, the process tree via taskkill /T on Windows. +function killProcessTree(child: ChildProcess): void { + if (child.pid === undefined) { + return; + } + if (process.platform === 'win32') { + spawn('taskkill', ['/pid', String(child.pid), '/T', '/F'], { stdio: 'ignore' }).on('error', () => undefined); + return; + } + try { + process.kill(-child.pid, 'SIGKILL'); + } catch { + try { + child.kill('SIGKILL'); + } catch { + // Already exited. + } + } +} + +export interface RunGenerateObjectInSubprocessInput> { + forkSpec: SubprocessRuntimeForkSpec; + role: KtxModelRole; + prompt: string; + system?: string; + schema: TSchema; + jsonSchema: Record; + deadlineMs: number; + signal?: AbortSignal; + /** @internal Test seam: spawn the child so tests can observe its lifecycle. 
*/ + spawnChild?: () => ChildProcess; +} + +/** + * Run one structured LLM call for a subprocess-backed runtime behind a boundary + * ktx can hard-kill. On the deadline or an external abort, the whole process + * group/tree is SIGKILLed (reaping the SDK's wedged model child) and the promise + * settles promptly; on success the raw output is validated against the Zod schema. + */ +export function runGenerateObjectInSubprocess>( + input: RunGenerateObjectInSubprocessInput, +): Promise { + return new Promise((resolvePromise, rejectPromise) => { + const child = (input.spawnChild ?? forkSubprocessGenerateObjectChild)(); + let settled = false; + const onDeadline = () => settle(() => rejectPromise(new KtxSubprocessDeadlineError(input.deadlineMs))); + const onAbort = () => settle(() => rejectPromise(createAbortError())); + const timer = setTimeout(onDeadline, input.deadlineMs); + function settle(finish: () => void): void { + if (settled) { + return; + } + settled = true; + clearTimeout(timer); + input.signal?.removeEventListener('abort', onAbort); + if (child.exitCode === null && child.signalCode === null) { + killProcessTree(child); + } + finish(); + } + child.on('message', (message: SubprocessGenerateObjectResponse) => { + if (message.ok) { + let parsed: TOutput; + try { + parsed = input.schema.parse(message.output); + } catch (error) { + settle(() => rejectPromise(error instanceof Error ? 
error : new Error(String(error)))); + return; + } + settle(() => resolvePromise(parsed)); + } else { + settle(() => rejectPromise(new Error(message.message))); + } + }); + child.on('error', (error) => settle(() => rejectPromise(error))); + child.on('exit', (code, processSignal) => { + if (!settled) { + settle(() => + rejectPromise( + new Error(`enrichment subprocess exited before returning a result (code ${code}, signal ${processSignal}).`), + ), + ); + } + }); + if (input.signal?.aborted) { + onAbort(); + return; + } + input.signal?.addEventListener('abort', onAbort, { once: true }); + try { + const request: SubprocessGenerateObjectRequest = { + forkSpec: input.forkSpec, + role: input.role, + prompt: input.prompt, + ...(input.system !== undefined ? { system: input.system } : {}), + jsonSchema: input.jsonSchema, + }; + child.send(request); + } catch (error) { + settle(() => rejectPromise(error instanceof Error ? error : new Error(String(error)))); + } + }); +} diff --git a/packages/cli/src/context/mcp/context-tools.ts b/packages/cli/src/context/mcp/context-tools.ts index 94b889c4..525f9540 100644 --- a/packages/cli/src/context/mcp/context-tools.ts +++ b/packages/cli/src/context/mcp/context-tools.ts @@ -11,6 +11,7 @@ import { } from '../../telemetry/index.js'; import { collectTelemetryRedactionSecrets } from '../../telemetry/redaction-secrets.js'; import { formatErrorDetail, scrubErrorClass } from '../../telemetry/scrubber.js'; +import { mcpSlowToolMs, serializeMcpError, type KtxMcpLogger } from './logger.js'; import type { KtxMcpClientInfo, KtxMcpContextPorts, @@ -29,6 +30,7 @@ export interface RegisterKtxContextToolsDeps { userContext: KtxMcpUserContext; projectDir?: string; io?: KtxCliIo; + logger?: KtxMcpLogger; getClientInfo?: () => KtxMcpClientInfo | undefined; } @@ -50,6 +52,7 @@ const toolAnnotations = { sl_read_source: { title: 'Semantic Layer Read Source', readOnlyHint: true, idempotentHint: true, openWorldHint: false }, sl_query: { title: 'Semantic Layer 
Query', readOnlyHint: true, openWorldHint: false }, sql_execution: { title: 'SQL Execution', readOnlyHint: true, openWorldHint: false }, + sql_dialect_notes: { title: 'SQL Dialect Notes', readOnlyHint: true, idempotentHint: true, openWorldHint: false }, memory_ingest: { title: 'Memory Ingest', destructiveHint: true, openWorldHint: false }, memory_ingest_status: { title: 'Memory Ingest Status', readOnlyHint: true, openWorldHint: false }, } satisfies Record; @@ -60,7 +63,7 @@ const toolDescriptions = { discover_data: 'Search across ktx wiki pages, semantic-layer sources, measures, dimensions, raw tables, and columns. Example: discover_data({ query: "monthly orders by customer", connectionId: "warehouse", kinds: ["sl_source", "table"] }).', wiki_search: - 'Search ktx wiki pages for reusable business context. Example: wiki_search({ query: "revenue recognition", limit: 5 }).', + 'Search ktx wiki pages for reusable business context. Pass connectionId to scope results to one warehouse (unscoped pages plus pages tagged with that connection) when a concept name collides across databases. Example: wiki_search({ query: "revenue recognition", connectionId: "warehouse", limit: 5 }).', wiki_read: 'Read a ktx wiki page by key returned from wiki_search. Example: wiki_read({ key: "global/revenue" }).', entity_details: 'Read table and column metadata from the latest live-database scan snapshot. Example: entity_details({ connectionId: "warehouse", entities: [{ table: { catalog: null, db: "public", name: "orders" }, columns: ["id"] }] }).', @@ -72,6 +75,8 @@ const toolDescriptions = { 'Execute a semantic-layer query and return headers, rows, and total row count, plus correctness notes (e.g. compile-only or fan-out) when relevant. The generated SQL and full query plan are omitted by default; request them with include: ["sql"] and/or include: ["plan"]. 
Example: sl_query({ connectionId: "warehouse", measures: ["orders.order_count"], dimensions: [{ field: "orders.created_at", granularity: "month" }], include: ["sql"] }).', sql_execution: 'Execute one parser-validated read-only SQL query against a configured ktx connection. Example: sql_execution({ connectionId: "warehouse", sql: "select count(*) from public.orders", maxRows: 100 }).', + sql_dialect_notes: + 'Return the SQL syntax conventions for the dialect of a ktx connection: fully-qualified table-name form, identifier quoting and case-folding, date/time functions, top-N / window-filtering idiom, and JSON access. Call this before writing raw sql_execution SQL against a connection so the SQL matches that engine. Example: sql_dialect_notes({ connectionId: "warehouse" }).', memory_ingest: 'Ingest free-form markdown knowledge into durable ktx memory. Use this for business rules, metric definitions, schema gotchas, recurring findings, or explicit user requests to remember something. Example: memory_ingest({ connectionId: "warehouse", content: "ARR is reported in cents in this warehouse." }).', memory_ingest_status: @@ -83,6 +88,11 @@ const connectionListSchema = z.object({}); const knowledgeSearchSchema = z.object({ query: z.string().min(1).describe('Natural-language wiki search query, e.g. "revenue recognition policy".'), limit: z.number().int().min(1).max(50).default(10).describe('Maximum wiki pages to return.'), + connectionId: connectionIdSchema + .optional() + .describe( + 'Scope results to one connection: returns unscoped pages plus pages tagged with this connection. 
Omit to search all pages.', + ), }); const knowledgeReadSchema = z.object({ @@ -203,6 +213,10 @@ const sqlExecutionSchema = z.object({ maxRows: z.number().int().min(1).max(10_000).default(1000).optional().describe('Maximum rows to return.'), }); +const sqlDialectNotesSchema = z.object({ + connectionId: connectionIdSchema.describe('Connection id whose engine dialect conventions to return.'), +}); + const memoryIngestSchema = z.object({ content: z .string() @@ -405,6 +419,12 @@ const sqlExecutionOutputSchema = z.object({ rowCount: z.number(), }); +const sqlDialectNotesOutputSchema = z.object({ + connectionId: z.string(), + dialect: z.string(), + notes: z.string(), +}); + const memoryIngestOutputSchema = z.object({ runId: z.string(), }); @@ -566,6 +586,63 @@ function clientTelemetryFields( }; } +function toolResultIsError(result: unknown): boolean { + return ( + typeof result === 'object' && result !== null && 'isError' in result && (result as { isError?: unknown }).isError === true + ); +} + +/** Tool-agnostic size: byte length of the serialized text content the client reads. 
*/ +function toolResultSize(result: unknown): number { + if (typeof result !== 'object' || result === null || !('content' in result)) { + return 0; + } + const content = (result as { content?: unknown }).content; + if (!Array.isArray(content)) { + return 0; + } + let size = 0; + for (const item of content) { + if (item && typeof item === 'object' && (item as { type?: unknown }).type === 'text') { + const text = (item as { text?: unknown }).text; + if (typeof text === 'string') { + size += Buffer.byteLength(text, 'utf8'); + } + } + } + return size; +} + +function toolResultErrorText(result: unknown): string { + if (typeof result === 'object' && result !== null && 'content' in result) { + const content = (result as { content?: unknown }).content; + if (Array.isArray(content)) { + const text = content + .filter( + (item): item is { type: 'text'; text: string } => + !!item && + typeof item === 'object' && + (item as { type?: unknown }).type === 'text' && + typeof (item as { text?: unknown }).text === 'string', + ) + .map((item) => item.text) + .join('\n'); + if (text.length > 0) { + return text; + } + } + } + return 'Tool returned an error result.'; +} + +interface InstrumentMcpServerDeps { + projectDir?: string; + io?: KtxCliIo; + logger?: KtxMcpLogger; + slowToolMs: number; + getClientInfo?: () => KtxMcpClientInfo | undefined; +} + // Tools registered via registerParsedTool catch their own errors and return an // isError result, so the telemetry layer never sees the thrown Error. 
Recover // the failure message from the result's text content (the same string the agent @@ -588,68 +665,91 @@ function mcpErrorResultDetail(result: unknown): string | undefined { return formatErrorDetail(text); } -function instrumentMcpServer( - server: KtxMcpServerLike, - telemetry: { projectDir?: string; io?: KtxCliIo; getClientInfo?: () => KtxMcpClientInfo | undefined }, -): KtxMcpServerLike { +function instrumentMcpServer(server: KtxMcpServerLike, deps: InstrumentMcpServerDeps): KtxMcpServerLike { return { registerTool(name, config, handler) { server.registerTool(name, config, async (input, context) => { + const callId = randomUUID(); + const callLogger = deps.logger?.child({ + tool: name, + callId, + ...(context?.sessionId ? { sessionId: context.sessionId } : {}), + }); const startedAt = performance.now(); + // Synchronous, before the (possibly blocking) handler: a runaway query that never + // returns still leaves this start line — with its exact params — on disk. + callLogger?.info({ params: input }, 'tool.start'); try { const result = await handler(input, context); - if (telemetry.io && telemetry.projectDir && shouldEmitMcpTelemetry()) { - const isError = - typeof result === 'object' && result !== null && 'isError' in result && result.isError === true; + const durationMs = Math.max(0, performance.now() - startedAt); + const isError = toolResultIsError(result); + if (deps.io && deps.projectDir && shouldEmitMcpTelemetry()) { const errorDetail = isError ? mcpErrorResultDetail(result) : undefined; await emitTelemetryEvent({ name: 'mcp_request_completed', - projectDir: telemetry.projectDir, - io: telemetry.io, + projectDir: deps.projectDir, + io: deps.io, fields: { toolName: name, outcome: isError ? 'error' : 'ok', - durationMs: Math.max(0, performance.now() - startedAt), + durationMs, sampleRate: mcpTelemetrySampleRate(), ...(errorDetail ? 
{ errorDetail } : {}), - ...clientTelemetryFields(telemetry.getClientInfo), + ...clientTelemetryFields(deps.getClientInfo), }, }); } + if (callLogger) { + if (isError) { + callLogger.error( + { durationMs, outcome: 'error', err: serializeMcpError(toolResultErrorText(result)) }, + 'tool.end', + ); + } else { + const fields = { durationMs, outcome: 'ok' as const, resultSize: toolResultSize(result) }; + if (durationMs > deps.slowToolMs) { + callLogger.warn(fields, 'tool.end'); + } else { + callLogger.info(fields, 'tool.end'); + } + } + } return result; } catch (error) { - if (telemetry.io) { + const durationMs = Math.max(0, performance.now() - startedAt); + if (deps.io) { await reportException({ error, context: { source: `mcp:${name}`, handled: true, fatal: false }, - projectDir: telemetry.projectDir, - io: telemetry.io, + projectDir: deps.projectDir, + io: deps.io, redactionSecrets: await collectTelemetryRedactionSecrets({ - projectDir: telemetry.projectDir, + projectDir: deps.projectDir, includeLlm: true, includeEmbeddings: true, env: process.env, }), }); } - if (telemetry.io && telemetry.projectDir && shouldEmitMcpTelemetry()) { + if (deps.io && deps.projectDir && shouldEmitMcpTelemetry()) { const errorClass = scrubErrorClass(error); const errorDetail = formatErrorDetail(error); await emitTelemetryEvent({ name: 'mcp_request_completed', - projectDir: telemetry.projectDir, - io: telemetry.io, + projectDir: deps.projectDir, + io: deps.io, fields: { toolName: name, outcome: 'error', ...(errorClass ? { errorClass } : {}), ...(errorDetail ? 
{ errorDetail } : {}), - durationMs: Math.max(0, performance.now() - startedAt), + durationMs, sampleRate: mcpTelemetrySampleRate(), - ...clientTelemetryFields(telemetry.getClientInfo), + ...clientTelemetryFields(deps.getClientInfo), }, }); } + callLogger?.error({ durationMs, outcome: 'error', err: serializeMcpError(error) }, 'tool.end'); throw error; } }); @@ -663,6 +763,8 @@ export function registerKtxContextTools(deps: RegisterKtxContextToolsDeps): void const server = instrumentMcpServer(deps.server, { projectDir: deps.projectDir, io: deps.io, + logger: deps.logger, + slowToolMs: mcpSlowToolMs(), getClientInfo: deps.getClientInfo, }); @@ -703,6 +805,7 @@ export function registerKtxContextTools(deps: RegisterKtxContextToolsDeps): void userId: userContext.userId, query: input.query, limit: input.limit, + ...(input.connectionId !== undefined ? { connectionId: input.connectionId } : {}), }), ), toolTelemetry, @@ -867,6 +970,24 @@ export function registerKtxContextTools(deps: RegisterKtxContextToolsDeps): void ); } + if (ports.dialectNotes) { + const dialectNotes = ports.dialectNotes; + registerParsedTool( + server, + 'sql_dialect_notes', + { + title: toolAnnotations.sql_dialect_notes.title!, + description: toolDescriptions.sql_dialect_notes, + inputSchema: sqlDialectNotesSchema.shape, + outputSchema: sqlDialectNotesOutputSchema, + annotations: toolAnnotations.sql_dialect_notes, + }, + sqlDialectNotesSchema, + async (input) => jsonToolResult(await dialectNotes.read(input)), + toolTelemetry, + ); + } + if (ports.memoryIngest) { const memoryIngest = ports.memoryIngest; registerParsedTool( diff --git a/packages/cli/src/context/mcp/local-project-ports.ts b/packages/cli/src/context/mcp/local-project-ports.ts index 6348bfa4..fc44e7da 100644 --- a/packages/cli/src/context/mcp/local-project-ports.ts +++ b/packages/cli/src/context/mcp/local-project-ports.ts @@ -1,5 +1,8 @@ import type { KtxSqlQueryExecutorPort } from '../../context/connections/query-executor.js'; import { 
KtxExpectedError, KtxQueryError, isNativeProgrammingFault } from '../../errors.js'; +import { isDatabaseDriver, normalizeConnectionDriver } from '../../connection-drivers.js'; +import { sqlDialectNotes } from '../../context/sql-analysis/dialect-notes.js'; +import type { KtxProjectConnectionConfig } from '../../context/project/config.js'; import { executeProjectReadOnlySql } from '../../context/connections/project-sql-executor.js'; import { FEDERATED_CONNECTION_ID, federatedConnectionListing } from '../../context/connections/federation.js'; import { assertSqlQueryableConnection } from '../../context/connections/dialects.js'; @@ -20,6 +23,7 @@ import { compileLocalSlQuery } from '../../context/sl/local-query.js'; import { createKtxDictionarySearchService } from '../../context/sl/dictionary-search.js'; import { readLocalSlSource } from '../../context/sl/local-sl.js'; import { assertSafeConnectionId } from '../../context/sl/source-files.js'; +import { assertConfiguredConnectionId } from '../../context/connections/configured-connections.js'; import { readLocalKnowledgePage, searchLocalKnowledgePages } from '../wiki/local-knowledge.js'; import type { KtxMcpContextPorts, KtxMcpProgressCallback, KtxSqlExecutionResponse } from './types.js'; @@ -94,6 +98,24 @@ async function executeValidatedReadOnlySql( return response; } +/** @internal Resolves a connection's dialect SQL notes; throws KtxExpectedError for an unknown or non-SQL-warehouse connection. 
*/ +export function resolveDialectNotesForConnection( + connectionId: string, + connection: KtxProjectConnectionConfig | undefined, +): { connectionId: string; dialect: string; notes: string } { + if (!connection) { + throw new KtxExpectedError(`Connection "${connectionId}" is not configured in ktx.yaml`); + } + const driver = normalizeConnectionDriver(connection); + if (!isDatabaseDriver(driver)) { + throw new KtxExpectedError( + `Connection "${connectionId}" uses the "${driver}" context source, not a SQL warehouse; sql_dialect_notes applies only to SQL database connections.`, + ); + } + const dialect = sqlAnalysisDialectForDriver(driver); + return { connectionId, dialect, notes: sqlDialectNotes(dialect) }; +} + export function createLocalProjectMcpContextPorts( project: KtxLocalProject, options: CreateLocalProjectMcpContextPortsOptions, @@ -121,11 +143,16 @@ export function createLocalProjectMcpContextPorts( }, knowledge: { async search(input) { + const connectionId = + input.connectionId === undefined + ? undefined + : assertConfiguredConnectionId(project.config.connections, input.connectionId); const results = await searchLocalKnowledgePages(project, { query: input.query, userId: input.userId, limit: input.limit, embeddingService, + ...(connectionId !== undefined ? 
{ connectionId } : {}), }); return { results: results.slice(0, input.limit).map((result) => ({ @@ -196,6 +223,12 @@ export function createLocalProjectMcpContextPorts( return createKtxDiscoverDataService(project, { userId: 'local', embeddingService }).search(input); }, }, + dialectNotes: { + async read(input) { + const connectionId = assertSafeConnectionId(input.connectionId); + return resolveDialectNotesForConnection(connectionId, project.config.connections[connectionId]); + }, + }, }; if (options.sqlAnalysis && options.localScan?.createConnector) { diff --git a/packages/cli/src/context/mcp/logger.ts b/packages/cli/src/context/mcp/logger.ts new file mode 100644 index 00000000..4d5b088c --- /dev/null +++ b/packages/cli/src/context/mcp/logger.ts @@ -0,0 +1,58 @@ +import { Writable } from 'node:stream'; +import pino, { type DestinationStream, type Logger } from 'pino'; +import PinoPretty from 'pino-pretty'; +import type { KtxCliIo } from '../../cli-runtime.js'; + +export type KtxMcpLogger = Logger; + +const LOG_LEVELS = new Set(['trace', 'debug', 'info', 'warn', 'error', 'fatal', 'silent']); + +const DEFAULT_LEVEL = 'info'; +const DEFAULT_SLOW_TOOL_MS = 10_000; + +/** @internal */ +export function mcpLogLevel(env: NodeJS.ProcessEnv = process.env): string { + const raw = env.KTX_MCP_LOG_LEVEL?.trim().toLowerCase(); + return raw && LOG_LEVELS.has(raw) ? raw : DEFAULT_LEVEL; +} + +/** @internal */ +export function mcpSlowToolMs(env: NodeJS.ProcessEnv = process.env): number { + const raw = Number(env.KTX_MCP_SLOW_TOOL_MS); + return Number.isFinite(raw) && raw >= 0 ? raw : DEFAULT_SLOW_TOOL_MS; +} + +/** + * Serialize an error for a structured `err` field. Genuine `Error`s get pino's + * standard serializer (type + message + stack); everything else is reduced to a + * message — the in-band tool-error path has already lost the original stack. 
+ */ +export function serializeMcpError(error: unknown): Record { + if (error instanceof Error) { + return { ...pino.stdSerializers.err(error) }; + } + return { message: typeof error === 'string' ? error : String(error) }; +} + +/** + * One synchronous pino logger per MCP server process, written to the `io.stderr` + * sink. stderr is the only universally-correct sink: the stdio transport reserves + * stdout for JSON-RPC, and the HTTP daemon redirects stderr into `.ktx/logs/mcp.log`. + * Synchronous writes are load-bearing — a `tool.start` line must reach the fd before + * a blocking handler runs, so a runaway query still leaves its start record on disk. + * Format follows the terminal, not a flag: pretty for a TTY, plain JSON otherwise. + */ +export function createMcpLogger(io: KtxCliIo, options: { isTTY?: boolean } = {}): KtxMcpLogger { + const level = mcpLogLevel(); + const isTTY = options.isTTY ?? process.stderr.isTTY === true; + if (isTTY) { + const sink = new Writable({ + write(chunk: Buffer | string, _encoding, callback) { + io.stderr.write(typeof chunk === 'string' ? 
chunk : chunk.toString('utf8')); + callback(); + }, + }); + return pino({ level }, PinoPretty({ colorize: true, sync: true, destination: sink })); + } + return pino({ level }, io.stderr as DestinationStream); +} diff --git a/packages/cli/src/context/mcp/server.ts b/packages/cli/src/context/mcp/server.ts index 85871467..9ce0d5e9 100644 --- a/packages/cli/src/context/mcp/server.ts +++ b/packages/cli/src/context/mcp/server.ts @@ -11,6 +11,7 @@ export function createKtxMcpServer(deps: KtxMcpServerDeps): KtxMcpServerDeps['se userContext: deps.userContext, projectDir: deps.projectDir, io: deps.io, + logger: deps.logger, getClientInfo: deps.getClientInfo, }); } @@ -31,6 +32,7 @@ export function createDefaultKtxMcpServer( contextTools: deps.contextTools, projectDir: deps.projectDir, io: deps.io, + logger: deps.logger, // The SDK populates the client identity after the initialize handshake, so // read it lazily at emit time rather than at registration (undefined here). getClientInfo: () => server.server.getClientVersion(), diff --git a/packages/cli/src/context/mcp/types.ts b/packages/cli/src/context/mcp/types.ts index e48d0975..c8dbd480 100644 --- a/packages/cli/src/context/mcp/types.ts +++ b/packages/cli/src/context/mcp/types.ts @@ -1,5 +1,6 @@ import type { MemoryIngestService } from '../../context/memory/memory-runs.js'; import type { KtxCliIo } from '../../cli-runtime.js'; +import type { KtxMcpLogger } from './logger.js'; import type { KtxEntityDetailsInput, KtxEntityDetailsResponse } from '../scan/entity-details.js'; import type { KtxDiscoverDataInput, KtxDiscoverDataResponse } from '../../context/search/discover.js'; import type { KtxDictionarySearchInput, KtxDictionarySearchResponse } from '../../context/sl/dictionary-search.js'; @@ -28,6 +29,8 @@ interface KtxMcpProgressEvent { export type KtxMcpProgressCallback = (event: KtxMcpProgressEvent) => void | Promise; export interface KtxMcpToolHandlerContext { + /** Present for the HTTP StreamableHTTP transport (one per 
session); absent for stdio. */ + sessionId?: string; _meta?: { progressToken?: string | number; [key: string]: unknown }; sendNotification?: (notification: { method: 'notifications/progress'; @@ -113,7 +116,12 @@ interface KtxKnowledgePage { /** @internal */ export interface KtxKnowledgeMcpPort { - search(input: { userId: string; query: string; limit: number }): Promise; + search(input: { + userId: string; + query: string; + limit: number; + connectionId?: string; + }): Promise; read(input: { userId: string; key: string }): Promise; } @@ -172,6 +180,11 @@ export interface KtxSqlExecutionMcpPort { ): Promise; } +/** @internal */ +export interface KtxDialectNotesMcpPort { + read(input: { connectionId: string }): Promise<{ connectionId: string; dialect: string; notes: string }>; +} + export interface KtxMcpContextPorts { connections?: KtxConnectionsMcpPort; knowledge?: KtxKnowledgeMcpPort; @@ -180,6 +193,7 @@ export interface KtxMcpContextPorts { dictionarySearch?: KtxDictionarySearchMcpPort; discover?: KtxDiscoverDataMcpPort; sqlExecution?: KtxSqlExecutionMcpPort; + dialectNotes?: KtxDialectNotesMcpPort; memoryIngest?: MemoryIngestPort; } @@ -189,6 +203,8 @@ export interface KtxMcpServerDeps { contextTools?: KtxMcpContextPorts; projectDir?: string; io?: KtxCliIo; + /** Shared per-process logger for tool-call observability; tool-call logging is off when absent. */ + logger?: KtxMcpLogger; /** Reads the connected client's identity once the initialize handshake completes. */ getClientInfo?: () => KtxMcpClientInfo | undefined; } diff --git a/packages/cli/src/context/memory/memory-agent.service.ts b/packages/cli/src/context/memory/memory-agent.service.ts index 491cd159..4c5aeb49 100644 --- a/packages/cli/src/context/memory/memory-agent.service.ts +++ b/packages/cli/src/context/memory/memory-agent.service.ts @@ -168,7 +168,7 @@ export class MemoryAgentService { : ''; const prompt = [ `# Wiki Index\n\n${wikiIndex}`, - hasSL ? 
`\n# Semantic Layer Sources\n\n${slIndex}` : '', + hasSL ? `\n# Semantic Layer Sources (connectionId: ${input.connectionId})\n\n${slIndex}` : '', '\n---\n', assistantSection, `\n## User Message\n\n${input.userMessage.trim()}`, diff --git a/packages/cli/src/context/project/config.ts b/packages/cli/src/context/project/config.ts index 32c58e51..92377b39 100644 --- a/packages/cli/src/context/project/config.ts +++ b/packages/cli/src/context/project/config.ts @@ -209,6 +209,11 @@ const scanRelationshipsSchema = z .union([z.literal('all'), z.int().nonnegative()]) .optional() .describe('Cap on validation queries per scan run. Use "all" for unlimited, an integer for a hard cap, or omit for the runtime default.'), + detectionBudgetMs: z + .int() + .positive() + .default(600_000) + .describe('Wall-clock budget (ms) for the whole relationship-detection stage. Checked at table-profile, LLM-proposal, candidate-validation, and composite-probe boundaries; above the per-query deadline. On exhaustion the stage stops scheduling new work and returns the relationships found so far, marked partial. Raise it to trigger a fresher, fuller run.'), }) .describe('Schema-scan relationship discovery and validation tunables.'); diff --git a/packages/cli/src/context/project/driver-schemas.ts b/packages/cli/src/context/project/driver-schemas.ts index 52c2bf3a..25fa3507 100644 --- a/packages/cli/src/context/project/driver-schemas.ts +++ b/packages/cli/src/context/project/driver-schemas.ts @@ -30,7 +30,15 @@ function warehouseConnectionSchema(driver: .array(z.string().min(1)) .optional() .describe( - 'Optional allowlist of fully-qualified table names ("schema.table") to ingest. When set, live-database ingest discards any table whose schema-qualified name is not in this list. Useful for smoke-testing ingest on a single table.', + 'Optional allowlist of object names to ingest. Accepted forms: "catalog.db.name", "db.name" (schema-qualified), or bare "name". 
When set, live-database ingest restricts the scan to the listed objects and fails with a clear error if none match. For SQLite, "main.<name>" and the bare "<name>" are equivalent (SQLite exposes a single "main" schema). Useful for smoke-testing ingest on a single table.', + ), + query_timeout_ms: z + .number() + .int() + .positive() + .optional() + .describe( + 'Maximum execution time for a single read-only query, in milliseconds (default 30000). Enforced as a server-side statement timeout for remote engines and by SIGKILL-ing a forked query subprocess for in-process SQLite. A query exceeding it is cancelled and returns a "query exceeded Ns" error so the agent can revise.', ), }) .describe( diff --git a/packages/cli/src/context/project/project.ts b/packages/cli/src/context/project/project.ts index 156b200c..0e0bc56e 100644 --- a/packages/cli/src/context/project/project.ts +++ b/packages/cli/src/context/project/project.ts @@ -37,7 +37,7 @@ export interface InitKtxProjectResult extends KtxLocalProject { const TRACKED_SCAFFOLD_FILES: Array<{ path: string; content: string }> = [ { path: '.ktx/.gitignore', - content: 'cache/\ndb.sqlite\ndb.sqlite-*\ningest-transcripts/\nsecrets/\nsetup/\nagents/\n', + content: 'cache/\ndb.sqlite\ndb.sqlite-*\ningest-transcripts/\nlogs/\nsecrets/\nsetup/\nagents/\n', }, { path: '.ktx/prompts/.gitkeep', content: '' }, { path: '.ktx/skills/.gitkeep', content: '' }, diff --git a/packages/cli/src/context/project/setup-config.ts b/packages/cli/src/context/project/setup-config.ts index f790597e..ece71580 100644 --- a/packages/cli/src/context/project/setup-config.ts +++ b/packages/cli/src/context/project/setup-config.ts @@ -24,6 +24,7 @@ const SETUP_GITIGNORE_ENTRIES = [ 'db.sqlite', 'db.sqlite-*', 'ingest-transcripts/', + 'logs/', 'secrets/', 'setup/', 'agents/', diff --git a/packages/cli/src/context/scan/description-generation.ts b/packages/cli/src/context/scan/description-generation.ts index 7125a8a4..3ea02c36 100644 --- 
a/packages/cli/src/context/scan/description-generation.ts +++ b/packages/cli/src/context/scan/description-generation.ts @@ -1,5 +1,10 @@ -import type { KtxLlmRuntimePort } from '../../context/llm/runtime-port.js'; +import type { ChildProcess } from 'node:child_process'; import { z } from 'zod'; +import type { KtxLlmRuntimePort } from '../../context/llm/runtime-port.js'; +import { + KtxSubprocessDeadlineError, + runGenerateObjectInSubprocess, +} from '../../context/llm/subprocess-generate-object.js'; import type { KtxColumnSampleInput, KtxColumnSampleResult, @@ -145,6 +150,8 @@ export interface KtxDescriptionGeneratorOptions { logger?: KtxScanLoggerPort; onWarning?: (warning: KtxScanWarning) => void; settings: KtxDescriptionGenerationSettings; + /** @internal Test seam: spawn the kill-boundary child for subprocess backends. */ + spawnSubprocessGenerateChild?: () => ChildProcess; } interface ColumnTaskResult { @@ -510,12 +517,14 @@ export class KtxDescriptionGenerator { private readonly logger?: KtxScanLoggerPort; private readonly onWarning?: (warning: KtxScanWarning) => void; private readonly settings: ResolvedKtxDescriptionGenerationSettings; + private readonly spawnSubprocessGenerateChild?: () => ChildProcess; constructor(options: KtxDescriptionGeneratorOptions) { this.llmRuntime = options.llmRuntime; this.cache = options.cache; this.logger = options.logger; this.onWarning = options.onWarning; + this.spawnSubprocessGenerateChild = options.spawnSubprocessGenerateChild; this.settings = { columnMaxWords: options.settings.columnMaxWords, tableMaxWords: options.settings.tableMaxWords, @@ -757,6 +766,21 @@ export class KtxDescriptionGenerator { let tableDescription: string | null = null; let structuredGenerationSucceeded = false; + // Bound + retry the per-table enrichment LLM call. A transient backend error + // (e.g. 
an "overloaded"/burst rejection when many tables enrich concurrently) + // otherwise nulls a whole table's descriptions on the FIRST failure — sampleTable + // already retries, this call did not, so transient errors silently dropped most + // tables of a db. retryAsync gives it the same 3-attempt backoff. A FRESH timeout + // per attempt still bounds a wedged wide table (it never returns a result message); + // a timeout is surfaced as KtxAbortedError so retryAsync does NOT retry it (one + // wedge stays one timeout, not 3×). Tune via KTX_ENRICH_LLM_TIMEOUT_MS (default + // 120s) and KTX_ENRICH_LLM_ATTEMPTS (default 3). + const rawEnrichTimeoutMs = Number(process.env.KTX_ENRICH_LLM_TIMEOUT_MS); + const enrichTimeoutMs = Number.isFinite(rawEnrichTimeoutMs) && rawEnrichTimeoutMs > 0 ? rawEnrichTimeoutMs : 120_000; + const enrichAttempts = Math.max(1, Number(process.env.KTX_ENRICH_LLM_ATTEMPTS ?? 3) || 3); + let llmStartedAt = 0; + let lastTimedOut = false; + try { const prompt = batchedPrompt({ table: input.table, @@ -765,15 +789,91 @@ export class KtxDescriptionGenerator { tableMaxWords: this.settings.tableMaxWords, columnMaxWords: this.settings.columnMaxWords, }); - const generated = await this.llmRuntime.generateObject< - BatchedTableDescriptionOutput, - typeof batchedTableDescriptionSchema - >({ - role: 'candidateExtraction', - system: prompt.system, - prompt: prompt.user, - schema: batchedTableDescriptionSchema, - temperature: this.settings.temperature, + llmStartedAt = Date.now(); + this.logger?.info( + `[enrich] llm:start table=${input.table.name} cols=${input.table.columns.length} promptChars=${prompt.user.length} timeoutMs=${enrichTimeoutMs} attempts=${enrichAttempts}`, + { connectorId: input.connector.id, table: input.table.name, columns: input.table.columns.length }, + ); + // Subprocess backends (codex/claude-code) own an SDK child that ignores the + // in-process abort, so each attempt runs behind a tree-killable boundary; + // HTTP backends keep the 
native abortSignal -> fetch cancellation. + const enrichForkSpec = this.llmRuntime.subprocessForkSpec(); + const enrichJsonSchema = enrichForkSpec + ? (z.toJSONSchema(batchedTableDescriptionSchema, { target: 'draft-7' }) as Record) + : null; + const generated = await retryAsync( + async () => { + if (enrichForkSpec && enrichJsonSchema) { + try { + return await runGenerateObjectInSubprocess< + BatchedTableDescriptionOutput, + typeof batchedTableDescriptionSchema + >({ + forkSpec: enrichForkSpec, + role: 'candidateExtraction', + system: prompt.system, + prompt: prompt.user, + schema: batchedTableDescriptionSchema, + jsonSchema: enrichJsonSchema, + deadlineMs: enrichTimeoutMs, + ...(input.context.signal ? { signal: input.context.signal } : {}), + ...(this.spawnSubprocessGenerateChild + ? { spawnChild: this.spawnSubprocessGenerateChild } + : {}), + }); + } catch (error) { + // The boundary tree-kills the wedged child on deadline; a per-table + // timeout is not worth retrying (it would just time out again), so + // surface it as KtxAbortedError so retryAsync stops immediately. + if (error instanceof KtxSubprocessDeadlineError && !input.context.signal?.aborted) { + lastTimedOut = true; + throw new KtxAbortedError(); + } + throw error; + } + } + const enrichTimeout = AbortSignal.timeout(enrichTimeoutMs); + const abortSignal = input.context.signal + ? AbortSignal.any([enrichTimeout, input.context.signal]) + : enrichTimeout; + try { + return await this.llmRuntime.generateObject< + BatchedTableDescriptionOutput, + typeof batchedTableDescriptionSchema + >({ + role: 'candidateExtraction', + system: prompt.system, + prompt: prompt.user, + schema: batchedTableDescriptionSchema, + temperature: this.settings.temperature, + abortSignal, + }); + } catch (error) { + // A per-table timeout is not worth retrying (it would just time out + // again); surface it as KtxAbortedError so retryAsync stops immediately. 
+ // A genuine context cancellation is handled by retryAsync's own signal check. + if (enrichTimeout.aborted && !input.context.signal?.aborted) { + lastTimedOut = true; + throw new KtxAbortedError(); + } + throw error; + } + }, + { + attempts: enrichAttempts, + baseDelayMs: 500, + ...(input.context.signal ? { signal: input.context.signal } : {}), + onAttemptFailure: (error, attempt) => { + this.logger?.warn( + `[enrich] llm:retry table=${input.table.name} attempt=${attempt}: ${errorMessage(error)}`, + { connectorId: input.connector.id, table: input.table.name, attempt }, + ); + }, + }, + ); + this.logger?.info(`[enrich] llm:done table=${input.table.name} ms=${Date.now() - llmStartedAt}`, { + connectorId: input.connector.id, + table: input.table.name, }); structuredGenerationSucceeded = true; tableDescription = generated.tableDescription.trim() || null; @@ -794,16 +894,25 @@ export class KtxDescriptionGenerator { }); } } catch (error) { - this.logger?.warn(`Batched table description failed for ${input.table.name}: ${errorMessage(error)}`, { - connectorId: input.connector.id, - table: input.table.name, - }); + // A genuine cancellation propagates so the stage fails and resumes; a + // per-table timeout (context.signal not aborted) still degrades to null. + if (input.context.signal?.aborted) { + throw error; + } + const elapsedMs = llmStartedAt ? Date.now() - llmStartedAt : 0; + const timedOut = lastTimedOut; + this.logger?.warn( + `[enrich] llm:${timedOut ? 'TIMEOUT' : 'fail'} table=${input.table.name} cols=${input.table.columns.length} ms=${elapsedMs}: ${errorMessage(error)}`, + { connectorId: input.connector.id, table: input.table.name, timedOut, elapsedMs }, + ); this.onWarning?.({ - code: 'enrichment_failed', - message: `Failed to generate batched description for table ${input.table.name}: ${errorMessage(error)}`, + code: timedOut ? 'enrichment_timeout' : 'enrichment_failed', + message: `${ + timedOut ? 
`Timed out after ${elapsedMs}ms generating` : 'Failed to generate' + } batched description for table ${input.table.name}: ${errorMessage(error)}`, table: input.table.name, recoverable: true, - metadata: { connectorId: input.connector.id }, + metadata: { connectorId: input.connector.id, ...(timedOut ? { timeoutMs: enrichTimeoutMs } : {}) }, }); } diff --git a/packages/cli/src/context/scan/enabled-tables.ts b/packages/cli/src/context/scan/enabled-tables.ts index 96c94afd..a8368a51 100644 --- a/packages/cli/src/context/scan/enabled-tables.ts +++ b/packages/cli/src/context/scan/enabled-tables.ts @@ -10,21 +10,34 @@ import type { KtxTableRef } from './types.js'; * "catalog.db.name" — fully qualified * "db.name" — schema-qualified (catalog = null) * "name" — bare (catalog = db = null; SQLite-shape) + * + * SQLite exposes a single schema named `main` but the connector emits objects + * with `db: null`, so the `"main.<name>"` form is normalized to the bare shape + * to match. Both `"main.customers"` and `"customers"` therefore select the same + * object. */ export function resolveEnabledTables( connection: Record | undefined, ): ReadonlySet | null { const raw = connection?.enabled_tables; if (!Array.isArray(raw) || raw.length === 0) return null; + const driver = typeof connection?.driver === 'string' ? 
connection.driver : undefined; const refs: KtxTableRef[] = []; for (const value of raw) { const parsed = parseEnabledTableEntry(value); - if (parsed) refs.push(parsed); + if (parsed) refs.push(normalizeRefForDriver(parsed, driver)); } if (refs.length === 0) return null; return tableRefSet(refs); } +function normalizeRefForDriver(ref: KtxTableRef, driver: string | undefined): KtxTableRef { + if (driver === 'sqlite' && ref.catalog === null && ref.db === 'main') { + return { catalog: null, db: null, name: ref.name }; + } + return ref; +} + function parseEnabledTableEntry(value: unknown): KtxTableRef | null { if (typeof value === 'string') { return parseDottedTableEntry(value); diff --git a/packages/cli/src/context/scan/enrichment-state.ts b/packages/cli/src/context/scan/enrichment-state.ts index 4b4913e7..40975003 100644 --- a/packages/cli/src/context/scan/enrichment-state.ts +++ b/packages/cli/src/context/scan/enrichment-state.ts @@ -1,14 +1,19 @@ import { createHash } from 'node:crypto'; +import type { KtxScanRelationshipConfig } from '../project/config.js'; import type { KtxScanEnrichmentStage, KtxScanEnrichmentStateSummary, KtxScanMode, KtxSchemaSnapshot } from './types.js'; -const KTX_SCAN_ENRICHMENT_STAGES: readonly KtxScanEnrichmentStage[] = [ +/** + * Canonical enrichment-stage registry. The `--stages` CLI parser validates + * against this list, and stage selection / iteration derives its order here. + */ +export const KTX_SCAN_ENRICHMENT_STAGES: readonly KtxScanEnrichmentStage[] = [ 'descriptions', 'embeddings', 'relationships', ] as const; export interface KtxScanEnrichmentStageLookup { - runId: string; + connectionId: string; stage: KtxScanEnrichmentStage; inputHash: string; } @@ -47,6 +52,15 @@ export interface KtxScanEnrichmentStateStore { findCompletedStage( input: KtxScanEnrichmentStageLookup, ): Promise | null>; + /** + * The most recently completed row for a (connection, stage) pair regardless of + * input hash. 
Used by the staleness check to compare a stage's stored hash + * against its freshly recomputed one (D4). + */ + findLatestCompletedStage(input: { + connectionId: string; + stage: KtxScanEnrichmentStage; + }): Promise; saveCompletedStage( input: Omit, 'status' | 'errorMessage'>, ): Promise; @@ -54,12 +68,35 @@ export interface KtxScanEnrichmentStateStore { listRunStages(runId: string): Promise; } -export interface ComputeKtxScanEnrichmentInputHashInput { +/** Description-LLM identity: the inputs that change a description's content. */ +export interface KtxScanLlmIdentity { + model: string | null; + baseUrlConfigured: boolean; +} + +/** Embedding-model identity: the inputs that change an embedding vector. */ +export interface KtxScanEmbeddingIdentity { + model: string | null; + dimensions: number | null; + batchSize: number | null; +} + +export interface KtxDescriptionsStageHashInput { snapshot: KtxSchemaSnapshot; - mode: KtxScanMode; - detectRelationships: boolean; - providerIdentity: Record; - relationshipSettings?: unknown; + llmIdentity: KtxScanLlmIdentity; +} + +export interface KtxEmbeddingsStageHashInput { + snapshot: KtxSchemaSnapshot; + embeddingIdentity: KtxScanEmbeddingIdentity; + /** Digest of the resolved description text the embeddings consume (see {@link computeKtxScanDescriptionDigest}). 
*/ + descriptionDigest: string; +} + +export interface KtxRelationshipsStageHashInput { + snapshot: KtxSchemaSnapshot; + relationshipSettings: KtxScanRelationshipConfig; + llmIdentity: KtxScanLlmIdentity; } function stableJson(value: unknown): string { @@ -75,8 +112,38 @@ function stableJson(value: unknown): string { return JSON.stringify(value); } -export function computeKtxScanEnrichmentInputHash(input: ComputeKtxScanEnrichmentInputHashInput): string { - return createHash('sha256').update(stableJson(input)).digest('hex'); +function sha256(value: unknown): string { + return createHash('sha256').update(stableJson(value)).digest('hex'); +} + +export function computeKtxDescriptionsStageHash(input: KtxDescriptionsStageHashInput): string { + return sha256({ snapshot: input.snapshot, llmIdentity: input.llmIdentity }); +} + +export function computeKtxEmbeddingsStageHash(input: KtxEmbeddingsStageHashInput): string { + return sha256({ + snapshot: input.snapshot, + embeddingIdentity: input.embeddingIdentity, + descriptionDigest: input.descriptionDigest, + }); +} + +export function computeKtxRelationshipsStageHash(input: KtxRelationshipsStageHashInput): string { + return sha256({ + snapshot: input.snapshot, + relationshipSettings: input.relationshipSettings, + llmIdentity: input.llmIdentity, + }); +} + +/** + * Content digest of the resolved per-column description text the embeddings + * stage consumes. Folding it into the embeddings hash content-addresses + * embeddings on their real upstream, so re-describing busts only the embeddings + * that depend on the changed text (D4 self-healing). 
+ */ +export function computeKtxScanDescriptionDigest(texts: readonly string[]): string { + return sha256(texts); } function uniqueStages(stages: KtxScanEnrichmentStage[]): KtxScanEnrichmentStage[] { diff --git a/packages/cli/src/context/scan/local-enrichment-artifacts.ts b/packages/cli/src/context/scan/local-enrichment-artifacts.ts index 798107b8..fa18777a 100644 --- a/packages/cli/src/context/scan/local-enrichment-artifacts.ts +++ b/packages/cli/src/context/scan/local-enrichment-artifacts.ts @@ -1,10 +1,11 @@ import YAML from 'yaml'; -import { buildLiveDatabaseManifestShards, type LiveDatabaseManifestExistingDescriptions, type LiveDatabaseManifestJoinData, type LiveDatabaseManifestJoinEntry, type LiveDatabaseManifestShard, type LiveDatabaseManifestTableData } from '../../context/ingest/adapters/live-database/manifest.js'; +import { buildLiveDatabaseManifestShards, buildTableRef, type LiveDatabaseManifestExistingDescriptions, type LiveDatabaseManifestJoinData, type LiveDatabaseManifestJoinEntry, type LiveDatabaseManifestShard, type LiveDatabaseManifestTableData } from '../../context/ingest/adapters/live-database/manifest.js'; import type { TableUsageOutput } from '../../context/ingest/adapters/historic-sql/skill-schemas.js'; import type { KtxScanRelationshipConfig } from '../project/config.js'; import type { KtxLocalProject } from '../../context/project/project.js'; import { isSlYamlPath } from '../../context/sl/source-files.js'; import { deriveFederatedConnection } from '../connections/federation.js'; +import { tableRefKey } from './table-ref.js'; import type { KtxLocalScanEnrichmentResult } from './local-enrichment.js'; import { buildKtxRelationshipArtifacts, @@ -28,6 +29,12 @@ export interface WriteLocalScanManifestShardsInput { dryRun: boolean; descriptionUpdates?: KtxLocalScanEnrichmentResult['descriptionUpdates']; relationshipUpdate?: KtxLocalScanEnrichmentResult['relationshipUpdate']; + /** + * When set, write only the shards that contain one of these 
tables. All shards + * are still built (so merging preserves prior content); the unlisted shards are + * left untouched on disk. Used by the incremental flush to bound git commits. + */ + onlyChangedTableNames?: ReadonlySet; } export interface WriteLocalScanManifestShardsResult { @@ -75,9 +82,8 @@ function schemaDir(connectionId: string): string { function tableDescription( table: KtxSchemaTable, - descriptionUpdates: LocalDescriptionUpdates = [], + update: LocalDescriptionUpdates[number] | undefined, ): Record | undefined { - const update = descriptionUpdates.find((candidate) => candidate.table.name === table.name); const descriptions: Record = {}; if (table.comment) { descriptions.db = table.comment; @@ -89,11 +95,9 @@ function tableDescription( } function columnDescription( - table: KtxSchemaTable, column: KtxSchemaColumn, - descriptionUpdates: LocalDescriptionUpdates = [], + update: LocalDescriptionUpdates[number] | undefined, ): Record | undefined { - const update = descriptionUpdates.find((candidate) => candidate.table.name === table.name); const aiDescription = update?.columnDescriptions[column.name] ?? null; const descriptions: Record = {}; if (column.comment) { @@ -109,19 +113,25 @@ function snapshotTablesToManifestData( snapshot: KtxSchemaSnapshot, descriptionUpdates: LocalDescriptionUpdates = [], ): LiveDatabaseManifestTableData[] { - return snapshot.tables.map((table) => ({ - name: table.name, - catalog: table.catalog, - db: table.db, - descriptions: tableDescription(table, descriptionUpdates), - columns: table.columns.map((column) => ({ - name: column.name, - type: column.dimensionType, - ...(column.primaryKey ? { pk: true } : {}), - ...(column.nullable === false ? { nullable: false } : {}), - descriptions: columnDescription(table, column, descriptionUpdates), - })), - })); + // Resolve a table's descriptions by full identity: two same-named tables in + // different schemas must not collapse onto one update. 
+ const updateByRef = new Map(descriptionUpdates.map((update) => [tableRefKey(update.table), update])); + return snapshot.tables.map((table) => { + const update = updateByRef.get(tableRefKey({ catalog: table.catalog, db: table.db, name: table.name })); + return { + name: table.name, + catalog: table.catalog, + db: table.db, + descriptions: tableDescription(table, update), + columns: table.columns.map((column) => ({ + name: column.name, + type: column.dimensionType, + ...(column.primaryKey ? { pk: true } : {}), + ...(column.nullable === false ? { nullable: false } : {}), + descriptions: columnDescription(column, update), + })), + }; + }); } function formalJoins(snapshot: KtxSchemaSnapshot): LiveDatabaseManifestJoinData[] { @@ -256,7 +266,10 @@ async function loadExistingManifestState( if (!validTableNames.has(tableName)) { continue; } - descriptions.set(tableName, { + // Descriptions/usage key on the fully-qualified `entry.table` ref so two + // same-named tables across schemas stay distinct; joins remain keyed by + // bare name to match the bare-name join graph. + descriptions.set(entry.table, { table: entry.descriptions ? { ...entry.descriptions } : undefined, columns: new Map( (entry.columns ?? []).flatMap((column) => @@ -265,7 +278,7 @@ async function loadExistingManifestState( ), }); if (entry.usage) { - usage.set(tableName, { ...entry.usage }); + usage.set(entry.table, { ...entry.usage }); } const joins = (entry.joins ?? []).filter((join) => { return ( @@ -286,6 +299,108 @@ async function loadExistingManifestState( return { descriptions, preservedJoins, usage }; } +/** + * Reconstructs the descriptions already persisted in the on-disk `_schema` as + * the in-memory `descriptionUpdates` shape, so a stage-selective run that skips + * the descriptions stage (e.g. `--stages relationships`/`--stages embeddings`) + * can still feed embeddings + relationships the prior AI descriptions. Tables or + * columns with no AI description carry `null`. 
+ */ +export async function loadOnDiskDescriptionUpdates( + project: KtxLocalProject, + connectionId: string, + snapshot: KtxSchemaSnapshot, +): Promise { + const siblingTargets = await federatedSiblingTargets(project, connectionId); + const existing = await loadExistingManifestState(project, connectionId, snapshot, siblingTargets); + return snapshot.tables.map((table) => { + const entry = existing.descriptions.get(buildTableRef(table.name, table.catalog, table.db)); + const columnDescriptions: Record = {}; + for (const column of table.columns) { + columnDescriptions[column.name] = entry?.columns.get(column.name)?.ai ?? null; + } + return { + table: { catalog: table.catalog, db: table.db, name: table.name }, + tableDescription: entry?.table?.ai ?? null, + columnDescriptions, + }; + }); +} + +// The incremental descriptions resume record. It lives at a stable, NON-syncId +// path: a from-scratch interruption gets a fresh syncId on the next run, so a +// syncId-scoped record would be unreachable on resume. The manifest already lives +// at the same stable per-connection scope. +function descriptionsProgressPath(connectionId: string): string { + return `raw-sources/${connectionId}/${LIVE_DATABASE_ADAPTER}/enrichment-progress/descriptions.json`; +} + +interface DescriptionsProgressRecord { + inputHash: string; + descriptions: LocalDescriptionUpdates; +} + +export interface KtxScanDescriptionResumeStore { + /** Prior enriched descriptions when the durable record matches `inputHash`, else null. */ + load(inputHash: string): Promise; + /** Persist the descriptions so far + the manifest shards that gained a table this batch. 
*/ + flush(input: { + inputHash: string; + snapshot: KtxSchemaSnapshot; + descriptionUpdates: LocalDescriptionUpdates; + changedTableNames: ReadonlySet; + }): Promise; +} + +export function createKtxScanDescriptionResumeStore(deps: { + project: KtxLocalProject; + connectionId: string; + syncId: string; + driver: KtxConnectionDriver; +}): KtxScanDescriptionResumeStore { + const path = descriptionsProgressPath(deps.connectionId); + return { + async load(inputHash) { + let content: string; + try { + ({ content } = await deps.project.fileStore.readFile(path)); + } catch { + return null; + } + try { + const record = JSON.parse(content) as DescriptionsProgressRecord | null; + // A changed inputHash (schema or enrichment settings changed) ignores the + // prior record and recomputes — spec-19's inputHash-gated resume semantics. + if (!record || record.inputHash !== inputHash || !Array.isArray(record.descriptions)) { + return null; + } + return record.descriptions; + } catch { + return null; + } + }, + async flush({ inputHash, snapshot, descriptionUpdates, changedTableNames }) { + const record: DescriptionsProgressRecord = { inputHash, descriptions: descriptionUpdates }; + await writeJsonArtifact( + deps.project, + path, + record, + `scan(${LIVE_DATABASE_ADAPTER}): flush enrichment descriptions progress syncId=${deps.syncId}`, + ); + await writeLocalScanManifestShards({ + project: deps.project, + connectionId: deps.connectionId, + syncId: deps.syncId, + driver: deps.driver, + snapshot, + descriptionUpdates, + dryRun: false, + onlyChangedTableNames: changedTableNames, + }); + }, + }; +} + async function writeJsonArtifact( project: KtxLocalProject, path: string, @@ -331,6 +446,9 @@ export async function writeLocalScanManifestShards( const manifestShards: string[] = []; for (const [shardKey, shard] of [...shards.entries()].sort(([left], [right]) => left.localeCompare(right))) { + if (input.onlyChangedTableNames && !Object.keys(shard.tables).some((table) => 
input.onlyChangedTableNames!.has(table))) { + continue; + } const path = `${schemaDir(input.connectionId)}/${shardKey}.yaml`; await input.project.fileStore.writeFile( path, @@ -348,23 +466,14 @@ export async function writeLocalScanManifestShards( }; } -export async function writeLocalScanEnrichmentArtifacts( - input: WriteLocalScanEnrichmentArtifactsInput, -): Promise { - if (input.dryRun) { - return { - enrichmentArtifacts: [], - manifestShards: [], - manifestShardsWritten: 0, - }; - } - - const enrichmentRoot = artifactDir(input.connectionId, input.syncId); - const descriptionsArtifact = `${enrichmentRoot}/descriptions.json`; - const embeddingsArtifact = `${enrichmentRoot}/embeddings.json`; - const relationshipsArtifact = `${enrichmentRoot}/relationships.json`; - const relationshipProfileArtifact = `${enrichmentRoot}/relationship-profile.json`; - const relationshipDiagnosticsArtifact = `${enrichmentRoot}/relationship-diagnostics.json`; +async function writeEnrichmentDescriptionArtifacts(input: { + project: KtxLocalProject; + enrichmentRoot: string; + syncId: string; + enrichment: KtxLocalScanEnrichmentResult; +}): Promise { + const descriptionsArtifact = `${input.enrichmentRoot}/descriptions.json`; + const embeddingsArtifact = `${input.enrichmentRoot}/embeddings.json`; const enrichmentArtifacts: string[] = []; if ( @@ -388,6 +497,67 @@ export async function writeLocalScanEnrichmentArtifacts( `scan(${LIVE_DATABASE_ADAPTER}): write enrichment embeddings syncId=${input.syncId}`, ); } + return enrichmentArtifacts; +} + +/** + * Promote the descriptions + embeddings into the queryable `_schema` manifest + * (and the raw enrichment artifacts) before relationship detection runs. The + * generated joins and the relationship diagnostics are deliberately left to the + * final write, so an interrupted relationship stage never loses the paid LLM + * enrichment and never emits empty relationship diagnostics. 
+ */ +export async function writeLocalScanEnrichmentCheckpoint( + input: WriteLocalScanEnrichmentArtifactsInput, +): Promise { + if (input.dryRun) { + return { enrichmentArtifacts: [], manifestShards: [], manifestShardsWritten: 0 }; + } + + const enrichmentArtifacts = await writeEnrichmentDescriptionArtifacts({ + project: input.project, + enrichmentRoot: artifactDir(input.connectionId, input.syncId), + syncId: input.syncId, + enrichment: input.enrichment, + }); + const manifestResult = await writeLocalScanManifestShards({ + project: input.project, + connectionId: input.connectionId, + syncId: input.syncId, + driver: input.driver, + snapshot: input.enrichment.snapshot, + descriptionUpdates: input.enrichment.descriptionUpdates, + dryRun: false, + }); + + return { + enrichmentArtifacts, + manifestShards: manifestResult.manifestShards, + manifestShardsWritten: manifestResult.manifestShardsWritten, + }; +} + +export async function writeLocalScanEnrichmentArtifacts( + input: WriteLocalScanEnrichmentArtifactsInput, +): Promise { + if (input.dryRun) { + return { + enrichmentArtifacts: [], + manifestShards: [], + manifestShardsWritten: 0, + }; + } + + const enrichmentRoot = artifactDir(input.connectionId, input.syncId); + const relationshipsArtifact = `${enrichmentRoot}/relationships.json`; + const relationshipProfileArtifact = `${enrichmentRoot}/relationship-profile.json`; + const relationshipDiagnosticsArtifact = `${enrichmentRoot}/relationship-diagnostics.json`; + const enrichmentArtifacts = await writeEnrichmentDescriptionArtifacts({ + project: input.project, + enrichmentRoot, + syncId: input.syncId, + enrichment: input.enrichment, + }); enrichmentArtifacts.push(relationshipsArtifact, relationshipProfileArtifact, relationshipDiagnosticsArtifact); const hasResolvedRelationships = input.enrichment.resolvedRelationships !== null; const relationshipArtifacts = buildKtxRelationshipArtifacts({ @@ -413,6 +583,7 @@ export async function writeLocalScanEnrichmentArtifacts( 
artifacts: relationshipArtifacts, profile: relationshipProfile, warnings: input.enrichment.warnings, + partial: input.enrichment.relationshipPartial, thresholds: input.relationshipSettings ? { acceptThreshold: input.relationshipSettings.acceptThreshold, diff --git a/packages/cli/src/context/scan/local-enrichment.ts b/packages/cli/src/context/scan/local-enrichment.ts index 6addba4a..f391a6c2 100644 --- a/packages/cli/src/context/scan/local-enrichment.ts +++ b/packages/cli/src/context/scan/local-enrichment.ts @@ -6,11 +6,19 @@ import { KtxDescriptionGenerator } from './description-generation.js'; import { buildKtxColumnEmbeddingText } from './embedding-text.js'; import { completedKtxScanEnrichmentStateSummary, - computeKtxScanEnrichmentInputHash, + computeKtxDescriptionsStageHash, + computeKtxEmbeddingsStageHash, + computeKtxRelationshipsStageHash, + computeKtxScanDescriptionDigest, + KTX_SCAN_ENRICHMENT_STAGES, + type KtxScanEmbeddingIdentity, type KtxScanEnrichmentStateStore, + type KtxScanLlmIdentity, summarizeKtxScanEnrichmentState, } from './enrichment-state.js'; import { skippedKtxScanEnrichmentSummary } from './enrichment-summary.js'; +import type { KtxScanDescriptionResumeStore } from './local-enrichment-artifacts.js'; +import { tableRefKey } from './table-ref.js'; import type { KtxEmbeddingUpdate, KtxEnrichedColumn, @@ -21,6 +29,7 @@ import type { KtxRelationshipUpdate, } from './enrichment-types.js'; import type { KtxCompositeRelationshipCandidate } from './relationship-composite-candidates.js'; +import type { KtxRelationshipDetectionStopReason } from './relationship-detection-budget.js'; import type { KtxResolvedRelationshipDiscoveryCandidate } from './relationship-graph-resolver.js'; import { discoverKtxRelationships } from './relationship-discovery.js'; import type { KtxRelationshipProfileArtifact } from './relationship-profiling.js'; @@ -42,7 +51,13 @@ import type { KtxTableRef, } from './types.js'; -const DESCRIPTION_TABLE_CONCURRENCY = 4; +// Parallel 
per-table description generations. Default 4; raise via +// KTX_ENRICH_TABLE_CONCURRENCY for large schemas (the rate-limit governor still +// throttles if the provider pushes back, so a higher cap is safe headroom). +const DESCRIPTION_TABLE_CONCURRENCY = (() => { + const raw = Number(process.env.KTX_ENRICH_TABLE_CONCURRENCY); + return Number.isInteger(raw) && raw >= 1 && raw <= 64 ? raw : 4; +})(); export interface KtxLocalScanEnrichmentProviders { llmRuntime: KtxLlmRuntimePort; @@ -53,15 +68,45 @@ export interface KtxLocalScanEnrichmentInput { connectionId: string; mode: KtxScanMode; detectRelationships?: boolean; + /** + * Enrichment stages to (re)run this invocation. Undefined runs every eligible + * stage and respects the completed-stage short-circuit (spec-19 resume). When + * present, only the named stages run — each force-recomputes (bypassing the + * short-circuit) while unselected stages are left untouched on disk (D3). + */ + stages?: KtxScanEnrichmentStage[]; connector: KtxScanConnector; snapshot?: KtxSchemaSnapshot; context: KtxScanContext; providers: KtxLocalScanEnrichmentProviders | null; stateStore?: KtxScanEnrichmentStateStore | null; + /** + * Durable per-batch resume record for the descriptions stage. When present, an + * interrupted descriptions stage resumes by re-enriching only the tables not + * already flushed (inputHash-gated). Null/undefined disables incremental flush. + */ + descriptionResumeStore?: KtxScanDescriptionResumeStore | null; + /** + * Lazily loads the descriptions already persisted in the on-disk _schema, used + * to feed embeddings + relationships their description context when the + * descriptions stage does not run this invocation (e.g. `--stages relationships`). + * Called at most once and only when a downstream stage needs it, so a normal + * full run never pays the read. 
+ */ + loadPriorDescriptions?: (snapshot: KtxSchemaSnapshot) => Promise; syncId?: string; - providerIdentity?: Record; + /** Description-LLM identity that keys the descriptions + relationships stage hashes. */ + llmIdentity?: KtxScanLlmIdentity; + /** Embedding-model identity that keys the embeddings stage hash. */ + embeddingIdentity?: KtxScanEmbeddingIdentity; relationshipSettings?: KtxScanRelationshipConfig; now?: () => Date; + /** + * Invoked once the last non-relationship stage completes and before + * relationship detection runs, so the descriptions + embeddings reach the + * queryable layer even if the relationship stage is later interrupted. + */ + onCheckpoint?: (checkpoint: KtxLocalScanEnrichmentResult) => Promise; } export interface KtxLocalScanEnrichmentResult { @@ -80,6 +125,7 @@ export interface KtxLocalScanEnrichmentResult { relationshipProfile: KtxRelationshipProfileArtifact | null; resolvedRelationships: KtxResolvedRelationshipDiscoveryCandidate[] | null; compositeRelationships: KtxCompositeRelationshipCandidate[] | null; + relationshipPartial: { reason: KtxRelationshipDetectionStopReason } | null; } function tableId(table: KtxSchemaTable): string { @@ -182,6 +228,17 @@ function providerlessEnrichedWarning(relationshipDetection: boolean): KtxScanWar }; } +function stagePrerequisiteReason(stage: KtxScanEnrichmentStage): string { + switch (stage) { + case 'descriptions': + return 'LLM enrichment is not configured (set scan.enrichment.mode and an LLM provider)'; + case 'embeddings': + return 'no embedding provider is configured (set scan.enrichment.embeddings)'; + case 'relationships': + return 'relationship discovery is disabled (scan.relationships.enabled is false)'; + } +} + export function createDeterministicLocalScanEnrichmentProviders(): KtxLocalScanEnrichmentProviders { return { llmRuntime: deterministicLlmRuntime(), @@ -209,18 +266,25 @@ function deterministicLlmRuntime(): KtxLlmRuntimePort { async runAgentLoop() { return { stopReason: 
'natural' }; }, + subprocessForkSpec() { + return null; + }, }; } export function snapshotToKtxEnrichedSchema( snapshot: KtxSchemaSnapshot, embeddingsByColumnId: ReadonlyMap = new Map(), + descriptions: KtxLocalScanEnrichmentResult['descriptionUpdates'] = [], ): KtxEnrichedSchema { + const descriptionByTable = new Map(descriptions.map((item) => [tableRefKey(item.table), item])); const tables: KtxEnrichedTable[] = snapshot.tables.map((table) => { const id = tableId(table); const ref = tableRef(table); + const tableDescription = descriptionByTable.get(tableRefKey(ref)); const columns: KtxEnrichedColumn[] = table.columns.map((column) => { const idForColumn = columnId(table, column); + const aiColumnDescription = tableDescription?.columnDescriptions[column.name] ?? null; return { id: idForColumn, tableId: id, @@ -234,6 +298,7 @@ export function snapshotToKtxEnrichedSchema( parentColumnId: null, descriptions: { ...(column.comment ? { db: column.comment } : {}), + ...(aiColumnDescription ? { ai: aiColumnDescription } : {}), }, embedding: embeddingsByColumnId.get(idForColumn) ?? null, sampleValues: null, @@ -246,6 +311,7 @@ export function snapshotToKtxEnrichedSchema( enabled: true, descriptions: { ...(table.comment ? { db: table.comment } : {}), + ...(tableDescription?.tableDescription ? { ai: tableDescription.tableDescription } : {}), }, columns, }; @@ -262,11 +328,31 @@ function embeddingBatchSize(maxBatchSize: number): number { return Number.isInteger(maxBatchSize) && maxBatchSize > 0 ? maxBatchSize : 100; } +type KtxScanDescriptionUpdate = KtxLocalScanEnrichmentResult['descriptionUpdates'][number]; + +// Per-batch flush cadence: bounds the at-risk window (and the manifest-rewrite / +// git-commit cost) to a small number of tables. 
+const DESCRIPTION_FLUSH_EVERY = 10; + +function isEnrichedDescriptionUpdate(update: KtxScanDescriptionUpdate): boolean { + return update.tableDescription !== null || Object.values(update.columnDescriptions).some((value) => value !== null); +} + +function nullDescriptionUpdate(table: KtxSchemaTable): KtxScanDescriptionUpdate { + return { + table: tableRef(table), + tableDescription: null, + columnDescriptions: Object.fromEntries(table.columns.map((column) => [column.name, null])), + }; +} + async function generateDescriptions(input: { snapshot: KtxSchemaSnapshot; connector: KtxScanConnector; context: KtxScanContext; providers: KtxLocalScanEnrichmentProviders; + inputHash: string; + resumeStore?: KtxScanDescriptionResumeStore | null; progress?: KtxProgressPort; warnings?: KtxScanWarning[]; }): Promise { @@ -289,67 +375,139 @@ async function generateDescriptions(input: { }, }); - const updates: KtxLocalScanEnrichmentResult['descriptionUpdates'] = []; const totalTables = input.snapshot.tables.length; if (totalTables === 0) { await input.progress?.update(1, 'No tables to describe'); - return updates; + return []; } + + // Resume: recover already-enriched tables (inputHash-gated) and re-issue LLM + // calls only for the remainder. A failed/skipped table carries null descriptions + // and is not recovered, so it is retried. + const recovered = input.resumeStore ? ((await input.resumeStore.load(input.inputHash)) ?? 
[]) : []; + const enrichedById = new Map(); + for (const update of recovered) { + if (isEnrichedDescriptionUpdate(update)) { + enrichedById.set(tableRefKey(update.table), update); + } + } + const remaining = input.snapshot.tables.filter((table) => !enrichedById.has(tableRefKey(tableRef(table)))); + const recoveredCount = enrichedById.size; + if (recoveredCount > 0) { + input.context.logger?.info( + `[enrich] resume: recovered ${recoveredCount}/${totalTables} descriptions, enriching ${remaining.length}`, + ); + } + + const pendingChanged = new Set(); + let sinceFlush = 0; + let flushing = false; + const flush = async (force: boolean): Promise => { + if (!input.resumeStore || flushing || pendingChanged.size === 0) { + return; + } + if (!force && sinceFlush < DESCRIPTION_FLUSH_EVERY) { + return; + } + flushing = true; + const changedTableNames = new Set(pendingChanged); + pendingChanged.clear(); + sinceFlush = 0; + try { + await input.resumeStore.flush({ + inputHash: input.inputHash, + snapshot: input.snapshot, + descriptionUpdates: [...enrichedById.values()], + changedTableNames, + }); + } finally { + flushing = false; + } + }; + const limitTable = pLimit(DESCRIPTION_TABLE_CONCURRENCY); - const tableUpdates = await Promise.all( - input.snapshot.tables.map((table, index) => + await Promise.all( + remaining.map((table, index) => limitTable(async () => { await input.progress?.update( - (index + 1) / totalTables, - `Generating descriptions ${index + 1}/${totalTables} tables`, + (recoveredCount + index + 1) / totalTables, + `Generating descriptions ${recoveredCount + index + 1}/${totalTables} (${table.name}, ${table.columns.length} cols)`, { transient: true, }, ); - const batched = await generator.generateBatchedTableDescriptions({ - connectionId: input.snapshot.connectionId, - connector: input.connector, - context: input.context, - dataSourceType: input.snapshot.driver, - supportsNestedAnalysis: input.connector.capabilities.nestedAnalysis, - table: { - catalog: 
table.catalog, - db: table.db, - name: table.name, - rawDescriptions: table.comment ? { db: table.comment } : {}, - columns: table.columns.map((column) => ({ - name: column.name, - type: column.nativeType, - ...(column.comment ? { rawDescriptions: { db: column.comment } } : {}), - })), - }, - }); - return { - table: tableRef(table), - tableDescription: batched.tableDescription, - columnDescriptions: Object.fromEntries(batched.columnDescriptions), - }; + // Stage-level guarantee: a single table's failure costs one missing + // description, never the whole stage's output. (generateBatchedTableDescriptions + // already degrades its own failures to null descriptions; this backstop keeps + // the guarantee at the fan-out even if a future path throws.) A genuine + // cancellation still propagates so the stage fails and resumes. + let update: KtxScanDescriptionUpdate; + try { + const batched = await generator.generateBatchedTableDescriptions({ + connectionId: input.snapshot.connectionId, + connector: input.connector, + context: input.context, + dataSourceType: input.snapshot.driver, + supportsNestedAnalysis: input.connector.capabilities.nestedAnalysis, + table: { + catalog: table.catalog, + db: table.db, + name: table.name, + rawDescriptions: table.comment ? { db: table.comment } : {}, + columns: table.columns.map((column) => ({ + name: column.name, + type: column.nativeType, + ...(column.comment ? { rawDescriptions: { db: column.comment } } : {}), + })), + }, + }); + update = { + table: tableRef(table), + tableDescription: batched.tableDescription, + columnDescriptions: Object.fromEntries(batched.columnDescriptions), + }; + } catch (error) { + if (input.context.signal?.aborted) { + throw error; + } + const message = error instanceof Error ? 
error.message : String(error); + input.context.logger?.warn(`[enrich] table ${table.name} failed: ${message}`); + warningSink?.push({ + code: 'enrichment_failed', + message: `Failed to generate description for ${table.name}: ${message}`, + table: table.name, + recoverable: true, + metadata: {}, + }); + update = nullDescriptionUpdate(table); + } + if (isEnrichedDescriptionUpdate(update)) { + enrichedById.set(tableRefKey(tableRef(table)), update); + pendingChanged.add(table.name); + sinceFlush += 1; + await flush(false); + } }), ), ); - updates.push(...tableUpdates); + await flush(true); await input.progress?.update(1, `Generated descriptions for ${totalTables} tables`); - return updates; + // Full set in snapshot order: recovered + freshly enriched, null for any still-failed. + return input.snapshot.tables.map((table) => enrichedById.get(tableRefKey(tableRef(table))) ?? nullDescriptionUpdate(table)); } -async function buildEmbeddings(input: { - snapshot: KtxSchemaSnapshot; - embedding: KtxEmbeddingPort; - descriptions: KtxLocalScanEnrichmentResult['descriptionUpdates']; - progress?: KtxProgressPort; -}): Promise<{ updates: KtxEmbeddingUpdate[]; byColumnId: Map }> { - const descriptionByTable = new Map(input.descriptions.map((item) => [item.table.name, item])); +// The exact per-column text fed to the embedding model. Shared by the embeddings +// stage and the descriptionDigest so the embeddings hash content-addresses the +// real text the model sees (D4). 
+function buildKtxColumnEmbeddingTexts( + snapshot: KtxSchemaSnapshot, + descriptions: KtxLocalScanEnrichmentResult['descriptionUpdates'], +): Array<{ columnId: string; text: string }> { + const descriptionByTable = new Map(descriptions.map((item) => [tableRefKey(item.table), item])); const texts: Array<{ columnId: string; text: string }> = []; - - for (const table of input.snapshot.tables) { - const tableDescriptions = descriptionByTable.get(table.name); + for (const table of snapshot.tables) { + const tableDescriptions = descriptionByTable.get(tableRefKey(tableRef(table))); for (const column of table.columns) { - const id = columnId(table, column); const text = buildKtxColumnEmbeddingText({ tableName: table.name, columnName: column.name, @@ -364,9 +522,18 @@ async function buildEmbeddings(input: { incoming: [], }, }); - texts.push({ columnId: id, text }); + texts.push({ columnId: columnId(table, column), text }); } } + return texts; +} + +async function buildEmbeddings(input: { + embedding: KtxEmbeddingPort; + texts: Array<{ columnId: string; text: string }>; + progress?: KtxProgressPort; +}): Promise<{ updates: KtxEmbeddingUpdate[]; byColumnId: Map }> { + const texts = input.texts; const embeddings: number[][] = []; const maxBatchSize = embeddingBatchSize(input.embedding.maxBatchSize); @@ -416,17 +583,26 @@ async function runEnrichmentStage(input: { resumedStages: KtxScanEnrichmentStage[]; completedStages: KtxScanEnrichmentStage[]; failedStages: KtxScanEnrichmentStage[]; + /** + * When true the stage re-enters compute() even if a completed row matches, + * skipping the spec-19 short-circuit. The intent of naming a stage in + * `--stages` is "recompute this" (D3); the inner compute() still honors the + * spec-20 per-table resume record. 
+ */ + forceRecompute?: boolean; compute: () => Promise; }): Promise { - const existing = await input.stateStore?.findCompletedStage({ - runId: input.runId, - stage: input.stage, - inputHash: input.inputHash, - }); - if (existing) { - input.resumedStages.push(input.stage); - input.completedStages.push(input.stage); - return existing.output; + if (!input.forceRecompute) { + const existing = await input.stateStore?.findCompletedStage({ + connectionId: input.connectionId, + stage: input.stage, + inputHash: input.inputHash, + }); + if (existing) { + input.resumedStages.push(input.stage); + input.completedStages.push(input.stage); + return existing.output; + } } try { @@ -493,17 +669,39 @@ export async function runLocalScanEnrichment( const state = completedKtxScanEnrichmentStateSummary(); const syncId = input.syncId ?? input.context.runId; const relationshipSettings = input.relationshipSettings ?? buildDefaultKtxProjectConfig().scan.relationships; - const inputHash = computeKtxScanEnrichmentInputHash({ - snapshot, - mode: input.mode, - detectRelationships: input.detectRelationships ?? false, - providerIdentity: input.providerIdentity ?? {}, - relationshipSettings, - }); + const llmIdentity: KtxScanLlmIdentity = input.llmIdentity ?? { model: null, baseUrlConfigured: false }; + const embeddingIdentity: KtxScanEmbeddingIdentity = input.embeddingIdentity ?? 
{ + model: null, + dimensions: null, + batchSize: null, + }; + const descriptionsHash = computeKtxDescriptionsStageHash({ snapshot, llmIdentity }); + const relationshipsHash = computeKtxRelationshipsStageHash({ snapshot, relationshipSettings, llmIdentity }); const warnings: KtxScanWarning[] = []; + const selectedStages = input.stages; + const runsStage = (stage: KtxScanEnrichmentStage): boolean => + selectedStages === undefined || selectedStages.includes(stage); + const forcesStage = (stage: KtxScanEnrichmentStage): boolean => + selectedStages !== undefined && selectedStages.includes(stage); + let descriptions: KtxLocalScanEnrichmentResult['descriptionUpdates'] = []; + let descriptionsRanThisInvocation = false; + let priorDescriptions: KtxLocalScanEnrichmentResult['descriptionUpdates'] | null | undefined; + // Best-available descriptions for the downstream stages (embeddings, + // relationships): fresh ones when descriptions ran this invocation, else the + // descriptions persisted in the on-disk _schema. Behavior follows the input + // (did descriptions run?), not which stage subset the caller selected (D5). + const resolveDownstreamDescriptions = async (): Promise => { + if (descriptionsRanThisInvocation) { + return descriptions; + } + if (priorDescriptions === undefined) { + priorDescriptions = input.loadPriorDescriptions ? await input.loadPriorDescriptions(snapshot) : null; + } + return priorDescriptions ?? []; + }; + let embeddingUpdates: KtxEmbeddingUpdate[] = []; - let schema = snapshotToKtxEnrichedSchema(snapshot); const summary: KtxScanEnrichmentSummary = { ...skippedKtxScanEnrichmentSummary }; const relationshipDetectionEnabled = relationshipSettings.enabled; const shouldDetectRelationships = @@ -514,38 +712,70 @@ export async function runLocalScanEnrichment( warnings.push(providerlessEnrichedWarning(shouldDetectRelationships)); } + // A stage explicitly named in --stages whose prerequisite is missing must be + // surfaced, never silently no-op (D2). 
+ if (selectedStages !== undefined) { + const stageEligible: Record<KtxScanEnrichmentStage, boolean> = { + descriptions: input.mode === 'enriched' && input.providers != null, + embeddings: input.mode === 'enriched' && input.providers?.embedding != null, + relationships: shouldDetectRelationships, + }; + for (const stage of selectedStages) { + if (!stageEligible[stage]) { + warnings.push({ + code: 'enrichment_stage_skipped', + message: `Requested --stages ${stage}, but it cannot run: ${stagePrerequisiteReason(stage)}.`, + recoverable: true, + metadata: { stage }, + }); + } + } + } + if (input.mode === 'enriched' && input.providers) { const providers = input.providers; - const descriptionProgress = progress?.startPhase(0.45); - descriptions = await runEnrichmentStage({ - stateStore: input.stateStore, - runId: input.context.runId, - connectionId: input.connectionId, - syncId, - mode: input.mode, - stage: 'descriptions', - inputHash, - now, - resumedStages: state.resumedStages, - completedStages: state.completedStages, - failedStages: state.failedStages, - compute: () => - generateDescriptions({ - snapshot, - connector: input.connector, - context: input.context, - providers, - progress: descriptionProgress, - warnings, - }), - }); - summary.dataDictionary = input.connector.sampleColumn ?
'completed' : 'skipped'; - summary.tableDescriptions = 'completed'; - summary.columnDescriptions = 'completed'; + if (runsStage('descriptions')) { + const descriptionProgress = progress?.startPhase(0.45); + descriptions = await runEnrichmentStage({ + stateStore: input.stateStore, + runId: input.context.runId, + connectionId: input.connectionId, + syncId, + mode: input.mode, + stage: 'descriptions', + inputHash: descriptionsHash, + now, + forceRecompute: forcesStage('descriptions'), + resumedStages: state.resumedStages, + completedStages: state.completedStages, + failedStages: state.failedStages, + compute: () => + generateDescriptions({ + snapshot, + connector: input.connector, + context: input.context, + providers, + inputHash: descriptionsHash, + resumeStore: input.descriptionResumeStore, + progress: descriptionProgress, + warnings, + }), + }); + descriptionsRanThisInvocation = true; + summary.dataDictionary = input.connector.sampleColumn ? 'completed' : 'skipped'; + summary.tableDescriptions = 'completed'; + summary.columnDescriptions = 'completed'; + } - const embeddingProgress = progress?.startPhase(0.2); const embedding = providers.embedding; - if (embedding) { + if (embedding && runsStage('embeddings')) { + const embeddingProgress = progress?.startPhase(0.2); + const embeddingTexts = buildKtxColumnEmbeddingTexts(snapshot, await resolveDownstreamDescriptions()); + const embeddingsHash = computeKtxEmbeddingsStageHash({ + snapshot, + embeddingIdentity, + descriptionDigest: computeKtxScanDescriptionDigest(embeddingTexts.map((item) => item.text)), + }); embeddingUpdates = await runEnrichmentStage({ stateStore: input.stateStore, runId: input.context.runId, @@ -553,22 +783,21 @@ export async function runLocalScanEnrichment( syncId, mode: input.mode, stage: 'embeddings', - inputHash, + inputHash: embeddingsHash, now, + forceRecompute: forcesStage('embeddings'), resumedStages: state.resumedStages, completedStages: state.completedStages, failedStages: 
state.failedStages, compute: async () => { const embeddings = await buildEmbeddings({ - snapshot, embedding, - descriptions, + texts: embeddingTexts, progress: embeddingProgress, }); return embeddings.updates; }, }); - schema = snapshotToKtxEnrichedSchema(snapshot, embeddingsByColumnId(embeddingUpdates)); summary.embeddings = 'completed'; } } @@ -577,9 +806,40 @@ export async function runLocalScanEnrichment( let relationshipProfile: KtxRelationshipProfileArtifact | null = null; let resolvedRelationships: KtxResolvedRelationshipDiscoveryCandidate[] | null = null; let compositeRelationships: KtxCompositeRelationshipCandidate[] | null = null; + let relationshipPartial: { reason: KtxRelationshipDetectionStopReason } | null = null; let relationships: KtxScanRelationshipSummary = { accepted: 0, review: 0, rejected: 0, skipped: 0 }; - if (shouldDetectRelationships) { + + // Promote the paid descriptions + embeddings to the queryable layer at the + // cost boundary, before the slow, kill-prone relationship stage — so an + // interrupted relationship stage degrades to "no joins," never "no descriptions." + if (shouldDetectRelationships && summary.tableDescriptions === 'completed' && input.onCheckpoint) { + await input.onCheckpoint({ + snapshot, + summary: { ...summary }, + relationships, + state: summarizeKtxScanEnrichmentState(state), + warnings: [...warnings], + descriptionUpdates: descriptions, + embeddingUpdates, + relationshipUpdate: null, + relationshipProfile: null, + resolvedRelationships: null, + compositeRelationships: null, + relationshipPartial: null, + }); + } + + if (shouldDetectRelationships && runsStage('relationships')) { const relationshipProgress = progress?.startPhase(0.25); + // Relationship detection (incl. llmProposals) runs against the + // best-available descriptions + this run's embeddings, so the join-proposal + // prompt carries descriptions on both the full-run and relationships-only + // paths (D5). 
Embeddings are this run's only — they are not re-hydrated. + const relationshipSchema = snapshotToKtxEnrichedSchema( + snapshot, + embeddingsByColumnId(embeddingUpdates), + await resolveDownstreamDescriptions(), + ); const relationshipStage = await runEnrichmentStage({ stateStore: input.stateStore, runId: input.context.runId, @@ -587,8 +847,9 @@ export async function runLocalScanEnrichment( syncId, mode: input.mode, stage: 'relationships', - inputHash, + inputHash: relationshipsHash, now, + forceRecompute: forcesStage('relationships'), resumedStages: state.resumedStages, completedStages: state.completedStages, failedStages: state.failedStages, @@ -598,10 +859,12 @@ export async function runLocalScanEnrichment( connectionId: input.connectionId, dialect, connector: input.connector, - schema, + schema: relationshipSchema, context: input.context, settings: relationshipSettings, llmRuntime: input.providers?.llmRuntime ?? null, + ...(relationshipProgress ? { progress: relationshipProgress } : {}), + ...(input.now ? { now: () => input.now!().getTime() } : {}), }); await relationshipProgress?.update( @@ -617,6 +880,7 @@ export async function runLocalScanEnrichment( statisticalValidation: detection.statisticalValidation, llmRelationshipValidation: detection.llmRelationshipValidation, warnings: detection.warnings, + partial: detection.partial, }; }, }); @@ -629,21 +893,77 @@ export async function runLocalScanEnrichment( resolvedRelationships = relationshipStage.resolvedRelationships; compositeRelationships = relationshipStage.compositeRelationships; relationships = relationshipStage.relationships; + relationshipPartial = relationshipStage.partial; warnings.push(...relationshipStage.warnings); + if (relationshipPartial) { + warnings.push({ + code: 'relationship_detection_partial', + message: + relationshipPartial.reason === 'aborted' + ? 'Relationship detection was cancelled before completing; the joins found so far are partial.' 
+ : 'Relationship detection hit its wall-clock budget (scan.relationships.detectionBudgetMs) before completing; the joins found so far are partial. Raise the budget to run a fuller pass.', + recoverable: true, + metadata: { reason: relationshipPartial.reason }, + }); + } + } + + // Derived staleness: after a selective run, surface (never silently leave) any + // unselected stage whose stored hash no longer matches its current inputs (D4). + // The embeddings hash includes the description digest, so a re-describe makes + // embeddings diverge here; relationships are deliberately decoupled (D5) and so + // never diverge from a description change. + if (selectedStages !== undefined && input.stateStore) { + const currentStageHash: Record<KtxScanEnrichmentStage, () => Promise<string>> = { + descriptions: () => Promise.resolve(descriptionsHash), + relationships: () => Promise.resolve(relationshipsHash), + embeddings: async () => { + const embeddingTexts = buildKtxColumnEmbeddingTexts(snapshot, await resolveDownstreamDescriptions()); + return computeKtxEmbeddingsStageHash({ + snapshot, + embeddingIdentity, + descriptionDigest: computeKtxScanDescriptionDigest(embeddingTexts.map((item) => item.text)), + }); + }, + }; + for (const stage of KTX_SCAN_ENRICHMENT_STAGES) { + if (selectedStages.includes(stage)) { + continue; + } + const completed = await input.stateStore.findLatestCompletedStage({ connectionId: input.connectionId, stage }); + if (!completed) { + continue; + } + if (completed.inputHash !== (await currentStageHash[stage]())) { + warnings.push({ + code: 'enrichment_stage_stale', + message: `The ${stage} enrichment stage is now stale: its inputs changed since it last ran.
Refresh it with \`ktx ingest ${input.connectionId} --stages ${stage}\`.`, + recoverable: true, + metadata: { stage }, + }); + } + } } await progress?.update(1, 'Enrichment complete'); + // The manifest merge treats ai/db descriptions as scan-managed and overwrites + // them with whatever this run emits, so a subset run that skips descriptions + // must still emit the prior on-disk ones — else the write deletes them (D3 + // "unselected stages are left untouched on disk"). Fresh-this-run if descriptions + // ran, else loaded from the on-disk _schema. + const writtenDescriptionUpdates = await resolveDownstreamDescriptions(); return { snapshot, summary, relationships, state: summarizeKtxScanEnrichmentState(state), warnings, - descriptionUpdates: descriptions, + descriptionUpdates: writtenDescriptionUpdates, embeddingUpdates, relationshipUpdate, relationshipProfile, resolvedRelationships, compositeRelationships, + relationshipPartial, }; } diff --git a/packages/cli/src/context/scan/local-scan.ts b/packages/cli/src/context/scan/local-scan.ts index cc4c47f6..e9644726 100644 --- a/packages/cli/src/context/scan/local-scan.ts +++ b/packages/cli/src/context/scan/local-scan.ts @@ -6,25 +6,36 @@ import { getLocalStageOnlyIngestStatus, type LocalIngestRunRecord, runLocalStage import type { SourceAdapter } from '../../context/ingest/types.js'; import { createLocalKtxLlmRuntimeFromConfig } from '../../context/llm/local-config.js'; import { KtxScanEmbeddingPortAdapter } from '../../context/llm/embedding-port.js'; -import type { KtxProjectLlmConfig, KtxScanEnrichmentConfig, KtxScanRelationshipConfig } from '../project/config.js'; +import type { KtxProjectLlmConfig, KtxScanEnrichmentConfig } from '../project/config.js'; import type { KtxLocalProject } from '../../context/project/project.js'; import { ktxLocalStateDbPath } from '../project/local-state-db.js'; import { redactKtxScanReport } from './credentials.js'; import { resolveEnabledTables } from './enabled-tables.js'; -import { 
completedKtxScanEnrichmentStateSummary } from './enrichment-state.js'; +import { + completedKtxScanEnrichmentStateSummary, + type KtxScanEmbeddingIdentity, + type KtxScanLlmIdentity, +} from './enrichment-state.js'; import { failedKtxScanEnrichmentSummary, ktxScanErrorMessage } from './enrichment-summary.js'; import { createDeterministicLocalScanEnrichmentProviders, type KtxLocalScanEnrichmentProviders, runLocalScanEnrichment, } from './local-enrichment.js'; -import { writeLocalScanEnrichmentArtifacts, writeLocalScanManifestShards } from './local-enrichment-artifacts.js'; +import { + createKtxScanDescriptionResumeStore, + loadOnDiskDescriptionUpdates, + writeLocalScanEnrichmentArtifacts, + writeLocalScanEnrichmentCheckpoint, + writeLocalScanManifestShards, +} from './local-enrichment-artifacts.js'; import { readLocalScanStructuralSnapshot } from './local-structural-artifacts.js'; import { SqliteLocalScanEnrichmentStateStore } from './sqlite-local-enrichment-state-store.js'; import type { KtxConnectionDriver, KtxProgressPort, KtxScanConnector, + KtxScanEnrichmentStage, KtxScanEnrichmentStateSummary, KtxScanMode, KtxScanReport, @@ -68,6 +79,8 @@ export interface RunLocalScanOptions { connectionId: string; mode?: KtxScanMode; detectRelationships?: boolean; + /** Enrichment stages to (re)run; omit to run all eligible stages. 
*/ + stages?: KtxScanEnrichmentStage[]; dryRun?: boolean; trigger?: KtxScanTrigger; databaseIntrospectionUrl?: string; @@ -80,6 +93,7 @@ export interface RunLocalScanOptions { enrichmentStateStore?: SqliteLocalScanEnrichmentStateStore | null; progress?: KtxProgressPort; embeddingProvider?: KtxEmbeddingProvider | null; + signal?: AbortSignal; } export interface LocalScanRunResult { @@ -233,19 +247,18 @@ function createLocalScanEnrichmentStateStore(options: RunLocalScanOptions): Sqli return new SqliteLocalScanEnrichmentStateStore({ dbPath: ktxLocalStateDbPath(options.project) }); } -function localScanProviderIdentity( - config: KtxScanEnrichmentConfig, - llmConfig: KtxProjectLlmConfig, - relationships: KtxScanRelationshipConfig, -): Record { +function localScanLlmIdentity(llmConfig: KtxProjectLlmConfig): KtxScanLlmIdentity { return { - mode: config.mode, - embeddingDimensions: config.embeddings?.dimensions ?? null, - llmModel: llmConfig.models.default ?? null, - embeddingModel: config.embeddings?.model ?? null, - batchSize: config.embeddings?.batchSize ?? null, + model: llmConfig.models.default ?? null, baseUrlConfigured: Boolean(llmConfig.provider.gateway?.base_url), - relationships, + }; +} + +function localScanEmbeddingIdentity(config: KtxScanEnrichmentConfig): KtxScanEmbeddingIdentity { + return { + model: config.embeddings?.model ?? null, + dimensions: config.embeddings?.dimensions ?? null, + batchSize: config.embeddings?.batchSize ?? null, }; } @@ -458,6 +471,13 @@ export async function runLocalScan(options: RunLocalScanOptions): Promise> | null = null; if (!reusedExistingScanArtifacts && !report.dryRun && report.artifactPaths.rawSourcesDir) { await options.progress?.update(0.7, 'Writing schema artifacts'); const rawSnapshot = await readLocalScanStructuralSnapshot({ @@ -471,12 +491,20 @@ export async function runLocalScan(options: RunLocalScanOptions): Promise + priorDescriptionUpdates + ? 
Promise.resolve(priorDescriptionUpdates) + : loadOnDiskDescriptionUpdates(options.project, options.connectionId, enrichedSnapshot), + llmIdentity: localScanLlmIdentity(options.project.config.llm), + embeddingIdentity: localScanEmbeddingIdentity(options.project.config.scan.enrichment), relationshipSettings: options.project.config.scan.relationships, now: options.now, + onCheckpoint: async (checkpoint) => { + await writeLocalScanEnrichmentCheckpoint({ + project: options.project, + connectionId: options.connectionId, + syncId: record.syncId, + driver, + enrichment: checkpoint, + dryRun: options.dryRun ?? false, + }); + }, }); const artifacts = await writeLocalScanEnrichmentArtifacts({ project: options.project, diff --git a/packages/cli/src/context/scan/local-structural-artifacts.ts b/packages/cli/src/context/scan/local-structural-artifacts.ts index 0e1eb602..2a71d3a5 100644 --- a/packages/cli/src/context/scan/local-structural-artifacts.ts +++ b/packages/cli/src/context/scan/local-structural-artifacts.ts @@ -45,8 +45,14 @@ const scanWarningCodes = new Set([ 'enrichment_failed', 'description_fallback_used', 'constraint_discovery_unauthorized', + 'object_introspection_failed', ]); +/** @internal */ +export function isKtxScanWarningCode(code: string): code is KtxScanWarning['code'] { + return scanWarningCodes.has(code as KtxScanWarning['code']); +} + function parseWarning(rawWarning: unknown, path: string): KtxScanWarning { if ( !isRecord(rawWarning) || diff --git a/packages/cli/src/context/scan/object-introspection.ts b/packages/cli/src/context/scan/object-introspection.ts new file mode 100644 index 00000000..da674f34 --- /dev/null +++ b/packages/cli/src/context/scan/object-introspection.ts @@ -0,0 +1,50 @@ +import { isNativeProgrammingFault } from '../../errors.js'; +import type { KtxScanWarning } from './types.js'; + +export interface IntrospectObjectContext { + /** Bare object name (table or view). 
*/ + object: string; + catalog?: string | null; + db?: string | null; +} + +export type IntrospectObjectOutcome<T> = { ok: true; table: T } | { ok: false; warning: KtxScanWarning }; + +function objectLabel(ctx: IntrospectObjectContext): string { + return [ctx.catalog, ctx.db, ctx.object].filter((part): part is string => Boolean(part)).join('.'); +} + +function objectIntrospectionWarning(ctx: IntrospectObjectContext, error: unknown): KtxScanWarning { + const reason = error instanceof Error ? error.message : String(error); + return { + code: 'object_introspection_failed', + message: reason, + table: ctx.object, + recoverable: true, + metadata: { + object: objectLabel(ctx), + ...(ctx.db ? { db: ctx.db } : {}), + ...(ctx.catalog ? { catalog: ctx.catalog } : {}), + }, + }; +} + +/** + * Runs a single-object metadata/profiling read and isolates its failure: a + * broken or inaccessible object becomes a recoverable warning instead of + * aborting the whole scan. Native programming faults (a ktx bug, not a broken + * object) still propagate so they are not masked as object skips.
+ */ +export async function tryIntrospectObject<T>( + ctx: IntrospectObjectContext, + fn: () => T | Promise<T>, +): Promise<IntrospectObjectOutcome<T>> { + try { + return { ok: true, table: await fn() }; + } catch (error) { + if (isNativeProgrammingFault(error)) { + throw error; + } + return { ok: false, warning: objectIntrospectionWarning(ctx, error) }; + } +} diff --git a/packages/cli/src/context/scan/relationship-composite-candidates.ts b/packages/cli/src/context/scan/relationship-composite-candidates.ts index 047e08cb..d993ce6f 100644 --- a/packages/cli/src/context/scan/relationship-composite-candidates.ts +++ b/packages/cli/src/context/scan/relationship-composite-candidates.ts @@ -1,10 +1,11 @@ import type { KtxSqlDialect } from '../connections/dialects.js'; import type { KtxEnrichedColumn, KtxEnrichedSchema, KtxEnrichedTable, KtxRelationshipType } from './enrichment-types.js'; +import type { KtxRelationshipDetectionBudget } from './relationship-detection-budget.js'; import { type KtxRelationshipProfileArtifact, type KtxRelationshipReadOnlyExecutor, } from './relationship-profiling.js'; -import type { KtxQueryResult, KtxScanContext, KtxTableRef } from './types.js'; +import type { KtxProgressPort, KtxQueryResult, KtxScanContext, KtxTableRef } from './types.js'; type KtxCompositeRelationshipStatus = 'accepted' | 'review' | 'rejected'; @@ -66,6 +67,8 @@ export interface DiscoverKtxCompositeRelationshipsInput { minPrimaryKeyUniqueness?: number; minSourceCoverage?: number; maxViolationRatio?: number; + budget?: KtxRelationshipDetectionBudget; + progress?: KtxProgressPort; } export interface DiscoverKtxCompositeRelationshipsResult { @@ -536,7 +539,13 @@ export async function discoverKtxCompositeRelationships( const primaryKeys: KtxCompositePrimaryKeyCandidate[] = []; let queryCount = 0; - for (const table of tables) { + for (const [index, table] of tables.entries()) { + if (input.budget?.check()) { + break; + } + await input.progress?.update((index + 1) / tables.length, `Probing composite keys 
${index + 1}/${tables.length}`, { + transient: true, + }); const result = await detectCompositePrimaryKeys({ connectionId: input.connectionId, dialect: input.dialect, @@ -554,6 +563,9 @@ export async function discoverKtxCompositeRelationships( const relationships: KtxCompositeRelationshipCandidate[] = []; for (const targetKey of primaryKeys) { + if (input.budget?.check()) { + break; + } const targetTable = tableByName.get(targetKey.table.name); if (!targetTable) { continue; @@ -568,6 +580,9 @@ export async function discoverKtxCompositeRelationships( } for (const sourceTable of tables) { + if (input.budget?.check()) { + break; + } if (sourceTable.id === targetTable.id) { continue; } diff --git a/packages/cli/src/context/scan/relationship-detection-budget.ts b/packages/cli/src/context/scan/relationship-detection-budget.ts new file mode 100644 index 00000000..53655863 --- /dev/null +++ b/packages/cli/src/context/scan/relationship-detection-budget.ts @@ -0,0 +1,93 @@ +export type KtxRelationshipDetectionStopReason = 'budget' | 'aborted'; + +export interface KtxRelationshipDetectionBudget { + /** + * Returns a stop reason when the relationship stage must stop scheduling new + * work, else null. Calling it at a unit boundary records the first observed + * stop so the stage can be finalized as partial. + */ + check(): KtxRelationshipDetectionStopReason | null; + /** The first stop reason observed via check(), or null if the stage ran to completion. */ + stopReason(): KtxRelationshipDetectionStopReason | null; +} + +export interface CreateKtxRelationshipDetectionBudgetInput { + budgetMs: number; + signal?: AbortSignal; + now?: () => number; +} + +export function createKtxRelationshipDetectionBudget( + input: CreateKtxRelationshipDetectionBudgetInput, +): KtxRelationshipDetectionBudget { + const now = input.now ?? 
(() => Date.now()); + const deadline = now() + Math.max(0, input.budgetMs); + let tripped: KtxRelationshipDetectionStopReason | null = null; + return { + check() { + if (input.signal?.aborted) { + tripped = 'aborted'; + return 'aborted'; + } + if (now() >= deadline) { + tripped ??= 'budget'; + return 'budget'; + } + return null; + }, + stopReason() { + return tripped; + }, + }; +} + +export interface MapWithBudgetInput<TInput, TOutput> { + inputs: readonly TInput[]; + concurrency: number; + budget?: KtxRelationshipDetectionBudget; + onStart?: (index: number, total: number, item: TInput) => Promise<void> | void; + mapOne: (item: TInput, index: number) => Promise<TOutput>; +} + +export interface MapWithBudgetResult<TOutput> { + /** Output aligned with inputs; entries skipped on budget exhaustion are undefined. */ + results: Array<TOutput | undefined>; + processedCount: number; +} + +/** + * Concurrent map that stops claiming new items once the budget trips. In-flight + * items finish; pending items are left undefined. With no budget it is a plain + * bounded-concurrency map. + */ +export async function mapWithBudget<TInput, TOutput>( + input: MapWithBudgetInput<TInput, TOutput>, +): Promise<MapWithBudgetResult<TOutput>> { + const total = input.inputs.length; + const results: Array<TOutput | undefined> = new Array(total); + const safeConcurrency = Math.max(1, Math.floor(input.concurrency)); + let nextIndex = 0; + let processedCount = 0; + + async function worker(): Promise<void> { + while (true) { + const index = nextIndex; + if (index >= total) { + return; + } + // Check the budget only when work remains, so a deadline that elapses + // after the last item is claimed never marks a fully-processed stage partial. 
+ if (input.budget?.check()) { + return; + } + nextIndex += 1; + const item = input.inputs[index] as TInput; + await input.onStart?.(index, total, item); + results[index] = await input.mapOne(item, index); + processedCount += 1; + } + } + + await Promise.all(Array.from({ length: Math.min(safeConcurrency, total) }, () => worker())); + return { results, processedCount }; +} diff --git a/packages/cli/src/context/scan/relationship-diagnostics.ts b/packages/cli/src/context/scan/relationship-diagnostics.ts index 2437b21e..dc3cb9a5 100644 --- a/packages/cli/src/context/scan/relationship-diagnostics.ts +++ b/packages/cli/src/context/scan/relationship-diagnostics.ts @@ -79,6 +79,8 @@ export interface KtxRelationshipDiagnosticsArtifact { generatedAt: string; summary: KtxRelationshipDiagnosticsSummary; noAcceptedReason: string | null; + partial: boolean; + partialReason: string | null; candidateCountsBySource: Record; validation: KtxRelationshipDiagnosticsValidation; thresholds: KtxRelationshipDiagnosticsThresholds; @@ -101,6 +103,7 @@ export interface BuildKtxRelationshipDiagnosticsInput { warnings?: readonly KtxScanWarning[]; thresholds?: Partial; policy?: Partial; + partial?: { reason: string } | null; generatedAt?: string; } @@ -352,6 +355,8 @@ export function buildKtxRelationshipDiagnostics( generatedAt: input.generatedAt ?? new Date().toISOString(), summary, noAcceptedReason: noAcceptedReason({ artifacts: input.artifacts, profile: input.profile }), + partial: Boolean(input.partial), + partialReason: input.partial?.reason ?? 
null, candidateCountsBySource: candidateCountsBySource(input.artifacts), validation: { available: input.profile.sqlAvailable, diff --git a/packages/cli/src/context/scan/relationship-discovery.ts b/packages/cli/src/context/scan/relationship-discovery.ts index 2052d5b7..c4202413 100644 --- a/packages/cli/src/context/scan/relationship-discovery.ts +++ b/packages/cli/src/context/scan/relationship-discovery.ts @@ -11,6 +11,11 @@ import { discoverKtxCompositeRelationships, type KtxCompositeRelationshipCandidate, } from './relationship-composite-candidates.js'; +import { + createKtxRelationshipDetectionBudget, + type KtxRelationshipDetectionBudget, + type KtxRelationshipDetectionStopReason, +} from './relationship-detection-budget.js'; import { collectKtxFormalMetadataRelationships } from './relationship-formal-metadata.js'; import { type KtxResolvedRelationshipDiscoveryCandidate, @@ -25,6 +30,7 @@ import { } from './relationship-profiling.js'; import { validateKtxRelationshipDiscoveryCandidates } from './relationship-validation.js'; import type { + KtxProgressPort, KtxScanConnector, KtxScanContext, KtxScanEnrichmentSummary, @@ -40,6 +46,8 @@ export interface DiscoverKtxRelationshipsInput { context: KtxScanContext; settings: KtxScanRelationshipConfig; llmRuntime?: KtxLlmRuntimePort | null; + progress?: KtxProgressPort; + now?: () => number; } export interface DiscoverKtxRelationshipsResult { @@ -51,6 +59,7 @@ export interface DiscoverKtxRelationshipsResult { statisticalValidation: KtxScanEnrichmentSummary['statisticalValidation']; llmRelationshipValidation: KtxScanEnrichmentSummary['llmRelationshipValidation']; warnings: KtxScanWarning[]; + partial: { reason: KtxRelationshipDetectionStopReason } | null; } function relationshipFromResolved(candidate: KtxResolvedRelationshipDiscoveryCandidate): KtxEnrichedRelationship { @@ -128,6 +137,8 @@ async function detectCompositeRelationships(input: { executor: KtxRelationshipReadOnlyExecutor | null; context: 
DiscoverKtxRelationshipsInput['context']; warnings: KtxScanWarning[]; + budget: KtxRelationshipDetectionBudget; + progress?: KtxProgressPort; }): Promise { if (!input.executor || !input.profile.sqlAvailable || !input.dialect) { return []; @@ -141,6 +152,8 @@ async function detectCompositeRelationships(input: { profiles: input.profile, executor: input.executor, ctx: input.context, + budget: input.budget, + ...(input.progress ? { progress: input.progress } : {}), }); for (const warning of compositeDetection.warnings) { input.warnings.push({ @@ -220,6 +233,11 @@ export async function discoverKtxRelationships( input: DiscoverKtxRelationshipsInput, ): Promise { const { executor, warnings } = sqlExecutor(input); + const budget = createKtxRelationshipDetectionBudget({ + budgetMs: input.settings.detectionBudgetMs, + ...(input.context.signal ? { signal: input.context.signal } : {}), + ...(input.now ? { now: input.now } : {}), + }); const formalMetadata = collectKtxFormalMetadataRelationships(input.schema); const profileCache = createKtxRelationshipProfileCache(); const profile = await profileKtxRelationshipSchema({ @@ -232,6 +250,8 @@ export async function discoverKtxRelationships( profileSampleRows: input.settings.profileSampleRows, profileConcurrency: input.settings.profileConcurrency, cache: profileCache, + budget, + ...(input.progress ? { progress: input.progress } : {}), }); const deterministicCandidates: KtxRelationshipDiscoveryCandidate[] = generateKtxRelationshipDiscoveryCandidates( input.schema, @@ -240,17 +260,21 @@ export async function discoverKtxRelationships( profiles: profile, }, ); - const llmProposalResult = input.settings.llmProposals - ? await proposeKtxRelationshipCandidatesWithLlm({ - connectionId: input.connectionId, - schema: input.schema, - profile, - llmRuntime: input.llmRuntime ?? 
null, - settings: { - maxTablesPerBatch: input.settings.maxLlmTablesPerBatch, - }, - }) - : { candidates: [], warnings: [], llmCalls: 0, summary: 'skipped' as const }; + // The LLM proposal is one more unit of relationship work, so it honors the same + // budget/abort gate as profiling, validation, and composite probing — a stage + // that already exhausted its budget (or was aborted) must not start a fresh call. + const llmProposalResult = + input.settings.llmProposals && !budget.check() + ? await proposeKtxRelationshipCandidatesWithLlm({ + connectionId: input.connectionId, + schema: input.schema, + profile, + llmRuntime: input.llmRuntime ?? null, + settings: { + maxTablesPerBatch: input.settings.maxLlmTablesPerBatch, + }, + }) + : { candidates: [], warnings: [], llmCalls: 0, summary: 'skipped' as const }; const candidates = mergeKtxRelationshipDiscoveryCandidates([ ...deterministicCandidates, ...llmProposalResult.candidates, @@ -271,6 +295,8 @@ export async function discoverKtxRelationships( concurrency: input.settings.validationConcurrency, validationBudget: input.settings.validationBudget, }, + budget, + ...(input.progress ? { progress: input.progress } : {}), }); const graph = resolveKtxRelationshipGraph({ schema: input.schema, @@ -290,6 +316,8 @@ export async function discoverKtxRelationships( executor, context: input.context, warnings, + budget, + ...(input.progress ? { progress: input.progress } : {}), }); const inferredAccepted = nonFormalAcceptedRelationships({ formalIds: formalMetadata.acceptedIds, @@ -312,6 +340,7 @@ export async function discoverKtxRelationships( resolvedRelationships: graph.relationships, }); const compositeCounts = compositeSummary(compositeRelationships); + const stopReason = budget.stopReason(); return { relationshipUpdate: { @@ -329,8 +358,11 @@ export async function discoverKtxRelationships( profile, resolvedRelationships: graph.relationships, compositeRelationships, - statisticalValidation: profile.sqlAvailable ? 
'completed' : 'skipped', + // A budget/abort stop means profiling did not finish, so report it as not + // completed even though the SQL capability was available. + statisticalValidation: profile.sqlAvailable && !stopReason ? 'completed' : 'skipped', llmRelationshipValidation: llmProposalResult.summary, warnings, + partial: stopReason ? { reason: stopReason } : null, }; } diff --git a/packages/cli/src/context/scan/relationship-llm-proposal.ts b/packages/cli/src/context/scan/relationship-llm-proposal.ts index 4a10d852..390c90e5 100644 --- a/packages/cli/src/context/scan/relationship-llm-proposal.ts +++ b/packages/cli/src/context/scan/relationship-llm-proposal.ts @@ -96,6 +96,10 @@ function rowCountForTable(profile: KtxRelationshipProfileArtifact, table: KtxEnr return profile.tables.find((item) => item.table.name.toLowerCase() === table.ref.name.toLowerCase())?.rowCount ?? null; } +function resolvedDescription(descriptions: Partial>): string | null { + return descriptions.ai ?? descriptions.db ?? null; +} + function buildEvidencePacket( schema: KtxEnrichedSchema, profile: KtxRelationshipProfileArtifact, @@ -107,13 +111,17 @@ function buildEvidencePacket( tables: schema.tables .filter((table) => table.enabled) .slice(0, settings.maxTablesPerBatch) - .map((table) => ({ + .map((table) => { + const tableDescription = resolvedDescription(table.descriptions); + return { name: table.ref.name, catalog: table.ref.catalog, db: table.ref.db, rowCount: rowCountForTable(profile, table), + ...(tableDescription ? 
{ description: tableDescription } : {}), columns: table.columns.slice(0, settings.maxColumnsPerTable).map((column) => { const columnProfile = profileForColumn(profile, table, column); + const columnDescription = resolvedDescription(column.descriptions); return { name: column.name, nativeType: column.nativeType, @@ -121,6 +129,7 @@ function buildEvidencePacket( dimensionType: column.dimensionType, nullable: column.nullable, declaredPrimaryKey: column.primaryKey, + ...(columnDescription ? { description: columnDescription } : {}), profile: columnProfile ? { rowCount: columnProfile.rowCount, @@ -133,7 +142,8 @@ function buildEvidencePacket( : null, }; }), - })), + }; + }), }; } diff --git a/packages/cli/src/context/scan/relationship-profiling.ts b/packages/cli/src/context/scan/relationship-profiling.ts index d547e350..01074792 100644 --- a/packages/cli/src/context/scan/relationship-profiling.ts +++ b/packages/cli/src/context/scan/relationship-profiling.ts @@ -1,8 +1,9 @@ import type { KtxSqlDialect } from '../connections/dialects.js'; import type { KtxEnrichedColumn, KtxEnrichedSchema, KtxEnrichedTable } from './enrichment-types.js'; -import { mapWithConcurrency } from './relationship-validation.js'; +import { type KtxRelationshipDetectionBudget, mapWithBudget } from './relationship-detection-budget.js'; import type { KtxConnectionDriver, + KtxProgressPort, KtxQueryResult, KtxReadOnlyQueryInput, KtxScanContext, @@ -65,6 +66,8 @@ export interface ProfileKtxRelationshipSchemaInput { profileSampleRows?: number; profileConcurrency?: number; cache?: KtxRelationshipProfileCache; + budget?: KtxRelationshipDetectionBudget; + progress?: KtxProgressPort; } export function createKtxRelationshipProfileCache(): KtxRelationshipProfileCache { @@ -341,10 +344,14 @@ export async function profileKtxRelationshipSchema( const dialect = input.dialect; const enabledTables = input.schema.tables.filter((candidate) => candidate.enabled); - const tableResults = await mapWithConcurrency( - 
enabledTables, - input.profileConcurrency ?? 4, - async (table) => { + const { results: tableResults } = await mapWithBudget({ + inputs: enabledTables, + concurrency: input.profileConcurrency ?? 4, + budget: input.budget, + onStart: async (index, total) => { + await input.progress?.update((index + 1) / total, `Profiling table ${index + 1}/${total}`, { transient: true }); + }, + mapOne: async (table) => { const sampleValuesPerColumn = input.sampleValuesPerColumn ?? 5; const profileSampleRows = input.profileSampleRows ?? 10000; const cacheKey = tableProfileCacheKey({ @@ -387,9 +394,12 @@ export async function profileKtxRelationshipSchema( return { cached: cachedFailure, queryCount: 0 }; } }, - ); + }); for (const result of tableResults) { + if (!result) { + continue; + } if ('tableProfile' in result) { queryTotal += result.tableProfile.queryCount; tables.push(result.tableProfile.table); diff --git a/packages/cli/src/context/scan/relationship-validation.ts b/packages/cli/src/context/scan/relationship-validation.ts index ac985eec..4f98c06e 100644 --- a/packages/cli/src/context/scan/relationship-validation.ts +++ b/packages/cli/src/context/scan/relationship-validation.ts @@ -1,12 +1,14 @@ +import { KtxQueryError } from '../../errors.js'; import type { KtxSqlDialect } from '../connections/dialects.js'; import type { KtxRelationshipEndpoint } from './enrichment-types.js'; import { applyKtxRelationshipValidationBudget, type KtxRelationshipValidationBudget } from './relationship-budget.js'; import type { KtxRelationshipDiscoveryCandidate } from './relationship-candidates.js'; +import { type KtxRelationshipDetectionBudget, mapWithBudget } from './relationship-detection-budget.js'; import { type KtxRelationshipProfileArtifact, type KtxRelationshipReadOnlyExecutor, } from './relationship-profiling.js'; -import type { KtxQueryResult, KtxScanContext, KtxTableRef } from './types.js'; +import type { KtxProgressPort, KtxQueryResult, KtxScanContext, KtxTableRef } from './types.js'; 
type KtxValidatedRelationshipStatus = 'accepted' | 'review' | 'rejected'; @@ -51,6 +53,8 @@ export interface ValidateKtxRelationshipDiscoveryCandidatesInput { ctx: KtxScanContext; tableCount?: number; settings?: Partial; + budget?: KtxRelationshipDetectionBudget; + progress?: KtxProgressPort; } const DEFAULT_SETTINGS: KtxRelationshipValidationSettings = { @@ -182,31 +186,10 @@ function statusFor(input: { return 'rejected'; } -export async function mapWithConcurrency( - inputs: readonly TInput[], - concurrency: number, - mapOne: (input: TInput) => Promise, -): Promise { - const safeConcurrency = Math.max(1, Math.floor(concurrency)); - const outputs: TOutput[] = new Array(inputs.length); - let nextIndex = 0; - - async function worker(): Promise { - while (nextIndex < inputs.length) { - const index = nextIndex; - nextIndex += 1; - outputs[index] = await mapOne(inputs[index] as TInput); - } - } - - await Promise.all(Array.from({ length: Math.min(safeConcurrency, inputs.length) }, () => worker())); - return outputs; -} - function reviewWithoutValidation( candidate: KtxRelationshipDiscoveryCandidate, profiles: KtxRelationshipProfileArtifact, - reason: 'validation_unavailable' | 'profile_unavailable' | 'validation_unattempted', + reason: 'validation_unavailable' | 'profile_unavailable' | 'validation_unattempted' | 'validation_query_failed', ): KtxValidatedRelationshipDiscoveryCandidate { const sourceColumn = singleRelationshipColumn(candidate.from); const targetColumn = singleRelationshipColumn(candidate.to); @@ -257,21 +240,35 @@ export async function validateKtxRelationshipDiscoveryCandidates( return reviewWithoutValidation(candidate, input.profiles, 'profile_unavailable'); } - const result = await executor.executeReadOnly( - { - connectionId: input.connectionId, - sql: buildCoverageSql({ - dialect, - childTable: candidate.from.table, - childColumn: sourceColumn, - parentTable: candidate.to.table, - parentColumn: targetColumn, - maxDistinctSourceValues: 
settings.maxDistinctSourceValues, - }), - maxRows: 1, - }, - input.ctx, - ); + let result: KtxQueryResult; + try { + result = await executor.executeReadOnly( + { + connectionId: input.connectionId, + sql: buildCoverageSql({ + dialect, + childTable: candidate.from.table, + childColumn: sourceColumn, + parentTable: candidate.to.table, + parentColumn: targetColumn, + maxDistinctSourceValues: settings.maxDistinctSourceValues, + }), + maxRows: 1, + }, + input.ctx, + ); + } catch (error) { + // A bounded-query timeout (or other query rejection) on this one coverage + // probe is best-effort: skip the candidate to review rather than aborting + // the whole validation pass. + if (error instanceof KtxQueryError) { + input.ctx.logger?.warn( + `relationship validation query skipped for ${candidate.from.table.name}.${sourceColumn} -> ${candidate.to.table.name}.${targetColumn}: ${error.message}`, + ); + return reviewWithoutValidation(candidate, input.profiles, 'validation_query_failed'); + } + throw error; + } const childDistinct = numberAt(result, 'child_distinct'); const parentDistinct = numberAt(result, 'parent_distinct'); const overlap = numberAt(result, 'overlap'); @@ -330,18 +327,29 @@ export async function validateKtxRelationshipDiscoveryCandidates( budget: settings.validationBudget, score: (candidate) => candidate.confidence, }); - const validated = await mapWithConcurrency( - budgeted.toValidate.map((entry) => entry.candidate), - settings.concurrency, - validateCandidate, - ); + const { results: validated } = await mapWithBudget({ + inputs: budgeted.toValidate, + concurrency: settings.concurrency, + budget: input.budget, + onStart: async (index, total) => { + await input.progress?.update((index + 1) / total, `Validating candidate ${index + 1}/${total}`, { + transient: true, + }); + }, + mapOne: (entry) => validateCandidate(entry.candidate), + }); const byOriginalIndex = new Map(); for (let index = 0; index < budgeted.toValidate.length; index += 1) { - const 
originalIndex = budgeted.toValidate[index]?.originalIndex; - const candidate = validated[index]; - if (originalIndex !== undefined && candidate) { - byOriginalIndex.set(originalIndex, candidate); + const entry = budgeted.toValidate[index]; + if (!entry) { + continue; } + // A candidate left unvalidated by the wall-clock budget degrades to the + // same review status as one deferred by the validation count budget. + byOriginalIndex.set( + entry.originalIndex, + validated[index] ?? reviewWithoutValidation(entry.candidate, input.profiles, 'validation_unattempted'), + ); } for (const entry of budgeted.deferred) { byOriginalIndex.set( diff --git a/packages/cli/src/context/scan/sqlite-local-enrichment-state-store.ts b/packages/cli/src/context/scan/sqlite-local-enrichment-state-store.ts index e4570de4..50f649d4 100644 --- a/packages/cli/src/context/scan/sqlite-local-enrichment-state-store.ts +++ b/packages/cli/src/context/scan/sqlite-local-enrichment-state-store.ts @@ -61,6 +61,9 @@ function isSafeRunId(runId: string): boolean { return /^[a-zA-Z0-9][a-zA-Z0-9_.-]*$/.test(runId); } +const STAGES_TABLE = 'local_scan_enrichment_stages'; +const STAGES_PRIMARY_KEY = ['connection_id', 'stage', 'input_hash'] as const; + export class SqliteLocalScanEnrichmentStateStore implements KtxScanEnrichmentStateStore { private readonly db: Database.Database; @@ -68,6 +71,10 @@ export class SqliteLocalScanEnrichmentStateStore implements KtxScanEnrichmentSta mkdirSync(dirname(options.dbPath), { recursive: true }); this.db = new Database(options.dbPath); this.db.pragma('journal_mode = WAL'); + // Disposable local resume cache: if a prior ktx wrote the table with a + // different primary key, drop it rather than migrate. Losing it only means + // one ingest cannot resume; it never corrupts a queryable artifact. 
+ this.dropStagesTableIfPrimaryKeyDiffers(); this.db.exec(` CREATE TABLE IF NOT EXISTS local_scan_enrichment_stages ( run_id TEXT NOT NULL, @@ -80,32 +87,53 @@ export class SqliteLocalScanEnrichmentStateStore implements KtxScanEnrichmentSta output_json TEXT, error_message TEXT, updated_at TEXT NOT NULL, - PRIMARY KEY (run_id, stage) + PRIMARY KEY (connection_id, stage, input_hash) ); + CREATE INDEX IF NOT EXISTS local_scan_enrichment_stages_content_idx + ON local_scan_enrichment_stages (connection_id, stage, input_hash, updated_at); CREATE INDEX IF NOT EXISTS local_scan_enrichment_stages_run_idx ON local_scan_enrichment_stages (run_id, updated_at, stage); `); } + private dropStagesTableIfPrimaryKeyDiffers(): void { + const columns = this.db.prepare(`PRAGMA table_info(${STAGES_TABLE})`).all() as Array<{ + name: string; + pk: number; + }>; + if (columns.length === 0) { + return; + } + const primaryKey = columns + .filter((column) => column.pk > 0) + .sort((left, right) => left.pk - right.pk) + .map((column) => column.name); + const matches = + primaryKey.length === STAGES_PRIMARY_KEY.length && + primaryKey.every((name, index) => name === STAGES_PRIMARY_KEY[index]); + if (!matches) { + this.db.exec(`DROP TABLE ${STAGES_TABLE}`); + } + } + async findCompletedStage( input: KtxScanEnrichmentStageLookup, ): Promise | null> { - if (!isSafeRunId(input.runId)) { - return null; - } const row = this.db .prepare( ` SELECT * FROM local_scan_enrichment_stages - WHERE run_id = ? + WHERE connection_id = ? AND stage = ? AND input_hash = ? AND status = 'completed' + ORDER BY updated_at DESC + LIMIT 1 `, ) - .get(input.runId, input.stage, input.inputHash) as StageRow | undefined; + .get(input.connectionId, input.stage, input.inputHash) as StageRow | undefined; if (!row) { return null; @@ -114,6 +142,31 @@ export class SqliteLocalScanEnrichmentStateStore implements KtxScanEnrichmentSta return parsed.status === 'completed' ? 
parsed : null; } + async findLatestCompletedStage(input: { + connectionId: string; + stage: KtxScanEnrichmentStage; + }): Promise { + const row = this.db + .prepare( + ` + SELECT * + FROM local_scan_enrichment_stages + WHERE connection_id = ? + AND stage = ? + AND status = 'completed' + ORDER BY updated_at DESC + LIMIT 1 + `, + ) + .get(input.connectionId, input.stage) as StageRow | undefined; + + if (!row) { + return null; + } + const parsed = parseStageRow(row); + return parsed.status === 'completed' ? parsed : null; + } + async saveCompletedStage( input: Omit, 'status' | 'errorMessage'>, ): Promise { @@ -144,9 +197,8 @@ export class SqliteLocalScanEnrichmentStateStore implements KtxScanEnrichmentSta NULL, @updatedAt ) - ON CONFLICT(run_id, stage) DO UPDATE SET - input_hash = excluded.input_hash, - connection_id = excluded.connection_id, + ON CONFLICT(connection_id, stage, input_hash) DO UPDATE SET + run_id = excluded.run_id, sync_id = excluded.sync_id, mode = excluded.mode, status = excluded.status, @@ -195,9 +247,8 @@ export class SqliteLocalScanEnrichmentStateStore implements KtxScanEnrichmentSta @errorMessage, @updatedAt ) - ON CONFLICT(run_id, stage) DO UPDATE SET - input_hash = excluded.input_hash, - connection_id = excluded.connection_id, + ON CONFLICT(connection_id, stage, input_hash) DO UPDATE SET + run_id = excluded.run_id, sync_id = excluded.sync_id, mode = excluded.mode, status = excluded.status, diff --git a/packages/cli/src/context/scan/types.ts b/packages/cli/src/context/scan/types.ts index 148203ef..0d269c37 100644 --- a/packages/cli/src/context/scan/types.ts +++ b/packages/cli/src/context/scan/types.ts @@ -385,12 +385,17 @@ type KtxScanWarningCode = | 'embedding_unavailable' | 'scan_enrichment_backend_not_configured' | 'relationship_validation_failed' + | 'relationship_detection_partial' + | 'enrichment_stage_skipped' + | 'enrichment_stage_stale' | 'relationship_llm_invalid_reference' | 'relationship_llm_proposal_failed' | 'credential_redacted' | 
'enrichment_failed' + | 'enrichment_timeout' | 'description_fallback_used' - | 'constraint_discovery_unauthorized'; + | 'constraint_discovery_unauthorized' + | 'object_introspection_failed'; export interface KtxScanWarning { code: KtxScanWarningCode; diff --git a/packages/cli/src/context/sl/pglite-sl-search-prototype.ts b/packages/cli/src/context/sl/pglite-sl-search-prototype.ts index 95d07505..c8b08f83 100644 --- a/packages/cli/src/context/sl/pglite-sl-search-prototype.ts +++ b/packages/cli/src/context/sl/pglite-sl-search-prototype.ts @@ -93,7 +93,7 @@ async function loadCandidates( listed.files .map((path) => path.split('/')[1]) .filter((connectionId): connectionId is string => - typeof connectionId === 'string' && /^[a-zA-Z0-9][a-zA-Z0-9_-]*$/.test(connectionId), + typeof connectionId === 'string' && /^[a-zA-Z0-9_][a-zA-Z0-9_-]*$/.test(connectionId), ), ), ].sort(); diff --git a/packages/cli/src/context/sl/semantic-layer.service.ts b/packages/cli/src/context/sl/semantic-layer.service.ts index e81a28ac..99837247 100644 --- a/packages/cli/src/context/sl/semantic-layer.service.ts +++ b/packages/cli/src/context/sl/semantic-layer.service.ts @@ -20,7 +20,7 @@ interface WriteSourceOptions { } const SL_DIR_PREFIX = 'semantic-layer'; -const CONNECTION_ID_PATTERN = /^[a-zA-Z0-9][a-zA-Z0-9_-]*$/; +const CONNECTION_ID_PATTERN = /^[a-zA-Z0-9_][a-zA-Z0-9_-]*$/; export interface LoadAllSourcesResult { sources: SemanticLayerSource[]; diff --git a/packages/cli/src/context/sl/source-files.ts b/packages/cli/src/context/sl/source-files.ts index 6d2e361d..02a5cd48 100644 --- a/packages/cli/src/context/sl/source-files.ts +++ b/packages/cli/src/context/sl/source-files.ts @@ -39,7 +39,7 @@ export function assertSafeConnectionId(connectionId: string): string { } export function isSafeConnectionId(connectionId: string | undefined): connectionId is string { - return typeof connectionId === 'string' && /^[a-zA-Z0-9][a-zA-Z0-9_-]*$/.test(connectionId); + return typeof connectionId === 
'string' && /^[a-zA-Z0-9_][a-zA-Z0-9_-]*$/.test(connectionId); } export function sourceNameFromPath(path: string): string { diff --git a/packages/cli/src/context/sl/tools/connection-id-schema.ts b/packages/cli/src/context/sl/tools/connection-id-schema.ts index a4047128..ba0ea49a 100644 --- a/packages/cli/src/context/sl/tools/connection-id-schema.ts +++ b/packages/cli/src/context/sl/tools/connection-id-schema.ts @@ -3,4 +3,4 @@ import { z } from 'zod'; export const slToolConnectionIdSchema = z .string() .min(1) - .regex(/^[a-zA-Z0-9][a-zA-Z0-9_-]*$/, 'Connection id must be alphanumeric and may contain _ or -'); + .regex(/^[a-zA-Z0-9_][a-zA-Z0-9_-]*$/, 'Connection id must be alphanumeric and may contain _ or -'); diff --git a/packages/cli/src/context/sql-analysis/dialect-notes.ts b/packages/cli/src/context/sql-analysis/dialect-notes.ts new file mode 100644 index 00000000..d0a7c634 --- /dev/null +++ b/packages/cli/src/context/sql-analysis/dialect-notes.ts @@ -0,0 +1,49 @@ +import { readFileSync } from 'node:fs'; +import { fileURLToPath } from 'node:url'; +import type { SqlAnalysisDialect } from './ports.js'; + +// Per-engine SQL syntax notes live as markdown files under ./dialects (one per +// dialect), served by the sql_dialect_notes MCP tool. They are package-internal: +// copy-runtime-assets.mjs ships them to dist, and they are never installed onto an +// agent target. The set covers every dialect reachable from a configured warehouse +// driver; duckdb/databricks are intentionally absent because no connector produces +// them. + +/** @internal Dialects with an authored ./dialects/.md file. 
*/ +export const DIALECTS_WITH_NOTES = [ + 'postgres', + 'mysql', + 'snowflake', + 'bigquery', + 'sqlite', + 'clickhouse', + 'tsql', +] as const; + +type DialectWithNotes = (typeof DIALECTS_WITH_NOTES)[number]; + +const notesCache = new Map(); + +function readDialectNotes(dialect: DialectWithNotes): string { + const cached = notesCache.get(dialect); + if (cached !== undefined) { + return cached; + } + const path = fileURLToPath(new URL(`./dialects/${dialect}.md`, import.meta.url)); + const content = readFileSync(path, 'utf-8').trimEnd(); + notesCache.set(dialect, content); + return content; +} + +function hasNotes(dialect: SqlAnalysisDialect): dialect is DialectWithNotes { + return (DIALECTS_WITH_NOTES as readonly string[]).includes(dialect); +} + +/** + * SQL syntax notes for a resolved dialect. Falls back to `postgres` — the + * resolver's own default for unrecognized drivers — so any SQL connection yields + * usable guidance rather than an empty string. + */ +export function sqlDialectNotes(dialect: SqlAnalysisDialect): string { + return readDialectNotes(hasNotes(dialect) ? dialect : 'postgres'); +} diff --git a/packages/cli/src/context/sql-analysis/dialects/bigquery.md b/packages/cli/src/context/sql-analysis/dialects/bigquery.md new file mode 100644 index 00000000..4d469d2c --- /dev/null +++ b/packages/cli/src/context/sql-analysis/dialects/bigquery.md @@ -0,0 +1,13 @@ +**bigquery** SQL conventions: +- **FQTN:** backtick-quoted `` `project.dataset.table` `` (e.g. `` `my-proj.analytics.orders` ``); backticks are required when a name contains a dash. +- **Identifiers:** backtick to quote; column and field names are case-insensitive, dataset and table names are case-sensitive. +- **Date/time:** `DATE_TRUNC(d, MONTH)`, `EXTRACT(YEAR FROM ts)`, `PARSE_DATE('%Y-%m-%d', s)`, `FORMAT_DATE('%Y-%m', d)`, `CURRENT_DATE()`. 
+- **Series:** build a spine with `UNNEST(GENERATE_DATE_ARRAY('2023-01-01', '2023-12-01', INTERVAL 1 MONTH))` for dates (or `GENERATE_ARRAY(1, n)` for integers), then `LEFT JOIN` the aggregated facts onto it so empty periods still appear. +- **Rolling window over time:** `RANGE` frames are numeric, so range over an integer day key — `AVG(amount) OVER (ORDER BY UNIX_DATE(day) RANGE BETWEEN 29 PRECEDING AND CURRENT ROW)` is a trailing 30-day average that tolerates gaps; or build a spine (see **Series**) and use a `ROWS` frame. +- **Safe cast:** `SAFE_CAST(x AS FLOAT64)` (or `SAFE_CAST(x AS NUMERIC)`) returns `NULL` instead of erroring on a value that does not parse, so counting residual `NULL`s among non-sentinel rows catches an encoding the sample missed. +- **Safe divide:** `SAFE_DIVIDE(num, den)` returns `NULL` instead of erroring when the denominator is `0`, so a rate/ratio/share is one expression with no `CASE den = 0` guard; multiply by `100` for a percentage. Prefer it over `num / den` for any computed measure whose denominator can be zero. +- **Top-N / windows:** `QUALIFY` filters on a window result, e.g. `QUALIFY ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...) = 1`. +- **JSON:** `JSON_VALUE(col, '$.k')` returns a scalar STRING, `JSON_QUERY(col, '$.k')` returns a subtree. +- **Nested & repeated data (ARRAY / STRUCT):** the defining BigQuery shape (e.g. GA360 `ga_sessions.hits`, GA4 `event_params`/`user_properties`). Flatten a repeated column by cross-joining `UNNEST` correlated to its row — `FROM t, UNNEST(t.hits) AS h, UNNEST(h.product) AS p` — and read STRUCT fields with dot notation (`h.page.pagePath`, `p.productRevenue`). Pull one value out of a key-value parameter array with a scalar subquery: `(SELECT ep.value.int_value FROM UNNEST(event_params) AS ep WHERE ep.key = 'page_view')`. 
An `UNNEST` multiplies the parent row by the array's length, so a `COUNT(*)`/`SUM` after it double-counts the parent — count the parent key with `COUNT(DISTINCT visitId)` (or aggregate *inside* the unnest); use `LEFT JOIN UNNEST(arr)` to keep rows whose array is empty. +- **Geospatial (GEOGRAPHY):** build a point with `ST_GEOGPOINT(longitude, latitude)` — **longitude first** — or parse text with `ST_GEOGFROMTEXT(wkt)` / `ST_GEOGFROMGEOJSON(s)`. Predicates: containment `ST_CONTAINS(area, pt)` / `ST_WITHIN(pt, area)` (`ST_WITHIN(a,b)=ST_CONTAINS(b,a)`); proximity `ST_DWITHIN(g1, g2, meters)` (geodesic); distance `ST_DISTANCE(g1, g2)` (meters); overlap `ST_INTERSECTS`. For areal allocation use `ST_AREA(g)` (m²) and `ST_AREA(ST_INTERSECTION(a, b))` for the overlapping area. Prefer these predicates over hand-rolled lat/lon `BETWEEN` boxes. +- **Sharded tables:** query a wildcard table `` `dataset.events_*` `` and filter the shard with the `_TABLE_SUFFIX` pseudo-column, e.g. `WHERE _TABLE_SUFFIX BETWEEN '20240101' AND '20240131'`. The wildcard spans only the shards that exist — before a measure that pins specific dates/periods, confirm the matching shards are actually present (an absent endpoint silently yields no rows, not an error). diff --git a/packages/cli/src/context/sql-analysis/dialects/clickhouse.md b/packages/cli/src/context/sql-analysis/dialects/clickhouse.md new file mode 100644 index 00000000..17a02356 --- /dev/null +++ b/packages/cli/src/context/sql-analysis/dialects/clickhouse.md @@ -0,0 +1,9 @@ +**clickhouse** SQL conventions: +- **FQTN:** `database.table` (e.g. `analytics.orders`). +- **Identifiers:** quote with backticks (`` `Order` ``) or double quotes; identifiers are case-sensitive. +- **Date/time:** native `Date`/`DateTime` types. Bucket with `toStartOfMonth(ts)`, `toStartOfDay(ts)`, `toYYYYMM(ts)`; parse with `toDate(s)` / `parseDateTimeBestEffort(s)`; format with `formatDateTime(ts, '%Y-%m')`. 
+- **Series:** `numbers(n)` / `range(n)` generate an integer sequence; offset a start date with `addMonths(toDate('2023-01-01'), number)` (or `arrayJoin`) to form a spine, then `LEFT JOIN` the aggregated facts onto it so empty periods still appear. +- **Rolling window over time:** a numeric range frame over a `Date` column counts in days and tolerates gaps — `AVG(amount) OVER (ORDER BY day RANGE BETWEEN 29 PRECEDING AND CURRENT ROW)` is a trailing 30-day average (use seconds for a `DateTime` key; the `INTERVAL` form is unsupported); or build a spine (see **Series**) and use a `ROWS` frame. +- **Safe cast:** `toFloat64OrNull(x)` / `toDecimal64OrNull(x, s)` returns `NULL` on a value that does not parse (the `...OrZero` variants return `0` instead), so counting residual `NULL`s among non-sentinel rows catches an encoding the sample missed. +- **Top-N / windows:** use the `LIMIT n BY key` clause for n rows per key, or rank in a CTE with `ROW_NUMBER() OVER (...)` and filter outside it. +- **JSON:** extract from a String column with `JSONExtractString(col, 'k')`, `JSONExtractInt(col, 'k')`, etc.; a native `JSON`-typed column is traversed by dot path `col.k`. diff --git a/packages/cli/src/context/sql-analysis/dialects/mysql.md b/packages/cli/src/context/sql-analysis/dialects/mysql.md new file mode 100644 index 00000000..e4e2fc42 --- /dev/null +++ b/packages/cli/src/context/sql-analysis/dialects/mysql.md @@ -0,0 +1,9 @@ +**mysql** SQL conventions: +- **FQTN:** `database.table` (MySQL has no separate schema layer — a schema is a database). +- **Identifiers:** quote with backticks (`` `order` ``); table-name case-sensitivity follows the server filesystem, while column names are case-insensitive. +- **Date/time:** `DATE_FORMAT(ts, '%Y-%m')`, `STR_TO_DATE(s, fmt)`, `YEAR(ts)`/`MONTH(ts)`, `CURDATE()`, `NOW()`. +- **Series:** no series function — build a spine with a recursive CTE, e.g. 
`WITH RECURSIVE months(d) AS (SELECT '2023-01-01' UNION ALL SELECT DATE_ADD(d, INTERVAL 1 MONTH) FROM months WHERE d < '2023-12-01')`, then `LEFT JOIN` the aggregated facts onto it so empty periods still appear. +- **Rolling window over time:** a native interval range frame over a temporal order key tolerates gaps — `AVG(amount) OVER (ORDER BY day RANGE BETWEEN INTERVAL 29 DAY PRECEDING AND CURRENT ROW)` is a trailing 30-day average without a spine; guard minimum periods with `COUNT(*) OVER ()`. +- **Safe cast:** MySQL has no `TRY_CAST`, and `CAST('abc' AS DECIMAL)` returns `0` with a warning rather than erroring — guard with a pattern test first: `CASE WHEN x REGEXP '^-?[0-9.]+$' THEN CAST(x AS DECIMAL(18,4)) END` makes a value that does not parse `NULL`, so a residual-`NULL` count catches an encoding the sample missed (`REGEXP_REPLACE` can strip symbols). +- **Top-N / windows:** rank in a CTE with `ROW_NUMBER() OVER (...)` and filter outside it; use `ORDER BY ... LIMIT n` for a global top-N. +- **JSON:** `JSON_EXTRACT(col, '$.k')`, or the `col->'$.k'` / `col->>'$.k'` shortcuts (`->>` unquotes to text). diff --git a/packages/cli/src/context/sql-analysis/dialects/postgres.md b/packages/cli/src/context/sql-analysis/dialects/postgres.md new file mode 100644 index 00000000..0b69a282 --- /dev/null +++ b/packages/cli/src/context/sql-analysis/dialects/postgres.md @@ -0,0 +1,10 @@ +**postgres** SQL conventions: +- **FQTN:** `schema.table` (e.g. `public.orders`); one query targets a single database, so qualify by schema, not by database. +- **Identifiers:** unquoted names fold to lower-case; double-quote (`"Name"`) only to keep case or use a reserved word. +- **Date/time:** `date_trunc('month', ts)`, `EXTRACT(YEAR FROM ts)`, `to_char(ts, 'YYYY-MM')`, `CURRENT_DATE`; cast text to a date with `col::date`. 
+- **Series:** build a date/number spine with `generate_series('2023-01-01'::date, '2023-12-01'::date, interval '1 month')` (or `generate_series(1, n)` for integers), then `LEFT JOIN` the aggregated facts onto it so empty periods still appear. +- **Rolling window over time:** a native calendar-range frame spans real dates and tolerates gaps — `AVG(amount) OVER (ORDER BY day RANGE BETWEEN INTERVAL '29 days' PRECEDING AND CURRENT ROW)` is a trailing 30-day average without a spine; guard minimum periods with `COUNT(*) OVER ()`. +- **Integer division:** `/` between two integers truncates (`5 / 2` → `2`), so a rate or `SUM(a) / COUNT(*)` silently floors to an integer; cast one operand first — `a::numeric / b` or `a * 1.0 / b` — and round only in the final projection. +- **Safe cast:** postgres has no `TRY_CAST`; guard a text-encoded number with a pattern test before casting — `CASE WHEN x ~ '^-?[0-9.]+$' THEN x::numeric END` yields `NULL` for a value that does not parse, so counting residual `NULL`s among non-sentinel rows catches an encoding the sample missed (`regexp_replace` can strip symbols, but chained `REPLACE` is the portable default). +- **Top-N / windows:** rank in a CTE with `ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...)` and filter in the outer query, or use `DISTINCT ON (key) ... ORDER BY key, ...` for one row per key. +- **JSON:** `col->'k'` returns json, `col->>'k'` returns text, deep path `col#>>'{a,b}'`; prefer `jsonb` operators on `jsonb` columns. diff --git a/packages/cli/src/context/sql-analysis/dialects/snowflake.md b/packages/cli/src/context/sql-analysis/dialects/snowflake.md new file mode 100644 index 00000000..c7375f9e --- /dev/null +++ b/packages/cli/src/context/sql-analysis/dialects/snowflake.md @@ -0,0 +1,10 @@ +**snowflake** SQL conventions: +- **FQTN:** three-part `DATABASE.SCHEMA.TABLE` (e.g. `analytics.public.orders`). 
+- **Identifiers:** unquoted names fold to UPPER-case; double-quote for a case-sensitive or reserved name — `orders` resolves to `"ORDERS"`, which is a different object from `"orders"`. +- **Date/time:** `DATE_TRUNC('month', ts)`, `TO_DATE(s[, fmt])`, `DATEADD(day, -7, CURRENT_DATE)`, `CURRENT_DATE`. +- **Series:** generate rows with `TABLE(GENERATOR(ROWCOUNT => n))` and offset a start date via `DATEADD('month', SEQ4(), '2023-01-01')` (or a recursive CTE) to form a spine, then `LEFT JOIN` the aggregated facts onto it so empty periods still appear. +- **Rolling window over time:** a native interval range frame over a date/timestamp order key tolerates gaps — `AVG(amount) OVER (ORDER BY day RANGE BETWEEN INTERVAL '29 days' PRECEDING AND CURRENT ROW)` is a trailing 30-day average without a spine; guard minimum periods with `COUNT(*) OVER ()`. +- **Safe cast:** `TRY_TO_NUMBER(x)` / `TRY_TO_DECIMAL(x, p, s)` (or `TRY_CAST(x AS NUMBER)`) returns `NULL` instead of erroring on a value that does not parse, so a residual-`NULL` count among non-sentinel rows catches an encoding the sample missed. +- **Top-N / windows:** `QUALIFY` filters on a window result without a subquery, e.g. `QUALIFY ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...) = 1`. +- **Semi-structured (VARIANT):** traverse with a colon path and cast with `::`, e.g. `src:vehicle[0].make::string`, `payload:events.date::date`; expand arrays with `LATERAL FLATTEN`. +- **Geospatial (GEOGRAPHY):** build a point with `ST_MAKEPOINT(longitude, latitude)` — **longitude first** — or `TO_GEOGRAPHY(wkt_or_geojson)`; an area polygon from a closed ring of corner points with `ST_MAKEPOLYGON(ST_MAKELINE(ARRAY_CONSTRUCT(p1, p2, …, p1)))` (repeat the first point last to close). Predicates: proximity `ST_DWITHIN(g1, g2, meters)` (geodesic) and distance `ST_DISTANCE(g1, g2)` (meters); containment `ST_CONTAINS(area, pt)` / `ST_WITHIN(pt, area)` where `ST_WITHIN(a,b)=ST_CONTAINS(b,a)`; overlap `ST_INTERSECTS`. 
Prefer these predicates over hand-rolled lat/lon `BETWEEN` boxes. diff --git a/packages/cli/src/context/sql-analysis/dialects/sqlite.md b/packages/cli/src/context/sql-analysis/dialects/sqlite.md new file mode 100644 index 00000000..8c5a2e6e --- /dev/null +++ b/packages/cli/src/context/sql-analysis/dialects/sqlite.md @@ -0,0 +1,11 @@ +**sqlite** SQL conventions: +- **FQTN:** usually the bare `table`; `main.table` to be explicit, `attached.table` for an attached database. +- **Identifiers:** case-insensitive; double-quote (`"Name"`) to preserve a name with spaces or a keyword. +- **Date/time:** there is no native date type — values are TEXT, INTEGER, or REAL. Format and bucket with `strftime('%Y-%m', col)`, `date(col)`, `datetime(col)`, and take day differences with `julianday(a) - julianday(b)`. Confirm the stored encoding (ISO text vs Unix epoch) before comparing. +- **Series:** no series function — build a spine with a recursive CTE, e.g. `WITH RECURSIVE months(d) AS (SELECT '2023-01-01' UNION ALL SELECT date(d, '+1 month') FROM months WHERE d < '2023-12-01')`, then `LEFT JOIN` the aggregated facts onto it so empty periods still appear. +- **Rolling window over time:** there is no date-interval range frame (a `RANGE` offset needs a single numeric order key, and dates are TEXT), so build a gap-free date spine (see **Series**) and use a row frame — `AVG(amount) OVER (ORDER BY day ROWS BETWEEN 29 PRECEDING AND CURRENT ROW)` then equals a trailing 30-day average; guard minimum periods with `COUNT(*) OVER ()`. +- **Integer division:** `/` between two integers truncates (`5 / 2` → `2`), so a rate or `SUM(a) / COUNT(*)` silently floors to an integer; force real division with `a * 1.0 / b` (or `CAST(a AS REAL) / b`) and round only in the final projection. +- **Safe cast:** sqlite has no failure-signaling cast — `CAST('abc' AS REAL)` returns `0.0` and `CAST('12abc' AS REAL)` returns `12.0` (no error, no `NULL`), so an `IS NULL` coverage check silently passes. 
Detect a value that did not parse with a pattern guard before casting, e.g. `CASE WHEN cleaned NOT GLOB '*[^0-9.]*' THEN CAST(cleaned AS REAL) END` (strip any leading sign first), then count the residual `NULL`s. +- **Rounding (exact half-up at `.5` boundaries):** `ROUND(x, n)` rounds half-away-from-zero, but binary floating-point stores an exact half-way value just *below* it, so the round goes the wrong way — `ROUND(6.475, 2)` returns `6.47`, not `6.48`. When a rounded measure must match exact half-up (a displayed average, rate, or price), nudge by a tiny epsilon below display precision before rounding: `ROUND(x + 1e-9, n)` lifts `6.4749999…` back to `6.475` so it rounds to `6.48` (it leaves non-boundary values unchanged). Round once, at full precision, in the final projection — never in intermediate CTEs. +- **Top-N / windows:** rank in a CTE with `ROW_NUMBER() OVER (...)` and filter in the outer query; use `ORDER BY ... LIMIT n` for a global top-N. +- **JSON:** `json_extract(col, '$.k')`, or the `col->'$.k'` / `col->>'$.k'` operators (`->>` returns text). diff --git a/packages/cli/src/context/sql-analysis/dialects/tsql.md b/packages/cli/src/context/sql-analysis/dialects/tsql.md new file mode 100644 index 00000000..355b8d5c --- /dev/null +++ b/packages/cli/src/context/sql-analysis/dialects/tsql.md @@ -0,0 +1,10 @@ +**tsql** (SQL Server) SQL conventions: +- **FQTN:** `schema.table` (e.g. `dbo.orders`), or `database.schema.table` across databases. +- **Identifiers:** quote with square brackets (`[Order]`), or double quotes when `QUOTED_IDENTIFIER` is on; case-sensitivity is set by the database collation (commonly case-insensitive). +- **Date/time:** `DATEPART(year, ts)`, `DATEADD(day, -7, ts)`, `DATEDIFF(day, a, b)`, `CONVERT(date, ts)`, `FORMAT(ts, 'yyyy-MM')`, `GETDATE()`. +- **Series:** no series function — build a spine with a recursive CTE, e.g. 
`WITH months AS (SELECT CAST('2023-01-01' AS date) AS d UNION ALL SELECT DATEADD(month, 1, d) FROM months WHERE d < '2023-12-01')` (recursion stops at 100 levels by default; `OPTION (MAXRECURSION 0)` lifts that limit for longer spines), or a numbers/tally table, then `LEFT JOIN` the aggregated facts onto it so empty periods still appear. +- **Rolling window over time:** `RANGE` supports only `UNBOUNDED`/`CURRENT ROW` (no offset frame), so build a gap-free date spine (see **Series**) and use a row frame — `AVG(amount) OVER (ORDER BY day ROWS BETWEEN 29 PRECEDING AND CURRENT ROW)` — or a date-keyed self-join on `f.day BETWEEN DATEADD(day, -29, d.day) AND d.day`; guard minimum periods with `COUNT(*) OVER ()`. +- **Integer division:** `/` between two `int`s truncates (`5 / 2` → `2`), so a rate or `SUM(a) / COUNT(*)` silently floors to an integer; cast one operand first — `CAST(a AS decimal(18,4)) / b` or `a * 1.0 / b` — and round only in the final projection. +- **Safe cast:** `TRY_CAST(x AS DECIMAL(18,4))` (or `TRY_CONVERT(decimal(18,4), x)`) returns `NULL` instead of erroring on a value that does not parse, so a residual-`NULL` count among non-sentinel rows catches an encoding the sample missed. +- **Top-N / windows:** `SELECT TOP (n) ... ORDER BY ...` for a global top-N; for per-group, rank in a CTE with `ROW_NUMBER() OVER (...)` and filter in the outer query. +- **JSON:** `JSON_VALUE(col, '$.k')` returns a scalar, `JSON_QUERY(col, '$.k')` returns an object/array, and `OPENJSON(col)` shreds JSON into rows. 
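The dialect files above are looked up through `sqlDialectNotes`, which falls back to `postgres` when a dialect has no authored file. A minimal TypeScript sketch of that lookup follows, assuming an in-memory `NOTES` map in place of the markdown files on disk; `NOTES` and `notesFor` are illustrative names and are not part of this patch.

```typescript
// Sketch only: an in-memory map stands in for the ./dialects/*.md files that
// the real module reads from disk. NOTES and notesFor are hypothetical names.
const DIALECTS_WITH_NOTES = [
  'postgres',
  'mysql',
  'snowflake',
  'bigquery',
  'sqlite',
  'clickhouse',
  'tsql',
] as const;

type DialectWithNotes = (typeof DIALECTS_WITH_NOTES)[number];

// Placeholder bodies; the shipped files hold the full convention lists.
const NOTES: Record<DialectWithNotes, string> = {
  postgres: '**postgres** SQL conventions: ...',
  mysql: '**mysql** SQL conventions: ...',
  snowflake: '**snowflake** SQL conventions: ...',
  bigquery: '**bigquery** SQL conventions: ...',
  sqlite: '**sqlite** SQL conventions: ...',
  clickhouse: '**clickhouse** SQL conventions: ...',
  tsql: '**tsql** (SQL Server) SQL conventions: ...',
};

// Type guard: narrows an arbitrary dialect string to one that has notes.
function hasNotes(dialect: string): dialect is DialectWithNotes {
  return (DIALECTS_WITH_NOTES as readonly string[]).includes(dialect);
}

// A dialect without notes (for example duckdb) resolves to the postgres notes,
// so the caller always receives a non-empty string.
function notesFor(dialect: string): string {
  return NOTES[hasNotes(dialect) ? dialect : 'postgres'];
}

console.log(notesFor('tsql'));
console.log(notesFor('duckdb'));
```

Keeping the list of dialects as a `const` tuple lets the type guard and the record keys share one source of truth, so adding a dialect without a notes entry would be flagged by the compiler in this sketch.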
diff --git a/packages/cli/src/context/wiki/keys.ts b/packages/cli/src/context/wiki/keys.ts index 8af66373..8d7c5234 100644 --- a/packages/cli/src/context/wiki/keys.ts +++ b/packages/cli/src/context/wiki/keys.ts @@ -1,4 +1,4 @@ -const FLAT_WIKI_KEY_PATTERN = /^[a-zA-Z0-9][a-zA-Z0-9_-]*$/; +const FLAT_WIKI_KEY_PATTERN = /^[a-zA-Z0-9_][a-zA-Z0-9_-]*$/; export function suggestFlatWikiKey(key: string): string { const suggested = key diff --git a/packages/cli/src/context/wiki/knowledge-wiki.service.ts b/packages/cli/src/context/wiki/knowledge-wiki.service.ts index e87756c3..ee9a2329 100644 --- a/packages/cli/src/context/wiki/knowledge-wiki.service.ts +++ b/packages/cli/src/context/wiki/knowledge-wiki.service.ts @@ -1,7 +1,7 @@ import { createHash } from 'node:crypto'; import YAML from 'yaml'; import type { KtxEmbeddingPort } from '../../context/core/embedding.js'; -import type { KtxFileStorePort } from '../../context/core/file-store.js'; +import type { KtxFileStorePort, KtxFileWriteResult } from '../../context/core/file-store.js'; import type { KtxLogger } from '../../context/core/config.js'; import { noopLogger } from '../../context/core/config.js'; import type { ReindexWorkResult } from '../index-sync/types.js'; @@ -232,11 +232,21 @@ export class KnowledgeWikiService { author: string, authorEmail: string, commitMessage?: string, - ): Promise<void> { - await this.writePage(scope, scopeId, pageKey, frontmatter, content, author, authorEmail, commitMessage); + ): Promise<KtxFileWriteResult> { + const writeResult = await this.writePage( + scope, + scopeId, + pageKey, + frontmatter, + content, + author, + authorEmail, + commitMessage, + ); const serialized = this.serializePage(frontmatter, content); const contentHash = createHash('sha256').update(serialized).digest('hex'); await this.syncSinglePage(scope, scopeId, pageKey, frontmatter, content, contentHash); + return writeResult; } // ── Index sync (files → DB) ─────────────────────────────────── diff --git 
a/packages/cli/src/context/wiki/local-knowledge.ts b/packages/cli/src/context/wiki/local-knowledge.ts index dd9b9ad7..8f07b52c 100644 --- a/packages/cli/src/context/wiki/local-knowledge.ts +++ b/packages/cli/src/context/wiki/local-knowledge.ts @@ -21,6 +21,7 @@ export interface LocalKnowledgePage { tags: string[]; refs: string[]; slRefs: string[]; + connections: string[]; } export interface LocalKnowledgeSummary { @@ -52,6 +53,7 @@ export interface WriteLocalKnowledgePageInput { representativeSql?: string; usage?: HistoricSqlWikiUsageFrontmatter; fingerprints?: string[]; + connections?: string[]; } const LOCAL_AUTHOR = 'ktx'; @@ -75,6 +77,19 @@ function stringArray(value: unknown): string[] { return Array.isArray(value) ? value.filter((item): item is string => typeof item === 'string') : []; } +/** Coerce a YAML scalar or list into a string list — `connections` accepts a single id or a list. */ +function stringList(value: unknown): string[] { + if (typeof value === 'string') { + return value.trim().length > 0 ? [value] : []; + } + return stringArray(value); +} + +/** A page applies to `connectionId` when it is unscoped (empty) or lists that id. 
*/ +function pageMatchesConnection(connections: string[], connectionId: string | undefined): boolean { + return connectionId === undefined || connections.length === 0 || connections.includes(connectionId); +} + function knowledgePath(scope: LocalKnowledgeScope, userId: string | undefined, key: string): string { const safeKey = assertFlatWikiKey(key); if (scope === 'GLOBAL') { @@ -104,6 +119,7 @@ function parseKnowledgePage(key: string, path: string, scope: LocalKnowledgeScop tags: [], refs: [], slRefs: [], + connections: [], }; } @@ -117,6 +133,7 @@ function parseKnowledgePage(key: string, path: string, scope: LocalKnowledgeScop tags: stringArray(frontmatter.tags), refs: stringArray(frontmatter.refs), slRefs: stringArray(frontmatter.sl_refs), + connections: stringList(frontmatter.connections), }; } @@ -133,6 +150,7 @@ function serializeKnowledgePage(input: WriteLocalKnowledgePageInput): string { ...(input.representativeSql === undefined ? {} : { representative_sql: input.representativeSql }), ...(input.usage === undefined ? {} : { usage: input.usage }), ...(input.fingerprints === undefined ? {} : { fingerprints: input.fingerprints }), + ...(input.connections === undefined ? {} : { connections: input.connections }), }; return `---\n${YAML.stringify(frontmatter, { indent: 2, lineWidth: 0 }).trimEnd()}\n---\n\n${input.content.trim()}\n`; } @@ -180,7 +198,7 @@ export async function readLocalKnowledgePage( export async function listLocalKnowledgePages( project: KtxLocalProject, - input: { userId?: string } = {}, + input: { userId?: string; connectionId?: string } = {}, ): Promise { const userId = input.userId ?? 
'local'; const pages: LocalKnowledgeSummary[] = []; @@ -193,7 +211,7 @@ export async function listLocalKnowledgePages( continue; } const page = await readPageAtPath(project, key, path, scope); - if (page) { + if (page && pageMatchesConnection(page.connections, input.connectionId)) { pages.push({ key, path, scope, summary: page.summary }); } } @@ -227,6 +245,26 @@ export async function listLocalKnowledgePageKeys( return [...keys].sort(); } +/** + * Connection ids referenced by any stored page's `connections` frontmatter, + * sorted and deduped. Derived from files; an id here that is not configured in + * `ktx.yaml` is a warn-only condition (config and content evolve independently) + * and never blocks loading, searching, or reading. + */ +export async function listReferencedConnectionIds( + project: KtxLocalProject, + input: { userId?: string } = {}, +): Promise<string[]> { + const pages = await loadAllKnowledgePages(project, { userId: input.userId }); + const ids = new Set<string>(); + for (const page of pages) { + for (const id of page.connections) { + ids.add(id); + } + } + return [...ids].sort(); +} + function scorePage(page: LocalKnowledgePage, terms: string[]): number { const haystack = buildKnowledgeSearchText(page.key, page.summary, page.content, page.tags).toLowerCase(); return terms.some((term) => haystack.includes(term)) ? 
3 : 0; @@ -266,9 +304,12 @@ function tokenLaneCandidates(pages: LocalKnowledgePage[], terms: string[]) { async function loadAllKnowledgePages( project: KtxLocalProject, - input: { userId?: string } = {}, + input: { userId?: string; connectionId?: string } = {}, ): Promise { - const summaries = await listLocalKnowledgePages(project, { userId: input.userId }); + const summaries = await listLocalKnowledgePages(project, { + userId: input.userId, + connectionId: input.connectionId, + }); const pages: LocalKnowledgePage[] = []; for (const summary of summaries) { const page = await readPageAtPath(project, summary.key, summary.path, summary.scope); @@ -281,10 +322,27 @@ async function loadAllKnowledgePages( async function searchLocalKnowledgePagesWithSqlite( project: KtxLocalProject, - input: { query: string; userId?: string; embeddingService?: KtxEmbeddingPort | null; limit?: number }, + input: { + query: string; + userId?: string; + connectionId?: string; + embeddingService?: KtxEmbeddingPort | null; + limit?: number; + }, ): Promise { + // The sqlite index is shared across connections and `index.sync` deletes any + // page not in its input, so sync the FULL corpus and apply the connection + // filter only to the candidate/result set (`allowedPaths`), never to sync. const pages = await loadAllKnowledgePages(project, { userId: input.userId }); - const byPath = new Map(pages.map((page) => [page.path, page])); + const allowedPaths = new Set( + pages.filter((page) => pageMatchesConnection(page.connections, input.connectionId)).map((page) => page.path), + ); + const allowedPages = pages.filter((page) => allowedPaths.has(page.path)); + // Scope the lexical/semantic lanes inside the query so their LIMIT applies to + // in-scope rows; only narrow when a connection is requested (otherwise every + // path is allowed and the filter is a no-op). + const scopedPaths = input.connectionId === undefined ? 
undefined : [...allowedPaths]; + const byPath = new Map(allowedPages.map((page) => [page.path, page])); const embeddingService = input.embeddingService ?? null; const index = new SqliteKnowledgeIndex({ dbPath: sqliteKnowledgeDbPath(project) }); const existingPages = index.getExistingPages(); @@ -309,7 +367,7 @@ async function searchLocalKnowledgePagesWithSqlite( index.sync(indexPages); - const finalLimit = input.limit ?? Math.max(1, indexPages.length); + const finalLimit = input.limit ?? Math.max(1, allowedPages.length); const core = new HybridSearchCore(); const generators: SearchCandidateGenerator[] = [ { @@ -318,6 +376,7 @@ async function searchLocalKnowledgePagesWithSqlite( const rows = index.searchLexicalCandidates({ queryText: args.queryText, limit: args.laneCandidatePoolLimit, + allowedPaths: scopedPaths, }); return { candidates: rows.map((row) => ({ id: row.id, rank: row.rank, rawScore: row.rawScore })), @@ -327,7 +386,10 @@ async function searchLocalKnowledgePagesWithSqlite( { lane: 'token', async generate(args) { - const rows = tokenLaneCandidates(pages, args.normalizedQuery.terms).slice(0, args.laneCandidatePoolLimit); + const rows = tokenLaneCandidates(allowedPages, args.normalizedQuery.terms).slice( + 0, + args.laneCandidatePoolLimit, + ); return { candidates: rows.map((row, index) => ({ id: row.page.path, @@ -349,6 +411,7 @@ async function searchLocalKnowledgePagesWithSqlite( const rows = index.searchSemanticCandidates({ queryEmbedding, limit: args.laneCandidatePoolLimit, + allowedPaths: scopedPaths, }); return { candidates: rows @@ -387,14 +450,14 @@ async function searchLocalKnowledgePagesWithSqlite( async function searchLocalKnowledgePagesWithScan( project: KtxLocalProject, - input: { query: string; userId?: string; limit?: number }, + input: { query: string; userId?: string; connectionId?: string; limit?: number }, ): Promise { const terms = input.query .toLowerCase() .split(/\s+/) .map((term) => term.trim()) .filter(Boolean); - const pages = 
await loadAllKnowledgePages(project, { userId: input.userId }); + const pages = await loadAllKnowledgePages(project, { userId: input.userId, connectionId: input.connectionId }); const results: LocalKnowledgeSearchResult[] = []; for (const page of pages) { const score = scorePage(page, terms); @@ -416,7 +479,13 @@ async function searchLocalKnowledgePagesWithScan( export async function searchLocalKnowledgePages( project: KtxLocalProject, - input: { query: string; userId?: string; embeddingService?: KtxEmbeddingPort | null; limit?: number }, + input: { + query: string; + userId?: string; + connectionId?: string; + embeddingService?: KtxEmbeddingPort | null; + limit?: number; + }, ): Promise { if (project.config.storage.search === 'sqlite-fts5') { return searchLocalKnowledgePagesWithSqlite(project, input); diff --git a/packages/cli/src/context/wiki/sqlite-knowledge-index.ts b/packages/cli/src/context/wiki/sqlite-knowledge-index.ts index 66b30338..32726192 100644 --- a/packages/cli/src/context/wiki/sqlite-knowledge-index.ts +++ b/packages/cli/src/context/wiki/sqlite-knowledge-index.ts @@ -85,6 +85,22 @@ function parseEmbedding(raw: string | null): number[] | null { } } +/** A provided-but-empty allowlist means "no page is in scope", distinct from an absent (unfiltered) one. */ +function isEmptyAllowlist(allowedPaths: readonly string[] | undefined): boolean { + return allowedPaths !== undefined && allowedPaths.length === 0; +} + +/** Build a ` path IN (?, …)` fragment so the scope filter applies inside the query, before any LIMIT. 
*/ +function pathInClause( + keyword: 'AND' | 'WHERE', + allowedPaths: readonly string[] | undefined, +): { sql: string; params: string[] } { + if (allowedPaths === undefined || allowedPaths.length === 0) { + return { sql: '', params: [] }; + } + return { sql: ` ${keyword} path IN (${allowedPaths.map(() => '?').join(', ')})`, params: [...allowedPaths] }; +} + function normalizeFtsQuery(query: string): string { const terms = query .toLowerCase() @@ -217,23 +233,28 @@ export class SqliteKnowledgeIndex { ); } - searchLexicalCandidates(input: { queryText: string; limit: number }): WikiSqliteLaneCandidate[] { + searchLexicalCandidates(input: { + queryText: string; + limit: number; + allowedPaths?: readonly string[]; + }): WikiSqliteLaneCandidate[] { const ftsQuery = normalizeFtsQuery(input.queryText); - if (!ftsQuery) { + if (!ftsQuery || isEmptyAllowlist(input.allowedPaths)) { return []; } + const pathFilter = pathInClause('AND', input.allowedPaths); const rows = this.db .prepare( ` SELECT path, bm25(knowledge_pages_fts) AS rank FROM knowledge_pages_fts - WHERE knowledge_pages_fts MATCH ? + WHERE knowledge_pages_fts MATCH ?${pathFilter.sql} ORDER BY rank ASC, path ASC LIMIT ? 
`, ) - .all(ftsQuery, Math.max(1, input.limit)) as SearchRow[]; + .all(ftsQuery, ...pathFilter.params, Math.max(1, input.limit)) as SearchRow[]; return rows.map((row, index) => ({ id: row.path, @@ -243,16 +264,25 @@ export class SqliteKnowledgeIndex { })); } - searchSemanticCandidates(input: { queryEmbedding: number[]; limit: number }): WikiSqliteLaneCandidate[] { + searchSemanticCandidates(input: { + queryEmbedding: number[]; + limit: number; + allowedPaths?: readonly string[]; + }): WikiSqliteLaneCandidate[] { + if (isEmptyAllowlist(input.allowedPaths)) { + return []; + } + + const pathFilter = pathInClause('WHERE', input.allowedPaths); const rows = this.db .prepare( ` SELECT path, embedding_json - FROM knowledge_pages + FROM knowledge_pages${pathFilter.sql} ORDER BY path ASC `, ) - .all() as IndexedPageRow[]; + .all(...pathFilter.params) as IndexedPageRow[]; return rows .flatMap((row) => { diff --git a/packages/cli/src/context/wiki/tools/wiki-write.tool.ts b/packages/cli/src/context/wiki/tools/wiki-write.tool.ts index 4b0f1b39..72ffa1a3 100644 --- a/packages/cli/src/context/wiki/tools/wiki-write.tool.ts +++ b/packages/cli/src/context/wiki/tools/wiki-write.tool.ts @@ -35,6 +35,12 @@ const wikiWriteInputSchema = z.object({ tags: z.array(z.string()).optional(), refs: z.array(z.string()).optional(), sl_refs: z.array(z.string()).optional(), + connections: z + .union([z.string(), z.array(z.string())]) + .optional() + .describe( + 'Connection ids this page applies to. Set [connectionId] on database-specific pages (with a connection-distinctive key); omit or leave empty for org-wide content. REPLACE semantics like tags.', + ), source: z.string().optional(), intent: z.string().optional(), tables: z.array(z.string()).optional(), @@ -150,6 +156,33 @@ Keys must be flat file names, not directory paths. Use tags/source frontmatter f const resolvedTags = input.tags === undefined ? existingFm?.tags : input.tags; const resolvedRefs = input.refs === undefined ? 
existingFm?.refs : input.refs; const resolvedSlRefs = input.sl_refs === undefined ? existingFm?.sl_refs : input.sl_refs; + const incomingConnections = + input.connections === undefined + ? undefined + : typeof input.connections === 'string' + ? [input.connections] + : input.connections; + const resolvedConnections = incomingConnections === undefined ? existingFm?.connections : incomingConnections; + + // Data-loss guard: page keys are a flat global namespace, so a write whose + // incoming connection scope is disjoint from an existing same-key page would + // silently overwrite a different connection's page. Surface it instead. + const existingConnections = existingFm?.connections ?? []; + if ( + existing && + incomingConnections !== undefined && + incomingConnections.length > 0 && + existingConnections.length > 0 && + !incomingConnections.some((id) => existingConnections.includes(id)) + ) { + return { + markdown: + `Error: page "${input.key}" already exists scoped to a different connection ` + + `(connections: ${existingConnections.join(', ')}); writing it for ${incomingConnections.join(', ')} ` + + `would overwrite that page. Use a connection-distinctive key (e.g. "${input.key}_${incomingConnections[0]}").`, + structured: { success: false, key: input.key }, + }; + } let finalContent: string; const finalFm: WikiFrontmatter = { @@ -159,6 +192,7 @@ Keys must be flat file names, not directory paths. Use tags/source frontmatter f tags: resolvedTags, refs: resolvedRefs, sl_refs: resolvedSlRefs, + connections: resolvedConnections, source: input.source === undefined ? existingFm?.source : input.source, intent: input.intent === undefined ? existingFm?.intent : input.intent, tables: input.tables === undefined ? 
existingFm?.tables : input.tables, diff --git a/packages/cli/src/context/wiki/types.ts b/packages/cli/src/context/wiki/types.ts index bd54d130..abfd370c 100644 --- a/packages/cli/src/context/wiki/types.ts +++ b/packages/cli/src/context/wiki/types.ts @@ -16,6 +16,12 @@ export interface WikiFrontmatter { tags?: string[]; refs?: string[]; sl_refs?: string[]; + /** + * Connection ids this page applies to. Absent or empty ⇒ unscoped: the page + * applies to all connections. Additive metadata, orthogonal to GLOBAL/USER + * scope; it does not namespace page keys. + */ + connections?: string[]; usage_mode: 'always' | 'auto' | 'never'; sort_order?: number; source?: string; diff --git a/packages/cli/src/knowledge.ts b/packages/cli/src/knowledge.ts index 346d3d9a..1a9685e9 100644 --- a/packages/cli/src/knowledge.ts +++ b/packages/cli/src/knowledge.ts @@ -1,6 +1,7 @@ import { KtxIngestEmbeddingPortAdapter } from './context/llm/embedding-port.js'; import type { KtxEmbeddingPort } from './context/core/embedding.js'; import { loadKtxProject } from './context/project/project.js'; +import { assertConfiguredConnectionId } from './context/connections/configured-connections.js'; import { type LocalKnowledgeSearchResult, type LocalKnowledgeSummary, @@ -17,12 +18,21 @@ import { createRankBadgeFormatter, printList, type PrintListColumn } from './io/ import { emitTelemetryEvent } from './telemetry/index.js'; export type KtxKnowledgeArgs = - | { command: 'list'; projectDir: string; userId: string; output?: string; json?: boolean; cliVersion: string } + | { + command: 'list'; + projectDir: string; + userId: string; + connectionId?: string; + output?: string; + json?: boolean; + cliVersion: string; + } | { command: 'search'; projectDir: string; query: string; userId: string; + connectionId?: string; output?: string; json?: boolean; limit?: number; @@ -120,7 +130,14 @@ export async function runKtxKnowledge( try { const project = await loadKtxProject({ projectDir: args.projectDir }); if 
(args.command === 'list') { - const pages = await listLocalKnowledgePages(project, { userId: args.userId }); + const connectionId = + args.connectionId === undefined + ? undefined + : assertConfiguredConnectionId(project.config.connections, args.connectionId); + const pages = await listLocalKnowledgePages(project, { + userId: args.userId, + ...(connectionId !== undefined ? { connectionId } : {}), + }); const mode = resolveOutputMode({ explicit: args.output, json: args.json, io }); printList({ rows: pages, @@ -145,6 +162,10 @@ export async function runKtxKnowledge( return 0; } if (args.command === 'search') { + const connectionId = + args.connectionId === undefined + ? undefined + : assertConfiguredConnectionId(project.config.connections, args.connectionId); const embeddingService = await wikiSearchEmbeddingService(project, deps, { cliVersion: args.cliVersion }, io); const search = deps.searchLocalKnowledgePages ?? defaultSearchLocalKnowledgePages; const results = await search(project, { @@ -152,6 +173,7 @@ export async function runKtxKnowledge( userId: args.userId, embeddingService, limit: args.limit, + ...(connectionId !== undefined ? 
{ connectionId } : {}), }); await emitTelemetryEvent({ name: 'wiki_query_completed', diff --git a/packages/cli/src/mcp-http-server.ts b/packages/cli/src/mcp-http-server.ts index 69da9e72..cb7a05c9 100644 --- a/packages/cli/src/mcp-http-server.ts +++ b/packages/cli/src/mcp-http-server.ts @@ -5,6 +5,7 @@ import type { McpServer } from '@modelcontextprotocol/sdk/server/mcp.js'; import { StreamableHTTPServerTransport } from '@modelcontextprotocol/sdk/server/streamableHttp.js'; import { isInitializeRequest } from '@modelcontextprotocol/sdk/types.js'; import { getKtxCliPackageInfo, type KtxCliIo } from './cli-runtime.js'; +import { createMcpLogger, serializeMcpError } from './context/mcp/logger.js'; import { createKtxMcpServerFactory } from './mcp-server-factory.js'; const DEFAULT_ALLOWED_HOSTS = ['localhost', '127.0.0.1', '::1'] as const; @@ -173,6 +174,9 @@ export async function runKtxMcpHttpServer(options: RunKtxMcpHttpServerOptions): options.createMcpServer === undefined ? await (options.loadProject ?? loadKtxProject)({ projectDir: options.projectDir }) : undefined; + // One logger per process, shared by the tool layer (via the factory) and the + // transport lifecycle below. Falls back to a no-op sink for programmatic callers. + const logger = createMcpLogger(options.io ?? { stdout: { write() {} }, stderr: { write() {} } }); const createMcpServer = options.createMcpServer ?? (await createKtxMcpServerFactory({ @@ -180,6 +184,7 @@ export async function runKtxMcpHttpServer(options: RunKtxMcpHttpServerOptions): projectDir: options.projectDir, cliVersion: options.cliVersion ?? 
getKtxCliPackageInfo().version, io: options.io, + logger, })); const sessions = new Map(); @@ -189,6 +194,7 @@ export async function runKtxMcpHttpServer(options: RunKtxMcpHttpServerOptions): sessionIdGenerator: () => randomUUID(), onsessioninitialized: (sessionId) => { sessions.set(sessionId, transport); + logger.info({ sessionId }, 'session.open'); }, onsessionclosed: (sessionId) => { sessions.delete(sessionId); @@ -197,15 +203,25 @@ export async function runKtxMcpHttpServer(options: RunKtxMcpHttpServerOptions): allowedOrigins: config.allowedOrigins, enableDnsRebindingProtection: true, }); + // onclose is the universal session-end signal (clean DELETE and dropped connection both + // close the transport), so session.close is logged here rather than in onsessionclosed. transport.onclose = () => { if (transport.sessionId) { sessions.delete(transport.sessionId); + logger.info({ sessionId: transport.sessionId }, 'session.close'); } }; + transport.onerror = (error) => { + logger.error( + { ...(transport.sessionId ? 
{ sessionId: transport.sessionId } : {}), err: serializeMcpError(error) }, + 'transport.error', + ); + }; await createMcpServer().connect(transport); return transport; } + const startedAt = performance.now(); const server = createServer(async (req, res) => { const path = requestPath(req); const auth = isMcpRequestAuthorized({ path, headers: req.headers }, config); @@ -216,7 +232,8 @@ export async function runKtxMcpHttpServer(options: RunKtxMcpHttpServerOptions): if (path === '/health' && req.method === 'GET') { const port = listenerPort(server, config.port); - writeJson(res, 200, { status: 'ok', projectDir: options.projectDir, port }); + const uptimeMs = Math.round(performance.now() - startedAt); + writeJson(res, 200, { status: 'ok', projectDir: options.projectDir, port, uptimeMs }); return; } diff --git a/packages/cli/src/mcp-server-factory.ts b/packages/cli/src/mcp-server-factory.ts index 84a00253..a6323c13 100644 --- a/packages/cli/src/mcp-server-factory.ts +++ b/packages/cli/src/mcp-server-factory.ts @@ -2,6 +2,9 @@ import { KtxIngestEmbeddingPortAdapter } from './context/llm/embedding-port.js'; import { createDefaultKtxMcpServer } from './context/mcp/server.js'; import { createLocalProjectMcpContextPorts } from './context/mcp/local-project-ports.js'; import { createLocalProjectMemoryIngest } from './context/memory/local-memory.js'; +import { assertConfiguredConnectionId } from './context/connections/configured-connections.js'; +import type { KtxMcpLogger } from './context/mcp/logger.js'; +import type { MemoryIngestPort } from './context/mcp/types.js'; import type { KtxLocalProject } from './context/project/project.js'; import type { McpServer } from '@modelcontextprotocol/sdk/server/mcp.js'; import type { KtxCliIo } from './cli-runtime.js'; @@ -23,6 +26,7 @@ export async function createKtxMcpServerFactory(input: { projectDir: string; cliVersion: string; io?: KtxCliIo; + logger?: KtxMcpLogger; }): Promise<() => McpServer> { const io = input.io ?? 
noopMcpIo(); const queryExecutor = createKtxCliIngestQueryExecutor(input.project); @@ -57,13 +61,25 @@ export async function createKtxMcpServerFactory(input: { }, }); - let memoryIngest: ReturnType | undefined; + let memoryIngest: MemoryIngestPort | undefined; try { - memoryIngest = createLocalProjectMemoryIngest(input.project, { + const baseMemoryIngest = createLocalProjectMemoryIngest(input.project, { semanticLayerCompute, queryExecutor, embeddingProvider, }); + // Validate the explicit connectionId argument here so a typo is rejected with the + // configured ids before the ingest run starts; persisted page scope is validated + // separately (warn-only) and must not fail. + memoryIngest = { + ingest: (ingestInput) => { + if (ingestInput.connectionId !== undefined) { + assertConfiguredConnectionId(input.project.config.connections, ingestInput.connectionId); + } + return baseMemoryIngest.ingest(ingestInput); + }, + status: (runId) => baseMemoryIngest.status(runId), + }; } catch (error) { io.stderr.write(`ktx MCP memory_ingest disabled: ${error instanceof Error ? error.message : String(error)}\n`); } @@ -75,6 +91,7 @@ export async function createKtxMcpServerFactory(input: { userContext: { userId: 'local' }, projectDir: input.projectDir, io, + ...(input.logger ? { logger: input.logger } : {}), contextTools: { ...contextTools, ...(memoryIngest ? 
{ memoryIngest } : {}), diff --git a/packages/cli/src/mcp-stdio-server.ts b/packages/cli/src/mcp-stdio-server.ts index 840e2db9..8e7d89f6 100644 --- a/packages/cli/src/mcp-stdio-server.ts +++ b/packages/cli/src/mcp-stdio-server.ts @@ -4,6 +4,7 @@ import { loadKtxProject } from './context/project/project.js'; import type { McpServer } from '@modelcontextprotocol/sdk/server/mcp.js'; import { StdioServerTransport } from '@modelcontextprotocol/sdk/server/stdio.js'; import { getKtxCliPackageInfo, type KtxCliIo } from './cli-runtime.js'; +import { createMcpLogger, serializeMcpError } from './context/mcp/logger.js'; import { createKtxMcpServerFactory } from './mcp-server-factory.js'; export interface RunKtxMcpStdioServerOptions { @@ -25,6 +26,8 @@ export async function runKtxMcpStdioServer(options: RunKtxMcpStdioServerOptions) stdout: { write() {} }, stderr: options.io?.stderr ?? process.stderr, }; + // stdout is reserved for JSON-RPC, so the logger writes to stderr only. + const logger = createMcpLogger(protocolIo); const createMcpServer = options.createMcpServer ?? (await createKtxMcpServerFactory({ @@ -32,6 +35,7 @@ export async function runKtxMcpStdioServer(options: RunKtxMcpStdioServerOptions) projectDir: options.projectDir, cliVersion: options.cliVersion ?? getKtxCliPackageInfo().version, io: protocolIo, + logger, })); const stdin = options.stdin ?? process.stdin; const transport = new StdioServerTransport(stdin, options.stdout); @@ -50,13 +54,17 @@ export async function runKtxMcpStdioServer(options: RunKtxMcpStdioServerOptions) settle(() => reject(error instanceof Error ? 
error : new Error(String(error)))); }); }; - transport.onclose = () => settle(resolve); + transport.onclose = () => { + logger.info({}, 'session.close'); + settle(resolve); + }; transport.onerror = (error) => { - options.io?.stderr.write(`ktx MCP stdio transport error: ${error.message}\n`); + logger.error({ err: serializeMcpError(error) }, 'transport.error'); settle(() => reject(error)); }; stdin.once('end', closeTransport); stdin.once('close', closeTransport); + logger.info({}, 'session.open'); createMcpServer().connect(transport).catch((error: unknown) => { settle(() => reject(error instanceof Error ? error : new Error(String(error)))); }); diff --git a/packages/cli/src/notion-page-picker.ts b/packages/cli/src/notion-page-picker.ts index 066acec1..d98e0985 100644 --- a/packages/cli/src/notion-page-picker.ts +++ b/packages/cli/src/notion-page-picker.ts @@ -46,7 +46,7 @@ const NOTION_SCRIPTED_MODE_HINT = 'Notion picker requires a TTY. Use --no-input --notion-root-page-id for scripted mode.'; function assertSafeNotionPickerConnectionId(connectionId: string): void { - if (!/^[a-zA-Z0-9][a-zA-Z0-9_-]*$/.test(connectionId)) { + if (!/^[a-zA-Z0-9_][a-zA-Z0-9_-]*$/.test(connectionId)) { throw new Error(`Unsafe connection id: ${connectionId}`); } } diff --git a/packages/cli/src/prompts/memory_agent_external_ingest.md b/packages/cli/src/prompts/memory_agent_external_ingest.md index 5c17bd39..5d69a5f8 100644 --- a/packages/cli/src/prompts/memory_agent_external_ingest.md +++ b/packages/cli/src/prompts/memory_agent_external_ingest.md @@ -19,6 +19,8 @@ A single artifact typically produces multiple actions: one SL source per table/v All wiki writes go to the GLOBAL scope - they will be visible to every user of this ktx project. Phrase wiki pages as objective business knowledge, not personal preference. The `wiki_write` tool handles scope selection automatically for external ingest. 
+ +When a `connectionId` is shown in the prompt context, tag database-specific pages with `connections: []` and give them connection-distinctive keys (`orders_sales_db`, not `orders`) so same-concept pages from other databases do not collide or pollute each other's searches. Leave `connections` empty for org-wide knowledge that applies across every database. See the `wiki_capture` skill's "Connection scoping" section. diff --git a/packages/cli/src/public-ingest.ts b/packages/cli/src/public-ingest.ts index b38107c3..899a53d0 100644 --- a/packages/cli/src/public-ingest.ts +++ b/packages/cli/src/public-ingest.ts @@ -20,7 +20,7 @@ import { import { createAggregateProgressPort } from './progress-port-adapter.js'; import { resolvePublicIngestRuntimeRequirements } from './runtime-requirements.js'; import type { KtxScanArgs, KtxScanDeps } from './scan.js'; -import type { KtxTableRef } from './context/scan/types.js'; +import type { KtxScanEnrichmentStage, KtxTableRef } from './context/scan/types.js'; import { profileMark } from './startup-profile.js'; import { isDemoConnection } from './telemetry/demo-detect.js'; import { emitProjectStackSnapshot, emitTelemetryEvent, reportException } from './telemetry/index.js'; @@ -46,6 +46,7 @@ export type KtxPublicIngestArgs = queryHistory?: KtxPublicIngestQueryHistoryFlag; queryHistoryWindowDays?: number; scanMode?: Extract['mode']; + stages?: KtxScanEnrichmentStage[]; detectRelationships?: boolean; cliVersion?: string; runtimeInstallPolicy?: KtxManagedPythonInstallPolicy; @@ -123,6 +124,7 @@ interface KtxPublicContextBuildArgs { queryHistory?: KtxPublicIngestQueryHistoryFlag; queryHistoryWindowDays?: number; scanMode?: Extract['mode']; + stages?: KtxScanEnrichmentStage[]; detectRelationships?: boolean; cliVersion?: string; runtimeInstallPolicy?: KtxManagedPythonInstallPolicy; @@ -974,6 +976,7 @@ async function runIngestTargetSteps( mode: 'enriched', detectRelationships: target.detectRelationships === true, dryRun: false, + 
...(args.stages ? { stages: args.stages } : {}), ...(args.cliVersion ? { cliVersion: args.cliVersion } : {}), ...(args.runtimeInstallPolicy ? { runtimeInstallPolicy: args.runtimeInstallPolicy } : {}), }; @@ -1153,6 +1156,7 @@ export async function runKtxPublicIngest( ...(args.queryHistory ? { queryHistory: args.queryHistory } : {}), ...(args.queryHistoryWindowDays !== undefined ? { queryHistoryWindowDays: args.queryHistoryWindowDays } : {}), ...(args.scanMode ? { scanMode: args.scanMode } : {}), + ...(args.stages ? { stages: args.stages } : {}), ...(args.detectRelationships !== undefined ? { detectRelationships: args.detectRelationships } : {}), ...(args.cliVersion ? { cliVersion: args.cliVersion } : {}), ...(args.runtimeInstallPolicy ? { runtimeInstallPolicy: args.runtimeInstallPolicy } : {}), diff --git a/packages/cli/src/scan.ts b/packages/cli/src/scan.ts index 055f2e69..2288530f 100644 --- a/packages/cli/src/scan.ts +++ b/packages/cli/src/scan.ts @@ -1,4 +1,10 @@ -import type { KtxProgressPort, KtxScanMode, KtxScanReport, KtxScanWarning } from './context/scan/types.js'; +import type { + KtxProgressPort, + KtxScanEnrichmentStage, + KtxScanMode, + KtxScanReport, + KtxScanWarning, +} from './context/scan/types.js'; import { runLocalScan } from './context/scan/local-scan.js'; import { loadKtxProject, type KtxLocalProject } from './context/project/project.js'; import { getKtxCliPackageInfo } from './cli-runtime.js'; @@ -21,6 +27,8 @@ export interface KtxScanArgs { mode: KtxScanMode; detectRelationships: boolean; dryRun: boolean; + /** Enrichment stages to (re)run; omit to run all eligible stages. 
*/ + stages?: KtxScanEnrichmentStage[]; databaseIntrospectionUrl?: string; cliVersion?: string; runtimeInstallPolicy?: KtxManagedPythonInstallPolicy; @@ -180,8 +188,14 @@ function describeWarningGroup(code: string, count: number): string { return `${count} LLM relationship ${plural(count, 'proposal')} failed.`; case 'scan_enrichment_backend_not_configured': return 'Scan enrichment backend is not configured; AI stages were skipped.'; + case 'enrichment_stage_skipped': + return `${count} requested ${plural(count, 'enrichment stage')} could not run (prerequisite missing).`; + case 'enrichment_stage_stale': + return `${count} enrichment ${plural(count, 'stage')} are stale after a selective run; re-run them to refresh.`; case 'credential_redacted': return `${count} ${plural(count, 'credential')} were redacted from scan output.`; + case 'object_introspection_failed': + return `${count} ${plural(count, 'object')} skipped during introspection (broken or inaccessible objects were excluded; the rest were ingested).`; default: return `${count} ${plural(count, 'warning')} (${code})`; } @@ -348,6 +362,7 @@ export async function runKtxScan(args: KtxScanArgs, io: KtxCliIo = process, deps connectionId: args.connectionId, mode: args.mode, detectRelationships: args.detectRelationships, + ...(args.stages ? 
{ stages: args.stages } : {}), dryRun: args.dryRun, trigger: 'cli', databaseIntrospectionUrl: args.databaseIntrospectionUrl, diff --git a/packages/cli/src/setup-databases.ts b/packages/cli/src/setup-databases.ts index 548c03a2..7388f74a 100644 --- a/packages/cli/src/setup-databases.ts +++ b/packages/cli/src/setup-databases.ts @@ -320,7 +320,7 @@ function unique(values: string[]): string[] { } function assertSafeDatabaseConnectionId(connectionId: string): void { - if (!/^[a-zA-Z0-9][a-zA-Z0-9_-]*$/.test(connectionId)) { + if (!/^[a-zA-Z0-9_][a-zA-Z0-9_-]*$/.test(connectionId)) { throw new Error(`Unsafe connection id: ${connectionId}`); } } diff --git a/packages/cli/src/skills/analytics/SKILL.md b/packages/cli/src/skills/analytics/SKILL.md index 3fe82b1c..7724b928 100644 --- a/packages/cli/src/skills/analytics/SKILL.md +++ b/packages/cli/src/skills/analytics/SKILL.md @@ -13,12 +13,15 @@ You have access to ktx MCP tools for data discovery, semantic-layer analysis, ra - `kind: 'wiki'` -> `wiki_read` - `kind: 'sl_source'`, `kind: 'sl_measure'`, or `kind: 'sl_dimension'` -> `sl_read_source` - `kind: 'table'` or `kind: 'column'` -> `entity_details` + - For tables you intend to query, sample a few rows (`entity_details` plus a small `sql_execution` sample) to confirm date encoding, null prevalence in join/filter keys, and the real enum values — see the `` Schema-discovery rules. 3. **Resolve business values** - if the user named a value such as "Acme Corp", "enterprise", or "status=shipped", call `dictionary_search` to find which column holds it. -4. **Plan the analysis** - identify the grain, metrics, dimensions, filters, time window, and expected row limits before querying. +4. **Plan the analysis** - identify the grain, metrics, dimensions, filters, time window, and expected row limits before querying. Confirm each filter/join column's real type before comparing it (see the `` Schema-discovery rules). 
**Write down the exact output-column list first** — enumerate, from the question, every column the answer must have (each requested metric/attribute; for every grouped or named entity BOTH its id and its name; every input to each derived value) and treat that list as the contract your final `SELECT` must match column-for-column. Decide this list *before* writing SQL, not after — building the projection to a pre-stated list is far more reliable than reviewing for omissions at the end. 5. **Query** - - Prefer `sl_query` when the semantic layer covers the question. - Use `sql_execution` only for questions the semantic layer does not cover. -6. **Validate and explain** - sanity-check totals, filters, null handling, and time zones. State the source tables or semantic-layer objects used. + - Before writing raw `sql_execution` SQL against a connection, call `sql_dialect_notes` with its connection id to get that engine's FQTN, identifier-quoting, date, top-N, series/calendar, rolling-window, safe-cast, and JSON conventions. + - When authoring raw SQL, apply the `` rules: build incrementally, keep window ordering deterministic, compute at full precision, and match the answer's grain to the question. +6. **Validate and explain** - sanity-check totals, filters, null handling, and time zones. **Always run the final completeness check before emitting:** re-read the question and confirm every requested output, each named entity's identity, each derived value's inputs, and the question's grain are all in the projection — see the `` Final completeness check. If a result is unexpectedly empty or its grain looks wrong, work through the `` Answer-completeness rules to diagnose. State the source tables or semantic-layer objects used. 7. **Capture durable learnings** - call `memory_ingest` whenever a turn produces something worth remembering (business rules, metric definitions, schema gotchas, recurring findings) **or** whenever the user asks you to remember something. 
Pass markdown in `content` including any source context the memory agent should weigh. Each call is a feedback loop; better notes today mean smarter `discover_data` and `wiki_search` results tomorrow. @@ -38,6 +41,201 @@ You have access to ktx MCP tools for data discovery, semantic-layer analysis, ra - Ask a concise clarification only when the metric, date range, entity, or grain is genuinely ambiguous and cannot be inferred from context. + +Heuristics for writing *correct* (not merely runnable) SQL. Each is a default plus the reason it holds on any database; apply judgment to the question and the data. + +**Schema discovery before writing SQL** +- **Sample before you compose.** Inspect representative rows of every table you will touch (`entity_details` plus a small `sql_execution` sample) to confirm date/time encoding (`YYYYMMDD` integer vs ISO text vs epoch), null prevalence in join/filter keys, and the real set of categorical/enum values. Assumptions about encoding and nullability are the most common source of silently-wrong filters. +- **Cast to the real type before comparing.** Compare a column against a literal of its actual type in `WHERE`/`JOIN`. A string column compared to a numeric literal (or the reverse) can silently match nothing instead of raising an error. +- **Parse text-encoded numerics before doing math on them.** When a column the question treats as a number is stored as text, sample its **distinct** values (the *Sample before you compose* habit) to learn the encodings actually present — unit suffixes (`K`/`M`/`B`), currency symbols, thousands separators, percent signs, and non-numeric sentinels (`-`, `N/A`, empty) — and never infer the format from the column name. *Why:* aggregated or compared as-is the text sorts lexically (`'100' < '9'`) and a naive cast collapses formatted values to `0`/NULL, so the query runs but the number is silently wrong instead of erroring. 
+- **Strip, scale, and cast in one early CTE.** Strip currency/separator/percent characters, multiply by the suffix scale (`K`=10^3, `M`=10^6, `B`=10^9), map sentinels to `0` **or** `NULL` (by the *Default by additivity* rule below), then cast to a numeric type — all in a single early CTE so every layer above sees clean numbers. This is the *meaning-is-numeric* complement to *Cast to the real type before comparing*. *Why:* one clean conversion at the base keeps the lexical-sort-and-cast-to-0 failure out of every downstream layer.
+- **Confirm the parse covered every value.** After parsing, count the non-sentinel rows that failed to parse — a failed parse should surface as `NULL`, visible only with a **failure-detecting cast** from `sql_dialect_notes` (a plain `CAST` errors on some engines and on sqlite silently returns `0`/partial, so an `IS NULL` check is meaningless there). *Why:* an encoding the sample missed would otherwise vanish into `0`/NULL instead of being caught.
+- **Parse code/dependency text by its real grammar, not one broad regex.** When a question extracts imported/required/loaded packages or modules from stored source text or dependency manifests, parse by the *language or format*, not a single pattern: Java `import`/`import static` — drop the terminal class/member, keep the package path, and allow valid identifier segments with underscores and mixed case (e.g. com.planet_ink.coffee_mud); Python — handle both `import a, b as c` and `from a.b import c`, stripping aliases; R — handle `library(...)` and `require(...)`; notebooks (`.ipynb`) — parse the JSON and read each cell's `source` lines *before* applying the language rules (never regex the raw notebook file, whose prose contains the words "import"/"from"); JSON/manifest files — `PARSE_JSON` and flatten the dependency object's keys (e.g. `require`). Strip comments/prose lines first and split multi-import lines so each declared dependency is counted once.
*Why:* a single lowercase-segment regex silently drops real identifiers and matches prose, so the ranking is wrong though the query runs.
+- **Decide the counting population explicitly when a table is deduplicated.** If the source table is de-duplicated and carries a documented copy/occurrence count (e.g. a `copies` column = "repositories sharing this exact content"), the count grain is a real modeling choice: weight by that column only when the question's population is clearly the represented files/repositories; otherwise count the distinct stored rows. State which population the question names and match it — do not default to one silently. *Why:* on a deduplicated table `COUNT(*)` and `SUM(copies)` give different rankings, so the right metric depends on the population the question asks about, not on which is larger.
+
+```sql
+-- "Total trade volume" where value_text holds '1.2K', '3M', '$1,200', '-'.
+-- WRONG: a naive cast collapses the formatted values ('1.2K'->1.2, '$1,200'->0,
+-- '-'->0) instead of erroring, so the SUM comes back silently far too low.
+SELECT SUM(CAST(value_text AS REAL)) AS total_volume FROM metrics;
+
+-- RIGHT: strip symbols/suffixes, scale by the K/M/B suffix, map sentinels to 0, and
+-- cast once in an early CTE; the SUM then runs over clean numbers.
+WITH parsed AS (
+  SELECT CASE WHEN value_text IN ('-', 'N/A', '') THEN 0
+              ELSE CAST(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(value_text,
+                     '$', ''), ',', ''), 'K', ''), 'M', ''), 'B', '') AS DECIMAL(18, 4))
+                   * CASE WHEN value_text LIKE '%K' THEN 1000
+                          WHEN value_text LIKE '%M' THEN 1000000
+                          WHEN value_text LIKE '%B' THEN 1000000000 ELSE 1 END
+         END AS volume
+  FROM metrics
+)
+SELECT SUM(volume) AS total_volume FROM parsed;
+```
+
+- **Canonicalize observed URL-path variants before page-level analysis.** When a question groups, filters, or sequences web pages by a `path`/`url` column, sample its distinct values first.
If the data itself shows route-label variants — `/route` and `/route/` for the same page context — define a canonical page-path expression in an early CTE and use it everywhere above that CTE: preserve `/` as root, strip trailing slashes only from non-root paths, and map an observed empty path to `/` *only* when the column is a URL path and the sampled rows show blank root-page events. Do **not** merge different route names (`/input` ≠ `/regist/input`), strip query strings/fragments/host/scheme, lowercase paths, or canonicalize at all when the question asks for the raw stored URL/path or for slash-vs-no-slash differences. *Why:* raw request logs routinely store the same user-visible page both with and without a trailing slash, so grouping or sequencing the raw labels silently splits one page into several — but inventing aliases the data doesn't show would just as silently merge distinct pages.
+
+**Composition**
+- **Build incrementally.** Assemble complex queries one CTE at a time, checking each layer's output on a small sample before stacking the next; a wrong intermediate layer is far cheaper to catch early than to debug in the final number.
+- **Avoid fan-out joins — the danger is cumulative.** Any one-to-many hop on the path between a measure's owning table and the aggregate inflates that measure, even when the offending join sits several hops below the `SUM`/`COUNT` and is easy to miss. The fix is the single-hop one applied per measure-owning table along the whole chain: pre-aggregate each coarse-grained measure to its own grain in a CTE, then join the already-aggregated result.
+- **Verify the grain holds across each join.** As you compose, confirm a join you intend to be one-to-one / many-to-one did not change the grain you aggregate at — e.g. the row count (or the count of the aggregate's key) is unchanged across it.
When a join is genuinely one-to-many, reach for the default fix (pre-aggregate to grain); for a pure count, `COUNT(DISTINCT key)` is an acceptable escape hatch. A `SUM`/`AVG` of a fanned-out measure must pre-aggregate — `DISTINCT` cannot de-duplicate a sum.
+- **A join that only attaches a label must not drop rows — `LEFT JOIN` it, and key the aggregate on the fact column.** Fan-out's mirror image is just as silent: when you join a dimension table *only to fetch a display attribute* (a name for an id, a category for a product), an **incomplete** dimension — and dimensions are routinely incomplete: trimmed catalogs, late-arriving rows, slowly-changing-dimension gaps — makes a plain inner `JOIN` quietly **discard every fact row whose key has no parent**, shrinking the counts, sums, and the universe over which any share / average / median is computed (a measure halves with no error and no empty result). Two guards: (1) inner-join a dimension only when you *intend it as a filter* — you want exactly the rows that have a parent — never merely to read a column off it; for pure enrichment use `LEFT JOIN`. (2) Key the aggregation and `GROUP BY` on the **fact** column (`sales.prod_id`), not the dimension column (`products.prod_id`), so an unmatched key yields a `NULL` label on its own row rather than dropping or collapsing it. Use the same row-count check as above, but for an enrichment join confirm the fact row count is *unchanged* (not merely un-inflated); if a dimension you only wanted a name from removed rows, that is the bug.
+- **Source each filter, date, and measure from the table that OWNS it at the question's grain.** When two joined fact tables carry similarly-named columns at *different* grains — a parent (one row per order: its `status`, placement `created_at`, `num_of_item`) and its child (one row per line item: line `created_at`, `sale_price`, `cost`) — read each predicate/measure from the table whose grain the question names, not from whichever is in scope after the join. "Orders that are Complete", "for each month of the orders", "the order's creation date" are *order*-grain, so the status filter and the month bucket come from the parent order row, even though the child also has `status`/`created_at` columns; line price and cost come from the child. *Why:* the parent's and child's copies of a column diverge (an item's placement month or status can differ from its order's), so anchoring an order-grain filter or calendar on the line table silently buckets/filters the wrong rows. The mirror at metric grain: never combine a parent-grain count with child rows after the join (e.g. `num_of_item * SUM(line_price)` once per line) — compute each measure at its own grain (sum line prices to the order, take `num_of_item` once per order) before combining.
+
+```sql
+-- "How many orders per region contain a returned item?" — count each order once.
+-- WRONG: order_lines is joined to apply the line-level filter, which multiplies
+-- orders; an order with two returned lines is counted twice, three joins below
+-- the COUNT, where the inflation is easy to miss.
+SELECT r.region_id, COUNT(*) AS n_orders
+FROM regions r
+JOIN stores s ON s.region_id = r.region_id
+JOIN orders o ON o.store_id = s.store_id
+JOIN order_lines l ON l.order_id = o.order_id
+WHERE l.status = 'returned'
+GROUP BY r.region_id;
+
+-- RIGHT: collapse order_lines to one row per qualifying order first, then join up
+-- so each order contributes exactly once.
+WITH returned_orders AS (
+  SELECT order_id FROM order_lines WHERE status = 'returned' GROUP BY order_id
+)
+SELECT r.region_id, COUNT(*) AS n_orders
+FROM regions r
+JOIN stores s ON s.region_id = r.region_id
+JOIN orders o ON o.store_id = s.store_id
+JOIN returned_orders ro ON ro.order_id = o.order_id
+GROUP BY r.region_id;
+-- A pure count could also use COUNT(DISTINCT o.order_id); a SUM/AVG of an
+-- order-level measure fanned out this way must pre-aggregate — DISTINCT can't
+-- de-duplicate a sum.
+```
+
+**Ordering & aggregation determinism**
+- **Make the ordering deterministic.** Give every ranking/ordering window a complete tie-breaker by appending unique key column(s) to `ORDER BY`, so `RANK`/`ROW_NUMBER`/`LAG` results are stable instead of flickering between runs.
+- **Order inside string/array aggregation.** When concatenating rows into a delimited string or building an ordered array (`GROUP_CONCAT` / `string_agg` / `array_agg`), the element order is **undefined unless you specify it** — put an explicit `ORDER BY` on the aggregate. Be deliberate about collation: the default text sort is **binary/case-sensitive** (so `'BBQ'` sorts before `'Bacon'` because uppercase code points precede lowercase), which differs from a case-insensitive sort; pick the one the question implies and apply it consistently (`ORDER BY ... COLLATE NOCASE` for case-insensitive). *Why:* an unordered or differently-collated concatenation produces a string with the right elements in the wrong order — runnable but not matching the expected text.
+- **Emit a list-valued answer cell as a delimited STRING, not a raw ARRAY/repeated column.** When the answer needs several values in one cell (a set of names/codes/tags for an entity), build a delimited scalar with `STRING_AGG(x, ',' ORDER BY x)` (or `ARRAY_TO_STRING(ARRAY_AGG(x ORDER BY x), ',')`) — do not return a SQL `ARRAY`/repeated column. *Why:* an array column serializes to an engine-specific representation (e.g.
`['a' 'b']` or `["a","b"]`) that won't compare equal to a plain delimited list (`a,b`), so a values-correct answer still mismatches when materialized to rows. +- **Filter after the window, not before**, for sequence / "first" / "most recent" / "since" questions: compute the window over the full partition, then keep the rows you want. A pre-filter shrinks the partition the window ranks over, so "first"/"most recent" is measured against the wrong set. + +```sql +-- "Each customer's first order, restricted to orders since 2024-01-01." +-- Wrong: the filter runs before the window, so it ranks only 2024 rows and +-- misses customers whose true first order was earlier. +SELECT customer_id, order_id, + ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date, order_id) AS seq +FROM orders +WHERE order_date >= '2024-01-01'; -- then keep seq = 1 + +-- Right: rank the full partition in a CTE, then filter in the outer query. +WITH ranked AS ( + SELECT customer_id, order_id, order_date, + ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date, order_id) AS seq + FROM orders +) +SELECT customer_id, order_id, order_date +FROM ranked +WHERE seq = 1 AND order_date >= '2024-01-01'; +``` + +- **Cumulative / running total.** Use an explicit frame — `SUM(x) OVER (PARTITION BY k ORDER BY t ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)` — with a complete tie-breaker on the `ORDER BY` (per the deterministic-ordering rule above). *Why:* a bare `ORDER BY` defaults to a `RANGE`-based frame bounded at the current row, which on ties in the order key folds every tied peer into one cumulative value — it runs and looks plausible, but the running total jumps at each tie boundary. +- **Rolling window over calendar time, plus minimum periods.** "Rolling N days/months" spans a *calendar range*, not a fixed row count: a `ROWS BETWEEN n-1 PRECEDING` frame silently measures the wrong span when days are missing. 
Two sanctioned paths — (a) build a gap-free date spine first (the **Series** idiom from `sql_dialect_notes`) so one row exists per calendar unit, then a `ROWS BETWEEN n-1 PRECEDING AND CURRENT ROW` frame equals the intended span (fully portable); or (b) where the engine supports it, a native calendar range frame — or a date-keyed self-join — expresses the window directly: get the rolling-window idiom from `sql_dialect_notes`, do not inline it. For **minimum periods** ("only after N periods of data"), emit `NULL` until the window is full: guard on a `COUNT(*)` taken over the same window frame equalling N (an empty `OVER ()` counts every row in the result, not the rows in the frame), counting non-null observations instead when "N periods" means N data points rather than N calendar slots. *Why:* a row-count frame over missing dates measures the wrong span, and a partial early window is not the requested metric. +- **Period-over-period.** Compare against the prior period with `LAG(metric) OVER (PARTITION BY k ORDER BY period)`; compute growth as `(cur - prev) / prev` at full precision, rounding only in the final projection (per the round-at-the-end rule below), and guard the divide against a zero or absent prior — e.g. `… / NULLIF(prev, 0)`. *Why:* without `LAG`, or ordered against the wrong neighbor, the comparison lands on the wrong period, and an unguarded ratio errors or returns garbage when the prior period is zero or missing. + +```sql +-- "Each account's running balance over time" — a cumulative sum of net per +-- account, in date order. +-- WRONG: a bare ORDER BY defaults to a RANGE-based frame, so two txns dated the +-- same day share one inflated balance (every tied peer folds into that value). +SELECT account_id, txn_date, net, + SUM(net) OVER (PARTITION BY account_id ORDER BY txn_date) AS running_balance +FROM account_txns; + +-- RIGHT: an explicit ROWS frame accumulates row by row, and a complete tie-breaker +-- (txn_id) makes the order — and the running total — deterministic across ties. 
+SELECT account_id, txn_date, net, + SUM(net) OVER (PARTITION BY account_id ORDER BY txn_date, txn_id + ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_balance +FROM account_txns; +``` + +**Numeric precision** +- **Integer division truncates on postgres/sqlite/tsql.** The `/` operator between two integers does integer division on **postgres, sqlite, and SQL Server** — `5 / 2` is `2`, `wins / games` is `0` — so a rate, share, or `SUM(a) / COUNT(*)` silently floors to an integer. Cast one operand to a fractional type before dividing: `wins * 1.0 / games`, `CAST(wins AS REAL) / games`, or `SUM(a)::numeric / COUNT(*)`, then round at the end. mysql and bigquery already return a fractional result from `/` (on bigquery prefer `SAFE_DIVIDE` to also guard a zero denominator). +- **Round only at the end.** Compute at full precision and round in the final projection, never inside intermediate CTEs. Be explicit about truncation versus rounding: what an integer cast (`CAST(x AS INT)`) does to the fractional part is engine-dependent (sqlite and SQL Server truncate toward zero, while postgres, mysql, and bigquery round), so call the rounding or truncation function you mean instead of relying on the cast. +- **Macro vs micro average.** Match the average to the wording. "Average of per-group averages" is `AVG(group_metric)`; an "overall" or "weighted" average is `SUM(numerator) / SUM(denominator)`. The two diverge whenever group sizes differ. + +**Answer completeness / interpretation** +- **"Top / highest / most / lowest"** returns only the winning row(s) — keep the top-ranked row from the window result — not the full ranked list, unless the question asks for a list. +- **"For each X / per X / by X"** returns exactly one row per X. Do not collapse to a single value unless the question says "overall" or "total across X". +- **A named business measure means its amount, not a row count.** When a question asks for "sales", "revenue", "spend", "value", or "volume" of money/goods without an explicit "number / count of", aggregate the monetary/quantity **amount** (`SUM(price)` / `SUM(amount)`), not `COUNT(*)` of rows. 
*Why:* "toy sales" reads as sales revenue; counting order rows silently answers a different question. +- **Answer literally — do not add unrequested transformations.** Apply exactly the filters, joins, grouping, and computation the question (and any `external_knowledge` doc) states; do not add "helpful" extras the task never asked for — extra status/category predicates, area/residential *weighting* of an average the question states plainly, entity-name *normalization* that forces joins the source leaves unmatched, or a re-derived value where the question names a specific stored measure/column. When the wording bounds an **aggregate** ("committees whose *total* is between $0 and $200", "entities with 5+ orders"), filter the aggregate with `HAVING`, not each row with `WHERE`. When an `external_knowledge` doc gives an explicit formula or function/UDF definition, implement it **verbatim** — same operators, constants, and ordering — rather than substituting your own "more correct" math. *Why:* each unrequested predicate silently drops valid rows, each unrequested weighting/normalization or re-derivation changes the value, and a row-level filter for an aggregate bound answers a different question — so a more-sophisticated-looking query is wrong against the literal ask. Prefer the simplest reading that satisfies the question. +- **Don't project free-text columns the question didn't ask for.** A description/body/comment/notes column whose values contain commas or newlines corrupts the row-delimited output and is almost never the requested value — leave it out of the final projection unless the question explicitly asks for it. +- **"Inter-event duration / gap / interval" is the time between consecutive events, not a magnitude.** When the question asks the typical gap/interval/time *between* occurrences (releases, visits, orders), order rows by the event timestamp and take `LEAD`/`LAG` date differences, then aggregate — never a duration/length/runtime *column*. 
+- **Anchor a period bucket to the lifecycle event the wording names.** When a record carries several lifecycle timestamps (created/placed, approved, shipped, delivered, completed, settled) and the question counts/measures records in a *named completed state* by period ("delivered orders by month", "shipped items per week", "completed payments by day"), bucket the period by that named event's own timestamp (`order_delivered_customer_date`, `shipped_at`, `settled_at`) — the state value is the qualifying filter, the matching timestamp is the time anchor. Use the creation/placed/purchased/submitted timestamp only when the question names that *start* event (purchased, placed, created, ordered, submitted) or no matching event timestamp exists. If several timestamps fit, pick the one for the event as experienced by the question's subject (customer delivery = the customer-receipt date, not the carrier-handoff or estimated date). If the named state is used only as a non-temporal filter (counts by customer/city/seller with no period bucket), it is just a filter — introduce no date anchor. Confirm each timestamp's meaning from column names, semantic-layer descriptions, and sample rows first. *Why:* bucketing a completed-state count by the record's creation date silently answers a different question — "records that later reached that state, grouped by when they started" — than the one asked. +- **"Highest / most across several achievements" aggregates per metric over the whole history.** When a question asks for top values across multiple metrics or a career/lifetime total ("most runs, most wickets, longest span"), emit one row per metric with that metric summed/maxed over all the entity's records — not a single top-season or top-row snapshot. 
+- **An aggregate scoped to a per-entity selected set is computed across that set.** "The average revenue per actor **in those top-3 films**", "the mean order value over each customer's **last 5 orders**" means, per entity, the aggregate over the items it selected — one value per entity spanning its chosen items — NOT the per-item value. The per-item formula the question gives ("divide film revenue among its actors") computes each item's contribution; the average/total then spans the selected items. When the question states both a per-item computation AND an aggregate over the items, compute and project BOTH (the per-item value and the across-set aggregate, e.g. `AVG(item_value) OVER (PARTITION BY entity)`). The set is chosen by the ranking measure the question names — "top-N **revenue-generating** films" ranks each entity's items by the item's **own total revenue** — and that ranking is independent of the per-item value (the share), which feeds only the aggregate, never the top-N selection. +- **Coverage over a selected group is a set-membership aggregate (one value for the whole group), not a per-entity metric.** When a question first selects a group of entities ("the top 5 actors", "these products", "the eligible stores") and then asks what count/share/percentage of a **different** subject domain has any relationship to *these* selected entities ("what % of **customers** rented films featuring these actors"), the subject set is the **UNION across the whole group**: select the entity ids in a CTE, join to the subject facts, `COUNT(DISTINCT subject_id)` **once** across the group, and return one aggregate at the subject-domain grain (with the numerator/denominator projected if the question states a ratio). Counting the subject per selected entity and reporting N rows answers a different question and double-counts subjects that relate to more than one entity. 
This is the **collective-coverage** cousin of the per-entity rule above: emit one row per selected entity **only** when the wording says "for each / per / by / list" or asks for each entity's *own* metric ("top 5 players **and their** batting averages"); a bare "what share … of these" is one collective value. +- **Complete the panel for "each / every / all / per ".** These cues mean the answer's rows should be the *full expected domain* — every month in the asked range, every region in the dimension — not only the groups that happen to have fact rows; a plain inner `GROUP BY` emits only non-empty groups, so empty periods/categories silently drop and a "12 months" answer comes back short. Build the full set of groups (the **spine**), `LEFT JOIN` the aggregated facts onto it, then default the gaps: + - **Spine source.** For a category, take the distinct domain from the **dimension/entity table** (e.g. every region from `regions`) — not `SELECT DISTINCT` over the facts, which can only list categories that already occur; with no dimension table, distinct values from the *unfiltered* facts are the best available domain. For a period or number range, generate the series across the question's stated range (when the range is "all periods present", derive its bounds from `MIN`/`MAX` over the *unfiltered* facts). Series syntax is engine-specific — get the series/calendar idiom from `sql_dialect_notes` rather than inlining one dialect's generator. + - **Default by additivity.** `COALESCE(metric, 0)` only for **additive** measures (a `COUNT`/`SUM` of events or amounts, where "no activity" genuinely reads as 0); leave **non-additive** measures (`AVG`, a rate, a ratio, a price, a running balance) as `NULL` — absence is "no data", and 0 would be a wrong reading. + - **Don't over-apply.** *each / every / all* wants the complete domain; *which / that have* ("which months had orders") wants only the groups that exist — there the spine is wrong, so emit observed groups only. 
+ - **Selecting the extreme group needs the spine too.** When you pick the group with the highest/lowest count or total over a period/category domain ("the month with the **lowest** number of active customers", "the region with the **fewest** orders"), rank over the COMPLETE spine, not only groups that have fact rows — an empty period/category is a genuine 0 and is frequently the true minimum, yet ranking over observed groups alone silently makes it unselectable and returns the wrong extreme. A period with NO rows at all never appears in a `GROUP BY` of the facts: generate the full calendar of the stated range first ("each month of 2020" → all 12 months, even if only 4 have transactions), `LEFT JOIN` the per-group aggregates, `COALESCE` the count to 0, and only THEN rank — otherwise a zero-activity month that is the true lowest is invisible to the ranking. +- **Answer every requested output.** When a question asks for several things — a list ("A, B, and C"), paired extremes ("the highest *and* the lowest"), or a value plus its components ("X, Y, and their ratio") — the projection needs one column per requested output, not just the first clause. *Why:* answering only the first clause is the most common way a runnable query is still wrong — the grain and methodology can be perfect yet the answer is short by columns. This is the umbrella over the next two rules: *keep the inputs* is its "value + components" case and *expose identity* is its "entity identity" case, so a **complete projection** is exactly every requested metric/attribute, plus the identifier of each named entity, plus the inputs to each derived value, at the question's grain. It governs *which columns* appear — distinct from *Top …* and *For each X* above, which govern *which rows* — and composes with them ("highest and lowest per region" needs one row per region and a column per clause). 
+- **Keep the inputs to a derived value.** When the question asks for inputs and something derived from them ("X, Y, and their ratio"), project the inputs as columns alongside the derived value. +- **A comparison BETWEEN two specific extremes is one wide row.** When the question asks for a single value derived by comparing two named extremes — "the **difference between** the highest and the lowest month", "the ratio of the best to the worst" — present BOTH extremes side by side in ONE row: each extreme's attributes as their own columns (e.g. `highest_month`, `highest_value`, `lowest_month`, `lowest_value`) plus the comparison as a column (`difference`). The comparison is a single fact about the pair, so the answer is one wide row — NOT one row per extreme with the comparison repeated. (Contrast: "report a metric **for each** group/category" — e.g. "a percentage for each helmet group", "the top player for each outcome" — has no cross-item comparison and stays long, one row per group.) +- **Project BOTH identity and label.** When the result is per-entity, project the entity's **identifier and its human-readable name together** — whichever you grouped by, add the other. The id disambiguates duplicate names, and a consumer may legitimately expect either; supplying both is the safe, complete choice (a per-entity answer that gives only one is a frequent cause of an otherwise-correct result not matching). +- **Diagnose empty results.** When a result is unexpectedly empty, relax filters one at a time to find which predicate removed the rows instead of guessing. +- **Spatial predicates ("within area / within N meters / inside this polygon / nearest").** When a question filters or relates rows by geography, use the engine's geospatial functions — get the exact ones from `sql_dialect_notes` — rather than hand-rolling latitude/longitude `BETWEEN` boxes (which are wrong off the equator and ignore polygon shape). 
Recipe: (1) turn each location into a geography point with the point constructor — **mind argument order, most take longitude before latitude**; (2) for an area of interest build a polygon from its boundary/corner coordinates, closing the ring (first point repeated last); (3) test the relation with the engine's containment (`contains`/`within`), proximity (`dwithin(g1,g2,meters)`), or overlap (`intersects`) predicate. For "the features within the same area as entity X", first resolve X's own geometry in a CTE, then join candidates on the spatial predicate against it. *Why:* spatial relationships are not axis-aligned ranges; the geodesic predicates are both correct and index-assisted, while a raw coordinate box silently includes/excludes the wrong rows. +- **Collapse a multi-valued attribute to one representative per entity before counting classes or a concentration metric.** When an entity carries a multi-valued classification array (IPC/CPC codes, tags, categories) and the methodology counts *entities per class* or computes a concentration/diversity measure (HHI, originality, a share), pick exactly **one representative value per entity** in a CTE first — use the array's `main`/`primary`/`first` flag when present, else a defined fallback (e.g. the most-frequent value) — then aggregate. Equally, when a metric's denominator is defined as a count of **entities** ("the number of patents cited"), use `COUNT(DISTINCT entity)`, not the count of exploded array rows. *Why:* `LATERAL FLATTEN`/unnest of the array multiplies an entity's weight by how many codes it has, inflating per-class frequencies and skewing any concentration metric — the query runs but the ranking/score is wrong. (Take the representative rule from the methodology/`external_knowledge` doc when it specifies one; do not invent a selection the source does not state.) 
+- **Final completeness check.** Before emitting the final SQL, re-read the question and confirm the projection covers: (1) every named **metric / attribute** asked for (→ *answer every requested output*); (2) the **identifier** of each grouped or named entity (→ *expose identity*); (3) every **input** to each derived value (→ *keep the inputs*); (4) all at the **grain** the question specifies (→ *for each X* / *complete the panel*). Run this on every query, not only when a result looks off. **Don't over-project:** anything outside that set — a column the question never asked for, added "to be safe" — adds noise, misleads the reader into thinking it matters, and makes the result harder to consume. Match the request exactly: neither short nor padded. + +```sql +-- "How many orders per region, including regions with no orders?" — every region +-- must appear, even one with zero orders. +-- WRONG: grouping the facts can only emit regions that have at least one order, +-- so a zero-order region silently drops and the panel comes back short a row. +SELECT region_id, COUNT(*) AS n_orders +FROM orders +GROUP BY region_id; + +-- RIGHT: start from the full region domain (the dimension table), LEFT JOIN the +-- per-region counts onto it, and COALESCE the additive count to 0 so empty +-- regions read 0 instead of vanishing. +WITH region_domain AS ( + SELECT DISTINCT region_id FROM regions +), +region_orders AS ( + SELECT region_id, COUNT(*) AS n_orders + FROM orders + GROUP BY region_id +) +SELECT d.region_id, COALESCE(ro.n_orders, 0) AS n_orders +FROM region_domain d +LEFT JOIN region_orders ro ON ro.region_id = d.region_id; +``` + +```sql +-- "For each region, report the highest and the lowest monthly order count and the +-- difference between them." A complete answer is five columns: the region's id and +-- name, the highest, the lowest, and their difference. 
+-- WRONG: answers only the first clause and drops the region id, the lowest, and the +-- difference: three of the five requested columns are missing. +SELECT region_name, MAX(monthly_orders) AS highest +FROM region_monthly +GROUP BY region_name; + +-- RIGHT: one column per requested output plus the entity's identity, at the region +-- grain — id and name, the highest, the lowest, and their difference. +SELECT r.region_id, r.region_name, + MAX(rm.monthly_orders) AS highest, + MIN(rm.monthly_orders) AS lowest, + MAX(rm.monthly_orders) - MIN(rm.monthly_orders) AS order_count_range +FROM regions r +JOIN region_monthly rm ON rm.region_id = r.region_id +GROUP BY r.region_id, r.region_name; +``` + + **Input:** "How many orders did Acme Corp place last month?" diff --git a/packages/cli/src/skills/wiki_capture/SKILL.md b/packages/cli/src/skills/wiki_capture/SKILL.md index bf006dab..ff29be38 100644 --- a/packages/cli/src/skills/wiki_capture/SKILL.md +++ b/packages/cli/src/skills/wiki_capture/SKILL.md @@ -112,6 +112,30 @@ All three fields use REPLACE semantics on update: - Pass `[]` → field is cleared. - Pass `[values]` → replaces existing with exactly those values (no merging). +## Connection scoping + +A project may have several databases whose schemas reuse the same concept names +(two warehouses each with `orders`, `customers`, …). The `connections` +frontmatter field keeps database-specific pages from polluting searches about +other databases. + +- The `wiki_write` tool accepts a `connections` field (list of connection ids, + same REPLACE semantics as `tags`). Absent or empty ⇒ the page is **unscoped** + and applies to every connection. +- When this ingest/turn is scoped to a connection (its id appears in the prompt + context — e.g. `connectionId: warehouse` in the SL Sources header or the + `` block), set `connections` to a list containing that id on pages whose content is + **specific to that database** ("in this warehouse `user_id` is the device id, + not the account id"). 
Pair this with a connection-distinctive key so two + databases' same-concept pages can coexist: `orders_sales_db`, not `orders`. +- Leave `connections` empty for clearly **org-wide** knowledge ("fiscal year + starts in February") so it stays visible everywhere. Do not scope a page to a + connection just because the turn happened to be connection-scoped. +- Keys are still a flat, global namespace; `connections` does not namespace + them. A connection-scoped write whose key already belongs to a page scoped to + a *different* connection is rejected to prevent silently overwriting it — pick + a connection-distinctive key instead. + ## Editing existing pages Two modes: diff --git a/packages/cli/src/status-project.ts b/packages/cli/src/status-project.ts index c3faa63d..6d52518b 100644 --- a/packages/cli/src/status-project.ts +++ b/packages/cli/src/status-project.ts @@ -9,6 +9,7 @@ import { runCodexAuthProbe } from './context/llm/codex-runtime.js'; import type { KtxConfigIssue, KtxProjectConfig, KtxProjectConnectionConfig, KtxProjectEmbeddingConfig, KtxProjectLlmConfig } from './context/project/config.js'; import type { KtxLocalProject } from './context/project/project.js'; import { ktxLocalStateDbPath } from './context/project/local-state-db.js'; +import { listReferencedConnectionIds } from './context/wiki/local-knowledge.js'; import { isQueryHistoryEnabled, queryHistoryDialectForConnection, @@ -109,6 +110,7 @@ interface LocalStatsIngestPerConnection { connectionId: string; adapter: string; lastCompletedAt: string; + skippedObjects: Array<{ name: string; reason: string }>; } interface LocalStatsSemanticLayerEntry { @@ -581,6 +583,29 @@ function buildStorageStatus(config: KtxProjectConfig): StorageStatus { }; } +/** + * Warn (never fail) when stored wiki pages reference connection ids that are no + * longer in `ktx.yaml`. 
Config and page content evolve independently, so a + * dangling reference is a soft condition — the pages still load, search, and + * read; it just signals a typo or a removed connection. + */ +async function buildUnknownConnectionWarning(project: KtxLocalProject): Promise { + let referenced: string[]; + try { + referenced = await listReferencedConnectionIds(project); + } catch { + return null; + } + const unknown = referenced.filter((id) => !Object.hasOwn(project.config.connections, id)); + if (unknown.length === 0) { + return null; + } + return { + message: `Wiki pages reference connection id(s) not in ktx.yaml: ${unknown.join(', ')}. Those pages still load and search.`, + fix: 'Add the connection(s) via `ktx setup`, or update the pages’ `connections` frontmatter.', + }; +} + function buildWarnings( config: KtxProjectConfig, connections: ConnectionStatus[], @@ -782,6 +807,20 @@ function tryQuery(run: () => T, fallback: T): T { } } +function skippedObjectsFromReportBody(bodyJson: string): Array<{ name: string; reason: string }> { + try { + const body = JSON.parse(bodyJson) as { fetch?: { skipped?: Array<{ entityId?: unknown; message?: unknown }> } }; + const skipped = body.fetch?.skipped; + if (!Array.isArray(skipped)) return []; + return skipped.map((issue) => ({ + name: typeof issue.entityId === 'string' && issue.entityId.length > 0 ? issue.entityId : 'object', + reason: typeof issue.message === 'string' ? issue.message : 'introspection failed', + })); + } catch { + return []; + } +} + /** @internal */ export async function buildLocalStatsStatus(project: KtxLocalProject): Promise { const dbPath = ktxLocalStateDbPath(project); @@ -819,17 +858,19 @@ export async function buildLocalStatsStatus(project: KtxLocalProject): Promise + // SQLite returns body_json from the MAX(completed_at) row for each group. 
db .prepare( - `SELECT connection_id, adapter, MAX(completed_at) AS last_completed_at + `SELECT connection_id, adapter, MAX(completed_at) AS last_completed_at, body_json FROM local_ingest_reports WHERE status = 'done' GROUP BY connection_id, adapter`, ) - .all() as Array<{ connection_id: string; adapter: string; last_completed_at: string }>, - [] as Array<{ connection_id: string; adapter: string; last_completed_at: string }>, + .all() as IngestStatsRow[], + [] as IngestStatsRow[], ); const perConnectionMap = new Map(); for (const row of ingestRows) { @@ -839,6 +880,7 @@ export async function buildLocalStatsStatus(project: KtxLocalProject): Promise 0) { + const first = entry.skippedObjects[0]!; + const extra = entry.skippedObjects.length - 1; + const detail = `${first.name}: ${first.reason}${extra > 0 ? ` (+${extra} more)` : ''}`; + lines.push( + ` ${' '.repeat(nameWidth)} ${dim(`${entry.skippedObjects.length} object${entry.skippedObjects.length === 1 ? '' : 's'} skipped — ${detail}`)}`, + ); + } } } diff --git a/packages/cli/src/text-ingest.ts b/packages/cli/src/text-ingest.ts index edb59923..bc711e30 100644 --- a/packages/cli/src/text-ingest.ts +++ b/packages/cli/src/text-ingest.ts @@ -8,6 +8,13 @@ import type { KtxCliIo } from './cli-runtime.js'; import { createRepainter, initViewState, renderContextBuildView, type ContextBuildTargetState } from './context-build-view.js'; import { formatDuration } from './demo-metrics.js'; import type { KtxPublicIngestPlanTarget } from './public-ingest.js'; +import { + createLocalProjectVerbatimIngestor, + type VerbatimIngestItem, + type VerbatimIngestOrigin, + type VerbatimIngestorPort, + type VerbatimIngestResult, +} from './verbatim-ingest.js'; export interface KtxTextIngestArgs { projectDir: string; @@ -17,6 +24,8 @@ export interface KtxTextIngestArgs { userId: string; json: boolean; failFast: boolean; + /** Code-driven verbatim ingest: store the document body unchanged, LLM derives metadata only. 
*/ + verbatim?: boolean; } /** @internal */ @@ -29,6 +38,7 @@ export interface TextMemoryIngestPort { interface TextIngestItem { label: string; content: string; + origin: VerbatimIngestOrigin; } interface TextIngestResult { @@ -43,6 +53,7 @@ interface TextIngestResult { export interface KtxTextIngestDeps { loadProject?: (options: { projectDir: string }) => Promise; createMemoryIngest?: (project: KtxLocalProject) => TextMemoryIngestPort; + createVerbatimIngestor?: (project: KtxLocalProject) => VerbatimIngestorPort; readFile?: (path: string) => Promise; readStdin?: () => Promise; now?: () => number; @@ -55,6 +66,10 @@ function defaultCreateMemoryIngest(project: KtxLocalProject): TextMemoryIngestPo return createLocalProjectMemoryIngest(project); } +function defaultCreateVerbatimIngestor(project: KtxLocalProject): VerbatimIngestorPort { + return createLocalProjectVerbatimIngestor(project); +} + async function defaultReadStdin(): Promise { const chunks: string[] = []; process.stdin.setEncoding('utf-8'); @@ -129,17 +144,17 @@ async function loadItems(args: KtxTextIngestArgs, deps: KtxTextIngestDeps): Prom args.texts.forEach((content, index) => { const label = textLabel(content, index, usedTextLabels); usedTextLabels.add(label); - items.push({ label, content }); + items.push({ label, content, origin: { kind: 'text' } }); }); const readFile = deps.readFile ?? defaultReadFile; const readStdin = deps.readStdin ?? 
defaultReadStdin; for (const file of args.files) { if (file === '-') { - items.push({ label: stdinLabel(items), content: await readStdin() }); + items.push({ label: stdinLabel(items), content: await readStdin(), origin: { kind: 'stdin' } }); } else { const path = resolve(file); - items.push({ label: basename(path), content: await readFile(path) }); + items.push({ label: basename(path), content: await readFile(path), origin: { kind: 'file', path } }); } } @@ -175,13 +190,13 @@ function allTargets(state: ReturnType): ContextBuildTarget return [...state.primarySources, ...state.contextSources]; } -function renderTextIngestView(state: ReturnType, styled: boolean): string { +function renderTextIngestView(state: ReturnType, styled: boolean, verbatim: boolean): string { return renderContextBuildView(state, { styled, - title: 'Ingesting text memory', - contextGroupLabel: 'Texts', - sourceIngestRunningText: 'capturing...', - completedItemName: { singular: 'text', plural: 'texts' }, + title: verbatim ? 'Writing verbatim pages' : 'Ingesting text memory', + contextGroupLabel: verbatim ? 'Documents' : 'Texts', + sourceIngestRunningText: verbatim ? 'writing...' : 'capturing...', + completedItemName: verbatim ? { singular: 'page', plural: 'pages' } : { singular: 'text', plural: 'texts' }, }); } @@ -254,7 +269,9 @@ export async function runKtxTextIngest( } const project = await (deps.loadProject ?? loadKtxProject)({ projectDir: args.projectDir }); - const memoryIngest = (deps.createMemoryIngest ?? defaultCreateMemoryIngest)(project); + const isVerbatim = args.verbatim === true; + const verbatimIngestor = isVerbatim ? (deps.createVerbatimIngestor ?? defaultCreateVerbatimIngestor)(project) : null; + const memoryIngest = isVerbatim ? null : (deps.createMemoryIngest ?? defaultCreateMemoryIngest)(project); const now = deps.now ?? 
(() => Date.now()); const batchId = now(); const state = initViewState(items.map((item) => makeTarget(item.label))); @@ -264,7 +281,7 @@ export async function runKtxTextIngest( const results: TextIngestResult[] = []; state.startedAt = now(); - const paint = () => repainter?.paint(renderTextIngestView(state, true)); + const paint = () => repainter?.paint(renderTextIngestView(state, true, isVerbatim)); paint(); let spinnerInterval: ReturnType | null = null; @@ -288,29 +305,50 @@ export async function runKtxTextIngest( const target = targets[index]!; target.status = 'running'; target.startedAt = now(); - target.detailLine = 'capturing...'; + target.detailLine = isVerbatim ? 'writing...' : 'capturing...'; target.progressUpdatedAtMs = target.startedAt; paint(); let runId: string | null = null; let result: TextIngestResult; try { - const ingestInput: MemoryAgentInput = { - userId: args.userId, - chatId: `cli-text-ingest-${batchId}-${index + 1}`, - userMessage: `Ingest external text artifact ${artifactReference(item.label)} into ktx memory.`, - assistantMessage: item.content.trim(), - ...(args.connectionId ? { connectionId: args.connectionId } : {}), - sourceType: 'external_ingest', - }; - const ingest = await memoryIngest.ingest(ingestInput); - runId = ingest.runId; - await memoryIngest.waitForRun(runId); - const status = await memoryIngest.status(runId); - if (!status) { - throw new Error(`Memory ingest run "${runId}" was not found.`); + if (verbatimIngestor) { + const verbatimItem: VerbatimIngestItem = { + origin: item.origin, + content: item.content, + ...(args.connectionId ? 
{ connectionId: args.connectionId } : {}), + }; + const outcome: VerbatimIngestResult = await verbatimIngestor.ingest(verbatimItem); + result = { + label: item.label, + runId: null, + status: 'done', + captured: { wiki: [outcome.pageKey], sl: [], xrefs: [] }, + commitHash: outcome.commitHash, + error: null, + }; + } else { + // memoryIngest is set whenever verbatim is off — they are mutually exclusive. + if (!memoryIngest) { + throw new Error('Memory ingest was not initialized.'); + } + const ingestInput: MemoryAgentInput = { + userId: args.userId, + chatId: `cli-text-ingest-${batchId}-${index + 1}`, + userMessage: `Ingest external text artifact ${artifactReference(item.label)} into ktx memory.`, + assistantMessage: item.content.trim(), + ...(args.connectionId ? { connectionId: args.connectionId } : {}), + sourceType: 'external_ingest', + }; + const ingest = await memoryIngest.ingest(ingestInput); + runId = ingest.runId; + await memoryIngest.waitForRun(runId); + const status = await memoryIngest.status(runId); + if (!status) { + throw new Error(`Memory ingest run "${runId}" was not found.`); + } + result = resultFromStatus(item.label, status); } - result = resultFromStatus(item.label, status); } catch (error) { result = errorResult(item.label, runId, error); } @@ -340,17 +378,18 @@ export async function runKtxTextIngest( if (args.json) { writeJsonResult(args, results, io); } else if (repainter) { - repainter.paint(renderTextIngestView(state, true)); + repainter.paint(renderTextIngestView(state, true, isVerbatim)); writePlainFailures(results, io); } else { - io.stdout.write(renderTextIngestView(state, false)); + io.stdout.write(renderTextIngestView(state, false, isVerbatim)); writePlainFailures(results, io); } if (!args.json && results.length > 0) { const duration = state.totalElapsedMs > 0 ? ` in ${formatDuration(state.totalElapsedMs)}` : ''; const outcome = results.some((result) => result.status === 'error') ? 
'finished with failures' : 'finished'; - io.stdout.write(`Text memory ingest ${outcome}${duration}.\n`); + const label = isVerbatim ? 'Verbatim ingest' : 'Text memory ingest'; + io.stdout.write(`${label} ${outcome}${duration}.\n`); } return results.some((result) => result.status === 'error') ? 1 : 0; diff --git a/packages/cli/src/verbatim-ingest.ts b/packages/cli/src/verbatim-ingest.ts new file mode 100644 index 00000000..e069d3e8 --- /dev/null +++ b/packages/cli/src/verbatim-ingest.ts @@ -0,0 +1,308 @@ +import { basename, extname, join } from 'node:path'; +import YAML from 'yaml'; +import { z } from 'zod'; +import { noopLogger } from './context/core/config.js'; +import { assertConfiguredConnectionId } from './context/connections/configured-connections.js'; +import { KtxIngestEmbeddingPortAdapter } from './context/llm/embedding-port.js'; +import { createLocalKtxEmbeddingProviderFromConfig, createLocalKtxLlmRuntimeFromConfig } from './context/llm/local-config.js'; +import type { KtxLlmRuntimePort } from './context/llm/runtime-port.js'; +import type { KtxProjectConnectionConfig } from './context/project/config.js'; +import type { KtxLocalProject } from './context/project/project.js'; +import { KnowledgeWikiService } from './context/wiki/knowledge-wiki.service.js'; +import { suggestFlatWikiKey } from './context/wiki/keys.js'; +import { SqliteKnowledgeIndex } from './context/wiki/sqlite-knowledge-index.js'; +import type { WikiFrontmatter } from './context/wiki/types.js'; +import type { KtxEmbeddingProvider } from './llm/types.js'; + +const LOCAL_AUTHOR = 'ktx'; +const LOCAL_AUTHOR_EMAIL = 'ktx@example.com'; + +/** Only the prefix sent to the LLM for metadata is clipped; the stored body is never clipped. 
*/ +const METADATA_CLIP_LENGTH = 48_000; + +const VERBATIM_METADATA_SYSTEM_PROMPT = [ + 'You generate search metadata for an authoritative document that ktx stores verbatim.', + 'You never rewrite, summarize into, or alter the document body — you only describe it.', + 'Return a concise one- or two-sentence summary, a few topical tags, and any semantic-layer', + 'source names the document is clearly about. Use empty arrays when none apply.', +].join(' '); + +const verbatimMetadataSchema = z.object({ + summary: z.string().min(1).describe('A one- or two-sentence description of what the document defines or specifies.'), + tags: z.array(z.string()).default([]).describe('Short topical keywords that aid lexical and semantic recall.'), + sl_refs: z + .array(z.string()) + .default([]) + .describe('Semantic-layer source names the document is clearly about, if any are evident.'), +}); + +type VerbatimMetadata = z.infer<typeof verbatimMetadataSchema>; + +export interface VerbatimIngestOrigin { + kind: 'file' | 'text' | 'stdin'; + /** Present only for `kind: 'file'`; the resolved path the key basename is derived from. */ + path?: string; +} + +const DEGRADED_SUMMARY_MAX_LENGTH = 200; +const FRONTMATTER_PATTERN = /^---\n([\s\S]*?)\n---\n?([\s\S]*)$/; +const HEADING_PATTERN = /^#{1,6}\s+(.+?)\s*#*\s*$/; + +type UsageMode = WikiFrontmatter['usage_mode']; + +function isUsageMode(value: unknown): value is UsageMode { + return value === 'always' || value === 'auto' || value === 'never'; +} + +function nonEmptyString(value: unknown): string | undefined { + return typeof value === 'string' && value.trim().length > 0 ? value : undefined; +} + +function stringArray(value: unknown): string[] { + return Array.isArray(value) ? value.filter((item): item is string => typeof item === 'string') : []; +} + +/** `connections` accepts a single id or a list in YAML; normalize either to a string list. */ +function stringList(value: unknown): string[] { + if (typeof value === 'string') { + return value.trim().length > 0 ? 
[value] : []; + } + return stringArray(value); +} + +function leadingHeadingText(body: string): string | null { + const firstLine = body.trimStart().split('\n', 1)[0] ?? ''; + const match = firstLine.match(HEADING_PATTERN); + return match ? match[1].trim() : null; +} + +/** @internal */ +export function splitInputDocument(raw: string): { frontmatter: Record; body: string } { + const match = raw.match(FRONTMATTER_PATTERN); + if (!match) { + return { frontmatter: {}, body: raw.trim() }; + } + const parsed = YAML.parse(match[1]) as unknown; + const frontmatter = + parsed !== null && typeof parsed === 'object' && !Array.isArray(parsed) + ? (parsed as Record) + : {}; + return { frontmatter, body: match[2].trim() }; +} + +/** @internal */ +export function deriveVerbatimPageKey(origin: VerbatimIngestOrigin, body: string): string { + if (origin.kind === 'file' && origin.path) { + return suggestFlatWikiKey(basename(origin.path, extname(origin.path))); + } + const heading = leadingHeadingText(body); + if (!heading) { + throw new Error( + 'Verbatim inline text needs a leading Markdown heading to derive a stable page key. Add a "# Heading" line, or pass the content as --file .', + ); + } + return suggestFlatWikiKey(heading); +} + +/** @internal */ +export function deriveDegradedSummary(body: string): string { + const heading = leadingHeadingText(body); + if (heading) { + return heading; + } + const text = body.trim(); + const sentence = text.match(/^([\s\S]*?[.!?])(\s|$)/); + const summary = sentence ? 
sentence[1].trim() : text; + if (summary.length <= DEGRADED_SUMMARY_MAX_LENGTH) { + return summary; + } + return `${summary.slice(0, DEGRADED_SUMMARY_MAX_LENGTH).trimEnd()}…`; +} + +/** @internal */ +export function buildVerbatimFrontmatter(input: { + inputFrontmatter: Record; + summary: string; + tags: string[]; + slRefs: string[]; + connectionId?: string; +}): WikiFrontmatter & Record { + const { inputFrontmatter } = input; + + const inputConnections = stringList(inputFrontmatter.connections); + const flagConnections = input.connectionId ? [input.connectionId] : []; + if ( + inputConnections.length > 0 && + flagConnections.length > 0 && + !connectionSetsEqual(inputConnections, flagConnections) + ) { + throw new Error( + `Connection scope conflict: frontmatter declares connections [${inputConnections.join( + ', ', + )}] but --connection-id is "${input.connectionId}". Remove one so the intent is unambiguous.`, + ); + } + const connections = inputConnections.length > 0 ? inputConnections : flagConnections; + + const summary = nonEmptyString(inputFrontmatter.summary) ?? input.summary; + const usageMode = isUsageMode(inputFrontmatter.usage_mode) ? inputFrontmatter.usage_mode : 'auto'; + const tags = inputFrontmatter.tags !== undefined ? stringArray(inputFrontmatter.tags) : input.tags; + const slRefs = inputFrontmatter.sl_refs !== undefined ? stringArray(inputFrontmatter.sl_refs) : input.slRefs; + + const passthrough = Object.fromEntries( + Object.entries(inputFrontmatter).filter( + ([key]) => !['summary', 'usage_mode', 'tags', 'sl_refs', 'connections'].includes(key), + ), + ); + + return { + ...passthrough, + summary, + usage_mode: usageMode, + ...(tags.length > 0 ? { tags } : {}), + ...(slRefs.length > 0 ? { sl_refs: slRefs } : {}), + ...(connections.length > 0 ? 
{ connections } : {}), + } satisfies WikiFrontmatter & Record; +} + +function connectionSetsEqual(left: string[], right: string[]): boolean { + if (left.length !== right.length) { + return false; + } + const rightSet = new Set(right); + return left.every((id) => rightSet.has(id)); +} + +export interface VerbatimIngestItem { + origin: VerbatimIngestOrigin; + content: string; + connectionId?: string; +} + +export interface VerbatimIngestResult { + pageKey: string; + outcome: 'written' | 'unchanged'; + connections: string[]; + commitHash: string | null; +} + +export interface VerbatimIngestorPort { + ingest(item: VerbatimIngestItem): Promise<VerbatimIngestResult>; +} + +export interface CreateLocalProjectVerbatimIngestorDeps { + /** `undefined` ⇒ resolve from project config; `null` ⇒ force degraded (offline) metadata. */ + llmRuntime?: KtxLlmRuntimePort | null; + embeddingProvider?: KtxEmbeddingProvider | null; +} + +class LocalVerbatimIngestor implements VerbatimIngestorPort { + constructor( + private readonly deps: { + wikiService: KnowledgeWikiService; + llmRuntime: KtxLlmRuntimePort | null; + configuredConnections: Record<string, KtxProjectConnectionConfig>; + author: string; + authorEmail: string; + }, + ) {} + + async ingest(item: VerbatimIngestItem): Promise<VerbatimIngestResult> { + if (item.connectionId) { + assertConfiguredConnectionId(this.deps.configuredConnections, item.connectionId); + } + + const { frontmatter: inputFrontmatter, body } = splitInputDocument(item.content); + const pageKey = deriveVerbatimPageKey(item.origin, body); + + const generated = await this.resolveMetadata(inputFrontmatter, body); + const frontmatter = buildVerbatimFrontmatter({ + inputFrontmatter, + summary: generated.summary, + tags: generated.tags, + slRefs: generated.slRefs, + ...(item.connectionId ? { connectionId: item.connectionId } : {}), + }); + const connections = Array.isArray(frontmatter.connections) ? 
frontmatter.connections : []; + + const existing = await this.deps.wikiService.readPage('GLOBAL', null, pageKey); + if (existing) { + if (existing.content === body) { + return { pageKey, outcome: 'unchanged', connections, commitHash: null }; + } + throw new Error( + `A different page already exists at key "${pageKey}". Re-run with a distinct document name or key, ` + + 'or remove the existing page first — verbatim ingest never overwrites a conflicting page.', + ); + } + + const writeResult = await this.deps.wikiService.writePageAndSync( + 'GLOBAL', + null, + pageKey, + frontmatter, + body, + this.deps.author, + this.deps.authorEmail, + `Ingest verbatim document: ${pageKey}`, + ); + + return { pageKey, outcome: 'written', connections, commitHash: writeResult.commitHash ?? null }; + } + + /** + * Generated metadata is only used to gap-fill absent frontmatter fields, so the LLM is + * skipped entirely when summary, tags, and sl_refs are all explicit. A configured backend + * that fails surfaces the error (the item fails); degraded derivation is reserved for + * `backend: none`, never used as a silent fallback that would poison the idempotency check. + */ + private async resolveMetadata( + inputFrontmatter: Record, + body: string, + ): Promise<{ summary: string; tags: string[]; slRefs: string[] }> { + const needsGeneration = + nonEmptyString(inputFrontmatter.summary) === undefined || + inputFrontmatter.tags === undefined || + inputFrontmatter.sl_refs === undefined; + + if (this.deps.llmRuntime && needsGeneration) { + const clipped = body.length > METADATA_CLIP_LENGTH ? 
body.slice(0, METADATA_CLIP_LENGTH) : body; + const generated = await this.deps.llmRuntime.generateObject({ + role: 'triage', + system: VERBATIM_METADATA_SYSTEM_PROMPT, + prompt: clipped, + schema: verbatimMetadataSchema, + }); + return { summary: generated.summary, tags: generated.tags, slRefs: generated.sl_refs }; + } + + return { summary: deriveDegradedSummary(body), tags: [], slRefs: [] }; + } +} + +export function createLocalProjectVerbatimIngestor( + project: KtxLocalProject, + deps: CreateLocalProjectVerbatimIngestorDeps = {}, +): VerbatimIngestorPort { + const llmRuntime = + deps.llmRuntime !== undefined + ? deps.llmRuntime + : createLocalKtxLlmRuntimeFromConfig(project.config.llm, { projectDir: project.projectDir }); + + const embeddingProvider = + deps.embeddingProvider !== undefined + ? deps.embeddingProvider + : createLocalKtxEmbeddingProviderFromConfig(project.config.ingest.embeddings, { projectDir: project.projectDir }); + const embeddingPort = embeddingProvider ? new KtxIngestEmbeddingPortAdapter(embeddingProvider) : null; + + const knowledgeIndex = new SqliteKnowledgeIndex({ dbPath: join(project.projectDir, '.ktx', 'db.sqlite') }); + const wikiService = new KnowledgeWikiService(project.fileStore, embeddingPort, knowledgeIndex, project.git, noopLogger); + + return new LocalVerbatimIngestor({ + wikiService, + llmRuntime, + configuredConnections: project.config.connections, + author: LOCAL_AUTHOR, + authorEmail: LOCAL_AUTHOR_EMAIL, + }); +} diff --git a/packages/cli/test/commands/ingest-commands.test.ts b/packages/cli/test/commands/ingest-commands.test.ts new file mode 100644 index 00000000..752dcfb2 --- /dev/null +++ b/packages/cli/test/commands/ingest-commands.test.ts @@ -0,0 +1,117 @@ +import { Command } from '@commander-js/extra-typings'; +import { describe, expect, it, vi } from 'vitest'; +import type { KtxCliCommandContext } from '../../src/cli-program.js'; +import { parseEnrichmentStagesOption, registerIngestCommands } from 
'../../src/commands/ingest-commands.js'; + +function makeContext(overrides: Partial = {}): KtxCliCommandContext { + let exitCode = 0; + return { + io: { + stdout: { write: vi.fn() }, + stderr: { write: vi.fn() }, + }, + deps: {}, + packageInfo: { name: '@kaelio/ktx', version: '0.0.0-test' }, + setExitCode: (code: number) => { + exitCode = code; + }, + runInit: vi.fn(), + writeDebug: vi.fn(), + ...overrides, + get exitCode() { + return exitCode; + }, + } as unknown as KtxCliCommandContext; +} + +function ingestProgram(context: KtxCliCommandContext): Command { + const program = new Command().exitOverride().option('--project-dir '); + registerIngestCommands(program, context, { runTextIngest: vi.fn(async () => 0) }); + return program; +} + +describe('parseEnrichmentStagesOption', () => { + it('parses a single stage', () => { + expect(parseEnrichmentStagesOption('relationships')).toEqual(['relationships']); + }); + + it('orders and de-duplicates by the canonical registry order', () => { + expect(parseEnrichmentStagesOption('embeddings,descriptions')).toEqual(['descriptions', 'embeddings']); + expect(parseEnrichmentStagesOption('relationships,relationships,descriptions')).toEqual([ + 'descriptions', + 'relationships', + ]); + }); + + it('tolerates surrounding whitespace and empty segments', () => { + expect(parseEnrichmentStagesOption(' descriptions , , embeddings ')).toEqual(['descriptions', 'embeddings']); + }); + + it('rejects an empty list', () => { + expect(() => parseEnrichmentStagesOption('')).toThrow(/non-empty/); + expect(() => parseEnrichmentStagesOption(' , ')).toThrow(/non-empty/); + }); + + it('rejects an unknown stage name', () => { + expect(() => parseEnrichmentStagesOption('foo')).toThrow(/unknown stage "foo"/); + expect(() => parseEnrichmentStagesOption('descriptions,foo')).toThrow(/unknown stage "foo"/); + }); +}); + +describe('ktx ingest --stages', () => { + it('threads a parsed stage set into the public ingest args', async () => { + const publicIngest 
= vi.fn(async (_args: unknown) => 0); + const context = makeContext({ deps: { publicIngest } }); + const program = ingestProgram(context); + + await program.parseAsync( + ['--project-dir', '/tmp/ktx', 'ingest', 'warehouse', '--stages', 'descriptions,embeddings'], + { from: 'user' }, + ); + + expect(publicIngest).toHaveBeenCalledTimes(1); + expect(publicIngest.mock.calls[0]?.[0]).toMatchObject({ + command: 'run', + targetConnectionId: 'warehouse', + stages: ['descriptions', 'embeddings'], + }); + }); + + it('omits stages entirely when the flag is absent (default = all)', async () => { + const publicIngest = vi.fn(async (_args: unknown) => 0); + const context = makeContext({ deps: { publicIngest } }); + const program = ingestProgram(context); + + await program.parseAsync(['--project-dir', '/tmp/ktx', 'ingest', 'warehouse'], { from: 'user' }); + + expect(publicIngest).toHaveBeenCalledTimes(1); + expect(publicIngest.mock.calls[0]?.[0]).not.toHaveProperty('stages'); + }); + + it('rejects an unknown stage with a clear parse error', async () => { + const publicIngest = vi.fn(async (_args: unknown) => 0); + const context = makeContext({ deps: { publicIngest } }); + const program = ingestProgram(context); + + await expect( + program.parseAsync(['--project-dir', '/tmp/ktx', 'ingest', 'warehouse', '--stages', 'foo'], { from: 'user' }), + ).rejects.toThrow(/unknown stage "foo"/); + expect(publicIngest).not.toHaveBeenCalled(); + }); + + it('rejects --stages combined with text capture', async () => { + const publicIngest = vi.fn(async (_args: unknown) => 0); + const runTextIngest = vi.fn(async () => 0); + const context = makeContext({ deps: { publicIngest } }); + const program = new Command().exitOverride().option('--project-dir '); + registerIngestCommands(program, context, { runTextIngest }); + + await expect( + program.parseAsync(['--project-dir', '/tmp/ktx', 'ingest', '--text', 'hi', '--stages', 'descriptions'], { + from: 'user', + }), + ).rejects.toThrow(/--stages applies 
to database ingest only/); + expect(publicIngest).not.toHaveBeenCalled(); + expect(runTextIngest).not.toHaveBeenCalled(); + }); +}); diff --git a/packages/cli/test/connectors/bigquery/connector.test.ts b/packages/cli/test/connectors/bigquery/connector.test.ts index 11ad69d8..cc0af685 100644 --- a/packages/cli/test/connectors/bigquery/connector.test.ts +++ b/packages/cli/test/connectors/bigquery/connector.test.ts @@ -1,4 +1,5 @@ import { describe, expect, it, vi } from 'vitest'; +import { KtxQueryError } from '../../../src/errors.js'; import { bigQueryConnectionConfigFromConfig, isKtxBigQueryConnectionConfig, type KtxBigQueryClient, KtxBigQueryScanConnector, type KtxBigQueryClientFactory, type KtxBigQueryDataset, type KtxBigQueryQueryJob, type KtxBigQueryTableRef, prepareBigQueryReadOnlyQuery } from '../../../src/connectors/bigquery/connector.js'; import { createBigQueryLiveDatabaseIntrospection } from '../../../src/connectors/bigquery/live-database-introspection.js'; import { tableRefSet } from '../../../src/context/scan/table-ref.js'; @@ -114,11 +115,40 @@ describe('KtxBigQueryScanConnector', () => { expect(isKtxBigQueryConnectionConfig({ driver: 'mysql' })).toBe(false); expect(bigQueryConnectionConfigFromConfig({ connectionId: 'warehouse', connection })).toMatchObject({ projectId: 'project-1', - datasetIds: ['analytics'], + datasetIds: [{ project: 'project-1', dataset: 'analytics' }], location: 'US', }); }); + it('parses project.dataset entries to host-project pairs and rejects malformed entries', () => { + expect( + bigQueryConnectionConfigFromConfig({ + connectionId: 'warehouse', + connection: { + driver: 'bigquery', + dataset_ids: ['bigquery-public-data.austin_311', 'analytics'], + credentials_json: JSON.stringify({ project_id: 'project-1' }), + }, + }).datasetIds, + ).toEqual([ + { project: 'bigquery-public-data', dataset: 'austin_311' }, + { project: 'project-1', dataset: 'analytics' }, + ]); + + for (const badEntry of ['proj.ds.table', 'proj.', '.ds']) { + 
expect(() => + bigQueryConnectionConfigFromConfig({ + connectionId: 'warehouse', + connection: { + driver: 'bigquery', + dataset_ids: [badEntry], + credentials_json: JSON.stringify({ project_id: 'project-1' }), + }, + }), + ).toThrow(/connections\.warehouse/); + } + }); + it('introspects datasets, table metadata, primary keys, and normalized types', async () => { const connector = new KtxBigQueryScanConnector({ connectionId: 'warehouse', @@ -184,6 +214,84 @@ describe('KtxBigQueryScanConnector', () => { ]); }); + it('introspects a foreign-hosted dataset under its own project while billing stays local', async () => { + const clientFactory = fakeClientFactory(); + const connector = new KtxBigQueryScanConnector({ + connectionId: 'warehouse', + connection: { + driver: 'bigquery', + dataset_ids: ['bigquery-public-data.austin_311'], + credentials_json: JSON.stringify({ project_id: 'project-1' }), + location: 'US', + }, + clientFactory, + }); + + const snapshot = await connector.introspect({ connectionId: 'warehouse', driver: 'bigquery' }, { runId: 'foreign' }); + + const client = vi.mocked(clientFactory.createClient).mock.results[0]?.value as KtxBigQueryClient; + expect(client.dataset).toHaveBeenCalledWith('austin_311', 'bigquery-public-data'); + expect(clientFactory.createClient).toHaveBeenCalledWith(expect.objectContaining({ projectId: 'project-1' })); + expect(snapshot.scope).toEqual({ + catalogs: ['bigquery-public-data'], + datasets: ['bigquery-public-data.austin_311'], + }); + expect(snapshot.metadata.project_id).toBe('project-1'); + expect(snapshot.tables[0]).toMatchObject({ + catalog: 'bigquery-public-data', + db: 'austin_311', + name: 'orders', + }); + }); + + it('introspects datasets across multiple host projects, each under its own project', async () => { + const clientFactory = fakeClientFactory(); + const connector = new KtxBigQueryScanConnector({ + connectionId: 'warehouse', + connection: { + driver: 'bigquery', + dataset_ids: 
['bigquery-public-data.austin_311', 'analytics'], + credentials_json: JSON.stringify({ project_id: 'project-1' }), + location: 'US', + }, + clientFactory, + }); + + const snapshot = await connector.introspect({ connectionId: 'warehouse', driver: 'bigquery' }, { runId: 'multi' }); + + const client = vi.mocked(clientFactory.createClient).mock.results[0]?.value as KtxBigQueryClient; + expect(client.dataset).toHaveBeenCalledWith('austin_311', 'bigquery-public-data'); + expect(client.dataset).toHaveBeenCalledWith('analytics', 'project-1'); + expect(snapshot.scope.catalogs).toEqual(['bigquery-public-data', 'project-1']); + expect(snapshot.scope.datasets).toEqual(['bigquery-public-data.austin_311', 'analytics']); + expect(snapshot.tables.map((table) => ({ catalog: table.catalog, db: table.db, name: table.name }))).toEqual([ + { catalog: 'bigquery-public-data', db: 'austin_311', name: 'orders' }, + { catalog: 'project-1', db: 'analytics', name: 'orders' }, + ]); + }); + + it('keeps same-named datasets in different projects distinct', async () => { + const clientFactory = fakeClientFactory(); + const connector = new KtxBigQueryScanConnector({ + connectionId: 'warehouse', + connection: { + driver: 'bigquery', + dataset_ids: ['proj_a.shared', 'proj_b.shared'], + credentials_json: JSON.stringify({ project_id: 'project-1' }), + }, + clientFactory, + }); + + const snapshot = await connector.introspect({ connectionId: 'warehouse', driver: 'bigquery' }, { runId: 'same-name' }); + + expect(snapshot.scope.catalogs).toEqual(['proj_a', 'proj_b']); + expect(snapshot.scope.datasets).toEqual(['proj_a.shared', 'proj_b.shared']); + expect(snapshot.tables.map((table) => `${table.catalog}.${table.db}.${table.name}`)).toEqual([ + 'proj_a.shared.orders', + 'proj_b.shared.orders', + ]); + }); + it.each([ Object.assign(new Error('Access Denied'), { code: 403 }), Object.assign(new Error('Not found'), { errors: [{ reason: 'notFound' }] }), @@ -330,6 +438,50 @@ describe('KtxBigQueryScanConnector', 
() => { expect(skippedGet).not.toHaveBeenCalled(); }); + it('skips a table that fails introspection and ingests its healthy siblings', async () => { + const ordersGet = vi.fn(async (): ReturnType => [ + { metadata: { type: 'TABLE', numRows: '5', schema: { fields: [{ name: 'id', type: 'INT64', mode: 'REQUIRED' }] } } }, + ]); + const brokenGet = vi.fn(async (): ReturnType => { + throw new Error('Access Denied: Table project-1:analytics.locked'); + }); + const clientFactory: KtxBigQueryClientFactory = { + createClient: vi.fn(() => ({ + getDatasets: vi.fn(async (): ReturnType => [[{ id: 'analytics' }]]), + dataset: vi.fn( + (): KtxBigQueryDataset => ({ + get: vi.fn(async () => [{ id: 'analytics' }]), + getTables: vi.fn(async (): ReturnType => [ + [ + { id: 'orders', get: ordersGet }, + { id: 'locked', get: brokenGet }, + ], + ]), + }), + ), + createQueryJob: vi.fn(async (): ReturnType => [ + { + getQueryResults: async (): ReturnType => [ + [], + undefined, + { schema: { fields: [{ name: 'table_name', type: 'STRING' }, { name: 'column_name', type: 'STRING' }] } }, + ], + }, + ]), + })), + }; + const connector = new KtxBigQueryScanConnector({ connectionId: 'warehouse', connection, clientFactory }); + const snapshot = await connector.introspect({ connectionId: 'warehouse', driver: 'bigquery' }, { runId: 'skip-test' }); + + expect(snapshot.tables.map((table) => table.name)).toEqual(['orders']); + expect(snapshot.warnings).toHaveLength(1); + expect(snapshot.warnings?.[0]).toMatchObject({ + code: 'object_introspection_failed', + table: 'locked', + metadata: { object: 'project-1.analytics.locked' }, + }); + }); + it('constructs for discovery without dataset scope and lists tables through one region information schema query', async () => { const createQueryJob = vi.fn( async ( @@ -441,7 +593,7 @@ describe('KtxBigQueryScanConnector', () => { const clientFactory = fakeClientFactory(); const connector = new KtxBigQueryScanConnector({ connectionId: 'warehouse', - connection: { 
...connection, max_bytes_billed: '987654321', job_timeout_ms: 30_000 }, + connection: { ...connection, max_bytes_billed: '987654321', query_timeout_ms: 30_000 }, clientFactory, }); @@ -491,4 +643,35 @@ describe('KtxBigQueryScanConnector', () => { ]), }); }); + + it('maps a BigQuery job timeout to KtxQueryError', async () => { + const timeoutError = new Error('Job execution was cancelled: Job timed out after 5000ms'); + const clientFactory: KtxBigQueryClientFactory = { + createClient: vi.fn(() => ({ + getDatasets: vi.fn(async (): ReturnType => [[{ id: 'analytics' }]]), + dataset: vi.fn( + (datasetId: string): KtxBigQueryDataset => ({ + get: vi.fn(async () => [{ id: datasetId }]), + getTables: vi.fn(async (): ReturnType => [[]]), + }), + ), + createQueryJob: vi.fn(async (): ReturnType => { + throw timeoutError; + }), + })), + }; + const connector = new KtxBigQueryScanConnector({ + connectionId: 'warehouse', + connection: { ...connection, query_timeout_ms: 5_000 }, + clientFactory, + }); + + const execution = connector.executeReadOnly( + { connectionId: 'warehouse', sql: 'select count(*) from `project-1`.`analytics`.`orders`' }, + { runId: 'scan-run-1' }, + ); + await expect(execution).rejects.toBeInstanceOf(KtxQueryError); + await expect(execution).rejects.toThrow('query exceeded 5s'); + await expect(execution).rejects.toMatchObject({ cause: timeoutError }); + }); }); diff --git a/packages/cli/test/connectors/clickhouse/connector.test.ts b/packages/cli/test/connectors/clickhouse/connector.test.ts index aba3143f..b8987b56 100644 --- a/packages/cli/test/connectors/clickhouse/connector.test.ts +++ b/packages/cli/test/connectors/clickhouse/connector.test.ts @@ -1,4 +1,5 @@ import { describe, expect, it, vi } from 'vitest'; +import { KtxQueryError } from '../../../src/errors.js'; import { clickHouseClientConfigFromConfig, isKtxClickHouseConnectionConfig, KtxClickHouseScanConnector, prepareClickHouseReadOnlyQuery, type KtxClickHouseClientFactory } from 
'../../../src/connectors/clickhouse/connector.js'; import { createClickHouseLiveDatabaseIntrospection } from '../../../src/connectors/clickhouse/live-database-introspection.js'; import { tableRefSet } from '../../../src/context/scan/table-ref.js'; @@ -385,6 +386,43 @@ describe('KtxClickHouseScanConnector', () => { await connector.cleanup(); }); + it('applies max_execution_time + an outlasting request_timeout and maps code 159 to KtxQueryError', async () => { + let capturedConfig: { request_timeout?: number; clickhouse_settings?: Record } | undefined; + const timeoutError = Object.assign(new Error('Code: 159. DB::Exception: Timeout exceeded'), { code: 159 }); + const clientFactory: KtxClickHouseClientFactory = { + createClient: vi.fn((config) => { + capturedConfig = config as { request_timeout?: number; clickhouse_settings?: Record }; + return { + query: vi.fn(async () => { + throw timeoutError; + }), + close: vi.fn(async () => undefined), + }; + }), + }; + const connector = new KtxClickHouseScanConnector({ + connectionId: 'warehouse', + connection: { + driver: 'clickhouse', + host: 'ch.example.test', + database: 'analytics', + username: 'reader', + password: 'test-pass', // pragma: allowlist secret + query_timeout_ms: 5_000, + }, + clientFactory, + }); + + const execution = connector.executeReadOnly( + { connectionId: 'warehouse', sql: 'select count(*) from events' }, + { runId: 'scan-run-1' }, + ); + await expect(execution).rejects.toBeInstanceOf(KtxQueryError); + await expect(execution).rejects.toThrow('query exceeded 5s'); + expect(capturedConfig?.clickhouse_settings?.max_execution_time).toBe(5); + expect(capturedConfig?.request_timeout).toBe(10_000); + }); + it('adapts native ClickHouse snapshots to live-database introspection for local ingest', async () => { const introspection = createClickHouseLiveDatabaseIntrospection({ connections: { diff --git a/packages/cli/test/connectors/mysql/connector.test.ts b/packages/cli/test/connectors/mysql/connector.test.ts 
index 829d2b0e..5d376246 100644 --- a/packages/cli/test/connectors/mysql/connector.test.ts +++ b/packages/cli/test/connectors/mysql/connector.test.ts @@ -1,5 +1,6 @@ import { describe, expect, it, vi } from 'vitest'; import type { FieldPacket, RowDataPacket } from 'mysql2/promise'; +import { KtxQueryError } from '../../../src/errors.js'; import { createMysqlLiveDatabaseIntrospection } from '../../../src/connectors/mysql/live-database-introspection.js'; import { isKtxMysqlConnectionConfig, KtxMysqlScanConnector, mysqlConnectionPoolConfigFromConfig, prepareMysqlReadOnlyQuery, type KtxMysqlConnectionConfig, type KtxMysqlPoolFactory } from '../../../src/connectors/mysql/connector.js'; import { tableRefSet } from '../../../src/context/scan/table-ref.js'; @@ -84,6 +85,9 @@ function fakePoolFactory(): KtxMysqlPoolFactory { [{ name: 'column_name' }, { name: 'estimated_cardinality' }], ); } + if (/^\s*SET SESSION max_execution_time/i.test(sql)) { + return mysqlResult([], []); + } throw new Error(`Unexpected SQL: ${sql} params=${JSON.stringify(params)}`); }); const release = vi.fn(); @@ -172,6 +176,9 @@ function multiSchemaMysqlPoolFactory( expect(params).toEqual(['analytics', 'mart']); return mysqlResult([], []); } + if (/^\s*SET SESSION max_execution_time/i.test(sql)) { + return mysqlResult([], []); + } throw new Error(`Unexpected SQL: ${sql} params=${JSON.stringify(params)}`); }); return { @@ -596,4 +603,47 @@ describe('KtxMysqlScanConnector', () => { foreignKeys: [], }); }); + + it('sets session max_execution_time to the resolved deadline and maps errno 3024 to KtxQueryError', async () => { + const issued: Array<{ sql: string; params?: unknown }> = []; + const timeoutError = Object.assign(new Error('Query execution was interrupted, maximum statement execution time exceeded'), { + errno: 3024, + code: 'ER_QUERY_TIMEOUT', + }); + const poolFactory: KtxMysqlPoolFactory = { + createPool: vi.fn(() => ({ + getConnection: vi.fn(async () => ({ + query: vi.fn(async (sql: string, 
params?: unknown) => { + issued.push({ sql, params }); + if (/^\s*SET SESSION max_execution_time/i.test(sql)) { + return [[], []] as [RowDataPacket[], FieldPacket[]]; + } + throw timeoutError; + }), + release: vi.fn(), + })), + end: vi.fn(async () => undefined), + })), + }; + const connector = new KtxMysqlScanConnector({ + connectionId: 'warehouse', + connection: { + driver: 'mysql', + host: 'db.example.test', + database: 'analytics', + username: 'reader', + password: 'test-password', // pragma: allowlist secret + query_timeout_ms: 5_000, + }, + poolFactory, + }); + + const execution = connector.executeReadOnly( + { connectionId: 'warehouse', sql: 'select count(*) from orders' }, + { runId: 'scan-run-1' }, + ); + await expect(execution).rejects.toBeInstanceOf(KtxQueryError); + await expect(execution).rejects.toThrow('query exceeded 5s'); + expect(issued[0]).toEqual({ sql: 'SET SESSION max_execution_time = ?', params: [5_000] }); + }); }); diff --git a/packages/cli/test/connectors/postgres/connector.test.ts b/packages/cli/test/connectors/postgres/connector.test.ts index e43e05a4..aee536d9 100644 --- a/packages/cli/test/connectors/postgres/connector.test.ts +++ b/packages/cli/test/connectors/postgres/connector.test.ts @@ -1,4 +1,5 @@ import { describe, expect, it, vi } from 'vitest'; +import { KtxQueryError } from '../../../src/errors.js'; import { createPostgresLiveDatabaseIntrospection } from '../../../src/connectors/postgres/live-database-introspection.js'; import { isKtxPostgresConnectionConfig, KtxPostgresScanConnector, postgresPoolConfigFromConfig, preparePostgresReadOnlyQuery, type KtxPostgresConnectionConfig, type KtxPostgresPoolFactory } from '../../../src/connectors/postgres/connector.js'; import { tableRefSet } from '../../../src/context/scan/table-ref.js'; @@ -148,7 +149,7 @@ describe('KtxPostgresScanConnector', () => { database: 'analytics', user: 'reader', password: 'test-password', // pragma: allowlist secret - options: '-c 
search_path=analytics,public', + options: '-c search_path=analytics,public -c statement_timeout=30000', ssl: { rejectUnauthorized: false }, }); const libpqPreferConfig = postgresPoolConfigFromConfig({ @@ -401,6 +402,61 @@ describe('KtxPostgresScanConnector', () => { ).rejects.toThrow('Only read-only SELECT/WITH queries can be executed locally'); }); + it('applies the resolved deadline as statement_timeout', () => { + expect( + postgresPoolConfigFromConfig({ + connectionId: 'warehouse', + connection: { + driver: 'postgres', + host: 'db.example.test', + database: 'analytics', + username: 'reader', + password: 'test-password', // pragma: allowlist secret + query_timeout_ms: 5_000, + }, + }).options, + ).toBe('-c search_path=public -c statement_timeout=5000'); + }); + + it('maps a Postgres statement_timeout cancellation (57014) to a KtxQueryError', async () => { + const timeoutError = Object.assign(new Error('canceling statement due to statement timeout'), { code: '57014' }); + const poolFactory: KtxPostgresPoolFactory = { + createPool() { + return { + async connect() { + return { + query: vi.fn(async () => { + throw timeoutError; + }), + release: vi.fn(), + }; + }, + end: vi.fn(async () => undefined), + }; + }, + }; + const connector = new KtxPostgresScanConnector({ + connectionId: 'warehouse', + connection: { + driver: 'postgres', + host: 'db.example.test', + database: 'analytics', + username: 'reader', + password: 'test-password', // pragma: allowlist secret + query_timeout_ms: 5_000, + }, + poolFactory, + }); + + const execution = connector.executeReadOnly( + { connectionId: 'warehouse', sql: 'select count(*) from orders' }, + { runId: 'scan-run-1' }, + ); + await expect(execution).rejects.toBeInstanceOf(KtxQueryError); + await expect(execution).rejects.toThrow('query exceeded 5s'); + await expect(execution).rejects.toMatchObject({ cause: timeoutError }); + }); + it('limits introspection to tables in tableScope', async 
() => { const queries: Array<{ sql: string; params?: unknown[] }> = []; const poolFactory: KtxPostgresPoolFactory = { diff --git a/packages/cli/test/connectors/snowflake/connector.test.ts b/packages/cli/test/connectors/snowflake/connector.test.ts index 1b00061b..fa1ed598 100644 --- a/packages/cli/test/connectors/snowflake/connector.test.ts +++ b/packages/cli/test/connectors/snowflake/connector.test.ts @@ -7,6 +7,7 @@ vi.mock('snowflake-sdk', () => ({ createPool, })); +import { KtxQueryError } from '../../../src/errors.js'; import { createSnowflakeLiveDatabaseIntrospection } from '../../../src/connectors/snowflake/live-database-introspection.js'; import { isKtxSnowflakeConnectionConfig, KtxSnowflakeScanConnector, prepareSnowflakeReadOnlyQuery, snowflakeConnectionConfigFromConfig, type KtxSnowflakeConnectionConfig, type KtxSnowflakeDriver, type KtxSnowflakeDriverFactory } from '../../../src/connectors/snowflake/connector.js'; import { tableRefSet } from '../../../src/context/scan/table-ref.js'; @@ -271,6 +272,57 @@ describe('KtxSnowflakeScanConnector', () => { expect(close).toHaveBeenCalledTimes(1); }); + it('sets STATEMENT_TIMEOUT_IN_SECONDS to the resolved deadline and maps a Snowflake timeout to KtxQueryError', async () => { + createPool.mockReset(); + const executedSql: string[] = []; + const timeoutError = Object.assign( + new Error('Statement reached its statement or warehouse timeout of 5 second(s) and was canceled.'), + { code: 604 }, + ); + const connection = { + execute: vi.fn( + (input: { + sqlText: string; + complete: (error: Error | null, statement: ReturnType, rows: unknown[]) => void; + }) => { + executedSql.push(input.sqlText); + if (/^ALTER SESSION/i.test(input.sqlText)) { + input.complete(null, fakeSnowflakeStatement(), [{ ONE: 1 }]); + } else { + input.complete(timeoutError, fakeSnowflakeStatement(), []); + } + }, + ), + }; + createPool.mockReturnValue({ + use: vi.fn(async (fn: (conn: typeof connection) => Promise) => fn(connection)), + drain: 
vi.fn(async () => undefined), + clear: vi.fn(async () => undefined), + }); + const connector = new KtxSnowflakeScanConnector({ + connectionId: 'warehouse', + connection: { + driver: 'snowflake', + authMethod: 'password', + account: 'acct', + warehouse: 'WH', + database: 'ANALYTICS', + schema_name: 'PUBLIC', + username: 'reader', + password: 'fixture-pass', // pragma: allowlist secret + query_timeout_ms: 5_000, + }, + }); + + const execution = connector.executeReadOnly( + { connectionId: 'warehouse', sql: 'select count(*) from orders' }, + { runId: 'run-1' }, + ); + await expect(execution).rejects.toBeInstanceOf(KtxQueryError); + await expect(execution).rejects.toThrow('query exceeded 5s'); + expect(executedSql[0]).toBe('ALTER SESSION SET STATEMENT_TIMEOUT_IN_SECONDS = 5'); + }); + it('introspects schema, primary keys, comments, row counts, and dimensions', async () => { const connector = new KtxSnowflakeScanConnector({ connectionId: 'warehouse', diff --git a/packages/cli/test/connectors/sqlite/connector.test.ts b/packages/cli/test/connectors/sqlite/connector.test.ts index 27b00c57..154af042 100644 --- a/packages/cli/test/connectors/sqlite/connector.test.ts +++ b/packages/cli/test/connectors/sqlite/connector.test.ts @@ -1,12 +1,19 @@ import Database from 'better-sqlite3'; +import type { ChildProcess } from 'node:child_process'; import { writeFileSync } from 'node:fs'; import { mkdtemp, rm } from 'node:fs/promises'; import { tmpdir } from 'node:os'; import { join } from 'node:path'; -import { afterEach, beforeEach, describe, expect, it } from 'vitest'; +import { afterEach, beforeEach, describe, expect, it, vi } from 'vitest'; import { createSqliteLiveDatabaseIntrospection } from '../../../src/connectors/sqlite/live-database-introspection.js'; -import { isKtxSqliteConnectionConfig, KtxSqliteScanConnector, sqliteDatabasePathFromConfig } from '../../../src/connectors/sqlite/connector.js'; +import { + forkReadQueryChild, + isKtxSqliteConnectionConfig, + 
KtxSqliteScanConnector, + sqliteDatabasePathFromConfig, +} from '../../../src/connectors/sqlite/connector.js'; import { tableRefSet } from '../../../src/context/scan/table-ref.js'; +import { resolveEnabledTables } from '../../../src/context/scan/enabled-tables.js'; describe('KtxSqliteScanConnector', () => { let tempDir: string; @@ -150,6 +157,74 @@ describe('KtxSqliteScanConnector', () => { ]); }); + it('skips an object that fails introspection and ingests the rest with one recoverable warning', async () => { + const brokenDbPath = join(tempDir, 'broken.db'); + const brokenDb = new Database(brokenDbPath); + brokenDb.exec(` + CREATE TABLE base (id INTEGER PRIMARY KEY, start_date TEXT); + CREATE VIEW emp_hire_periods_with_name AS SELECT id, start_date FROM base; + CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL); + INSERT INTO customers (id, name) VALUES (1, 'Ada'); + DROP TABLE base; + `); + brokenDb.close(); + + const connector = new KtxSqliteScanConnector({ + connectionId: 'warehouse', + connection: { driver: 'sqlite', path: brokenDbPath }, + }); + + const snapshot = await connector.introspect({ connectionId: 'warehouse', driver: 'sqlite' }, { runId: 'scan-run-broken' }); + + expect(snapshot.tables.map((table) => table.name)).toEqual(['customers']); + expect(snapshot.warnings).toHaveLength(1); + expect(snapshot.warnings?.[0]).toMatchObject({ + code: 'object_introspection_failed', + table: 'emp_hire_periods_with_name', + recoverable: true, + }); + expect(snapshot.warnings?.[0]?.message).toContain('no such table'); + }); + + it('returns no tables and only warnings when every object fails introspection', async () => { + const brokenDbPath = join(tempDir, 'all-broken.db'); + const brokenDb = new Database(brokenDbPath); + brokenDb.exec(` + CREATE TABLE base (id INTEGER PRIMARY KEY, value TEXT); + CREATE VIEW only_view AS SELECT id, value FROM base; + DROP TABLE base; + `); + brokenDb.close(); + + const connector = new KtxSqliteScanConnector({ + 
connectionId: 'warehouse', + connection: { driver: 'sqlite', path: brokenDbPath }, + }); + + const snapshot = await connector.introspect({ connectionId: 'warehouse', driver: 'sqlite' }, { runId: 'scan-run-all-broken' }); + + expect(snapshot.tables).toEqual([]); + expect(snapshot.warnings).toHaveLength(1); + expect(snapshot.warnings?.[0]?.code).toBe('object_introspection_failed'); + }); + + it('restricts introspection to enabled_tables, accepting both "main." and bare ""', async () => { + const connector = new KtxSqliteScanConnector({ + connectionId: 'warehouse', + connection: { driver: 'sqlite', path: dbPath }, + }); + + for (const entry of ['main.customers', 'customers']) { + const tableScope = resolveEnabledTables({ driver: 'sqlite', enabled_tables: [entry] }) ?? undefined; + const snapshot = await connector.introspect( + { connectionId: 'warehouse', driver: 'sqlite', ...(tableScope ? { tableScope } : {}) }, + { runId: `scan-run-scope-${entry}` }, + ); + expect(snapshot.tables.map((table) => table.name)).toEqual(['customers']); + expect(snapshot.metadata.discovered_object_names).toEqual(['customers', 'orders', 'recent_orders']); + } + }); + it('lists schemaless tables and views for setup discovery', async () => { const connector = new KtxSqliteScanConnector({ connectionId: 'warehouse', @@ -224,6 +299,101 @@ describe('KtxSqliteScanConnector', () => { expect(snapshot.tables.map((table) => table.name)).toEqual(['orders']); }); + describe('bounded read-query execution', () => { + // A recursive CTE that spins ~1e9 iterations in SQLite's VM with no yield + // point — the single-aggregate-row shape that maxRows cannot bound. Natural + // completion is far beyond the test window, so a fast finish proves the + // child was killed, not that the query completed. 
+ const pathologicalSql = + 'WITH RECURSIVE c(x) AS (SELECT 1 UNION ALL SELECT x + 1 FROM c WHERE x < 1000000000) SELECT COUNT(*) AS n FROM c'; + + let children: ChildProcess[]; + const trackingSpawn = () => { + const child = forkReadQueryChild(); + children.push(child); + return child; + }; + + beforeEach(() => { + children = []; + }); + + afterEach(() => { + for (const child of children) { + if (child.exitCode === null && child.signalCode === null) { + child.kill('SIGKILL'); + } + } + }); + + it('terminates a pathological query at the deadline, keeps the event loop free, and reaps the child', async () => { + const connector = new KtxSqliteScanConnector({ + connectionId: 'warehouse', + connection: { driver: 'sqlite', path: dbPath, query_timeout_ms: 250 }, + spawnReadQueryChild: trackingSpawn, + }); + + const pending = connector.executeReadOnly( + { connectionId: 'warehouse', sql: pathologicalSql }, + { runId: 'deadline-test' }, + ); + + // The event loop stays free while the query runs off-process, so this + // concurrent timer fires before the deadline rejects the query. + let concurrentFiredWhilePending = false; + void pending.catch(() => {}); + await new Promise((resolveTimer) => setTimeout(resolveTimer, 80)); + concurrentFiredWhilePending = true; + + await expect(pending).rejects.toThrow(/^query exceeded \d+s$/); + expect(concurrentFiredWhilePending).toBe(true); + + // The off-process executor was actually killed (SIGKILL), not left spinning. 
+ expect(children).toHaveLength(1); + const child = children[0]!; + await vi.waitFor(() => expect(child.exitCode !== null || child.signalCode !== null).toBe(true), { + timeout: 5_000, + }); + expect(child.signalCode).toBe('SIGKILL'); + }); + + it('returns identical results to the in-process path for a normal query', async () => { + const connector = new KtxSqliteScanConnector({ + connectionId: 'warehouse', + connection: { driver: 'sqlite', path: dbPath }, + spawnReadQueryChild: trackingSpawn, + }); + + await expect( + connector.executeReadOnly( + { connectionId: 'warehouse', sql: 'select id, status from orders order by id' }, + { runId: 'normal' }, + ), + ).resolves.toEqual({ + headers: ['id', 'status'], + rows: [ + [10, 'paid'], + [11, 'open'], + ], + totalRows: 2, + rowCount: 2, + }); + }); + + it('rejects invalid SQL on the main thread without spawning a child', async () => { + const connector = new KtxSqliteScanConnector({ + connectionId: 'warehouse', + connection: { driver: 'sqlite', path: dbPath }, + spawnReadQueryChild: trackingSpawn, + }); + + await expect( + connector.executeReadOnly({ connectionId: 'warehouse', sql: 'delete from orders' }, { runId: 'invalid' }), + ).rejects.toThrow('Only read-only SELECT/WITH queries can be executed locally'); + expect(children).toHaveLength(0); + }); + }); + it('adapts native SQLite snapshots to live-database introspection for local ingest', async () => { const introspection = createSqliteLiveDatabaseIntrospection({ projectDir: tempDir, diff --git a/packages/cli/test/connectors/sqlserver/connector.test.ts b/packages/cli/test/connectors/sqlserver/connector.test.ts index 25184a5a..2e55378d 100644 --- a/packages/cli/test/connectors/sqlserver/connector.test.ts +++ b/packages/cli/test/connectors/sqlserver/connector.test.ts @@ -1,4 +1,5 @@ import { describe, expect, it, vi } from 'vitest'; +import { KtxQueryError } from '../../../src/errors.js'; import { createSqlServerLiveDatabaseIntrospection } from 
'../../../src/connectors/sqlserver/live-database-introspection.js'; import { isKtxSqlServerConnectionConfig, KtxSqlServerScanConnector, prepareSqlServerReadOnlyQuery, sqlServerConnectionPoolConfigFromConfig, type KtxSqlServerConnectionConfig, type KtxSqlServerPoolFactory, type KtxSqlServerQueryResult } from '../../../src/connectors/sqlserver/connector.js'; import { tableRefSet } from '../../../src/context/scan/table-ref.js'; @@ -404,6 +405,52 @@ describe('KtxSqlServerScanConnector', () => { await connector.cleanup(); }); + it('sets requestTimeout to the resolved deadline and maps an ETIMEOUT to KtxQueryError', async () => { + expect( + sqlServerConnectionPoolConfigFromConfig({ + connectionId: 'warehouse', + connection: { + driver: 'sqlserver', + host: 'db.example.test', + database: 'analytics', + username: 'reader', + query_timeout_ms: 5_000, + }, + }), + ).toMatchObject({ requestTimeout: 5_000 }); + + const timeoutError = Object.assign(new Error('Timeout: Request failed to complete in 5000ms'), { code: 'ETIMEOUT' }); + const poolFactory: KtxSqlServerPoolFactory = { + createPool: vi.fn(async () => { + const request = { + input: vi.fn(() => request), + query: vi.fn(async () => { + throw timeoutError; + }), + }; + return { request: () => request, close: vi.fn(async () => undefined) }; + }), + }; + const connector = new KtxSqlServerScanConnector({ + connectionId: 'warehouse', + connection: { + driver: 'sqlserver', + host: 'db.example.test', + database: 'analytics', + username: 'reader', + query_timeout_ms: 5_000, + }, + poolFactory, + }); + + const execution = connector.executeReadOnly( + { connectionId: 'warehouse', sql: 'select count(*) from dbo.orders' }, + { runId: 'scan-run-1' }, + ); + await expect(execution).rejects.toBeInstanceOf(KtxQueryError); + await expect(execution).rejects.toThrow('query exceeded 5s'); + }); + it('hoists leading CTEs before applying the SQL Server TOP wrapper', async () => { const queries: string[] = []; const request = { diff --git 
a/packages/cli/test/context/connections/configured-connections.test.ts b/packages/cli/test/context/connections/configured-connections.test.ts new file mode 100644 index 00000000..15163ebd --- /dev/null +++ b/packages/cli/test/context/connections/configured-connections.test.ts @@ -0,0 +1,26 @@ +import { describe, expect, it } from 'vitest'; +import type { KtxProjectConnectionConfig } from '../../../src/context/project/config.js'; +import { assertConfiguredConnectionId } from '../../../src/context/connections/configured-connections.js'; + +const connections = { + sales_db: { driver: 'sqlite' } as unknown as KtxProjectConnectionConfig, + events_db: { driver: 'sqlite' } as unknown as KtxProjectConnectionConfig, +}; + +describe('assertConfiguredConnectionId', () => { + it('returns the id when configured', () => { + expect(assertConfiguredConnectionId(connections, 'sales_db')).toBe('sales_db'); + }); + + it('throws listing the configured ids when unknown', () => { + expect(() => assertConfiguredConnectionId(connections, 'warehouse')).toThrow( + 'Unknown connection "warehouse". Configured connections: events_db, sales_db.', + ); + }); + + it('reports none configured for an empty connections map', () => { + expect(() => assertConfiguredConnectionId({}, 'warehouse')).toThrow( + 'Unknown connection "warehouse". 
Configured connections: (none configured).', + ); + }); +}); diff --git a/packages/cli/test/context/connections/query-deadline.test.ts b/packages/cli/test/context/connections/query-deadline.test.ts new file mode 100644 index 00000000..747badd4 --- /dev/null +++ b/packages/cli/test/context/connections/query-deadline.test.ts @@ -0,0 +1,36 @@ +import { describe, expect, it } from 'vitest'; +import { KtxQueryError } from '../../../src/errors.js'; +import { + DEFAULT_QUERY_TIMEOUT_MS, + queryDeadlineExceededError, + resolveQueryDeadlineMs, +} from '../../../src/context/connections/query-deadline.js'; + +describe('resolveQueryDeadlineMs', () => { + it('returns the 30s default when no override is set', () => { + expect(DEFAULT_QUERY_TIMEOUT_MS).toBe(30_000); + expect(resolveQueryDeadlineMs(undefined)).toBe(30_000); + expect(resolveQueryDeadlineMs({ driver: 'sqlite' })).toBe(30_000); + }); + + it('honors a positive-integer query_timeout_ms override', () => { + expect(resolveQueryDeadlineMs({ query_timeout_ms: 5_000 })).toBe(5_000); + expect(resolveQueryDeadlineMs({ query_timeout_ms: 1 })).toBe(1); + }); + + it('rejects a zero, negative, or non-integer override', () => { + expect(() => resolveQueryDeadlineMs({ query_timeout_ms: 0 })).toThrow(/positive integer/); + expect(() => resolveQueryDeadlineMs({ query_timeout_ms: -5 })).toThrow(/positive integer/); + expect(() => resolveQueryDeadlineMs({ query_timeout_ms: 1.5 })).toThrow(/positive integer/); + expect(() => resolveQueryDeadlineMs({ query_timeout_ms: '5000' as unknown as number })).toThrow(/positive integer/); + }); +}); + +describe('queryDeadlineExceededError', () => { + it('is a KtxQueryError with the canonical seconds-rounded message', () => { + const error = queryDeadlineExceededError(30_000); + expect(error).toBeInstanceOf(KtxQueryError); + expect(error.message).toBe('query exceeded 30s'); + expect(queryDeadlineExceededError(45_000).message).toBe('query exceeded 45s'); + }); +}); diff --git 
a/packages/cli/test/context/ingest/adapters/historic-sql/query-history-filter-picker.test.ts b/packages/cli/test/context/ingest/adapters/historic-sql/query-history-filter-picker.test.ts index 5c9e2e60..5b5dabf2 100644 --- a/packages/cli/test/context/ingest/adapters/historic-sql/query-history-filter-picker.test.ts +++ b/packages/cli/test/context/ingest/adapters/historic-sql/query-history-filter-picker.test.ts @@ -91,6 +91,7 @@ function llm(decisions: Array<{ role: string; exclude: boolean; reason: string } generateText: vi.fn(), generateObject, runAgentLoop: vi.fn(), + subprocessForkSpec: () => null, }; } diff --git a/packages/cli/test/context/ingest/adapters/live-database/daemon-introspection.test.ts b/packages/cli/test/context/ingest/adapters/live-database/daemon-introspection.test.ts index 5cc6affb..9798ef7c 100644 --- a/packages/cli/test/context/ingest/adapters/live-database/daemon-introspection.test.ts +++ b/packages/cli/test/context/ingest/adapters/live-database/daemon-introspection.test.ts @@ -130,6 +130,39 @@ describe('createDaemonLiveDatabaseIntrospection', () => { }); }); + it('maps daemon warnings into the snapshot and drops codes Node cannot render', async () => { + const runJson = vi.fn(async () => ({ + ...daemonResponse, + tables: [], + warnings: [ + { + code: 'object_introspection_failed', + message: 'permission denied for relation locked', + table: 'locked', + recoverable: true, + metadata: { object: 'public.locked' }, + }, + { code: 'totally_unknown_code', message: 'ignored', recoverable: true }, + ], + })); + const introspection = createDaemonLiveDatabaseIntrospection({ + connections: { warehouse: { driver: 'postgres', url: 'postgres://localhost:5432/warehouse' } }, + schemas: ['public'], + runJson, + }); + + const snapshot = await introspection.extractSchema('warehouse'); + expect(snapshot.warnings).toEqual([ + { + code: 'object_introspection_failed', + message: 'permission denied for relation locked', + table: 'locked', + recoverable: true, + 
metadata: { object: 'public.locked' }, + }, + ]); + }); + it('calls a running daemon HTTP endpoint when baseUrl is configured', async () => { const requests: Array<{ url: string | undefined; body: unknown }> = []; const server = createServer((request, response) => { diff --git a/packages/cli/test/context/ingest/adapters/live-database/live-database.adapter.test.ts b/packages/cli/test/context/ingest/adapters/live-database/live-database.adapter.test.ts index 72c31446..f6465792 100644 --- a/packages/cli/test/context/ingest/adapters/live-database/live-database.adapter.test.ts +++ b/packages/cli/test/context/ingest/adapters/live-database/live-database.adapter.test.ts @@ -1,9 +1,14 @@ -import { mkdtemp, readdir, rm } from 'node:fs/promises'; +import Database from 'better-sqlite3'; +import { mkdtemp, readdir, readFile, rm } from 'node:fs/promises'; import { tmpdir } from 'node:os'; import { join } from 'node:path'; -import { describe, expect, it, vi } from 'vitest'; +import { afterEach, beforeEach, describe, expect, it, vi } from 'vitest'; import { tableRefSet, type KtxTableRefKey } from '../../../../../src/context/scan/table-ref.js'; import { LiveDatabaseSourceAdapter } from '../../../../../src/context/ingest/adapters/live-database/live-database.adapter.js'; +import { createSqliteLiveDatabaseIntrospection } from '../../../../../src/connectors/sqlite/live-database-introspection.js'; +import { resolveEnabledTables } from '../../../../../src/context/scan/enabled-tables.js'; +import { KtxExpectedError } from '../../../../../src/errors.js'; +import type { FetchContext } from '../../../../../src/context/ingest/types.js'; describe('LiveDatabaseSourceAdapter', () => { it('fetches a schema snapshot through the introspection port', async () => { @@ -109,3 +114,106 @@ describe('LiveDatabaseSourceAdapter', () => { } }); }); + +describe('LiveDatabaseSourceAdapter (sqlite) tolerant scan', () => { + const CONNECTION_ID = 'warehouse'; + let tempDir: string; + + beforeEach(async () => { + 
tempDir = await mkdtemp(join(tmpdir(), 'ktx-live-db-tolerant-')); + }); + + afterEach(async () => { + await rm(tempDir, { recursive: true, force: true }); + }); + + function adapterFor(dbPath: string): LiveDatabaseSourceAdapter { + return new LiveDatabaseSourceAdapter({ + introspection: createSqliteLiveDatabaseIntrospection({ + projectDir: tempDir, + connections: { [CONNECTION_ID]: { driver: 'sqlite', path: dbPath } }, + }), + }); + } + + function ctx(overrides: Partial<FetchContext> = {}): FetchContext { + return { connectionId: CONNECTION_ID, sourceKey: 'live-database', ...overrides }; + } + + it('ingests healthy objects and reports the broken view as a skip', async () => { + const dbPath = join(tempDir, 'partial.db'); + const db = new Database(dbPath); + db.exec(` + CREATE TABLE base (id INTEGER PRIMARY KEY, start_date TEXT); + CREATE VIEW emp_hire_periods_with_name AS SELECT id, start_date FROM base; + CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL); + DROP TABLE base; + `); + db.close(); + + const adapter = adapterFor(dbPath); + const stagedDir = join(tempDir, 'staged-partial'); + await adapter.fetch(undefined, stagedDir, ctx()); + + await expect(adapter.detect(stagedDir)).resolves.toBe(true); + + const warnings = JSON.parse(await readFile(join(stagedDir, 'warnings.json'), 'utf8')) as { + warnings: Array<{ code: string; table?: string }>; + }; + expect(warnings.warnings).toHaveLength(1); + expect(warnings.warnings[0]).toMatchObject({ + code: 'object_introspection_failed', + table: 'emp_hire_periods_with_name', + }); + + const report = await adapter.readFetchReport(stagedDir); + expect(report?.skipped.map((issue) => issue.entityId)).toEqual(['emp_hire_periods_with_name']); + }); + + it('raises a clear connection error when every object fails introspection', async () => { + const dbPath = join(tempDir, 'all-broken.db'); + const db = new Database(dbPath); + db.exec(` + CREATE TABLE base (id INTEGER PRIMARY KEY, value TEXT); + CREATE VIEW only_view AS SELECT 
id, value FROM base; + DROP TABLE base; + `); + db.close(); + + const adapter = adapterFor(dbPath); + await expect(adapter.fetch(undefined, join(tempDir, 'staged-all-broken'), ctx())).rejects.toThrow(KtxExpectedError); + }); + + it('treats a genuinely empty database as a recognized, empty success', async () => { + const dbPath = join(tempDir, 'empty.db'); + new Database(dbPath).close(); + + const adapter = adapterFor(dbPath); + const stagedDir = join(tempDir, 'staged-empty'); + await adapter.fetch(undefined, stagedDir, ctx()); + await expect(adapter.detect(stagedDir)).resolves.toBe(true); + await expect(adapter.readFetchReport(stagedDir)).resolves.toBeNull(); + }); + + it('ingests exactly the enabled_tables subset and fails clearly on a zero-match scope', async () => { + const dbPath = join(tempDir, 'scoped.db'); + const db = new Database(dbPath); + db.exec(` + CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT); + CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER); + `); + db.close(); + const adapter = adapterFor(dbPath); + + const scope = resolveEnabledTables({ driver: 'sqlite', enabled_tables: ['main.customers'] }) ?? undefined; + const stagedDir = join(tempDir, 'staged-scoped'); + await adapter.fetch(undefined, stagedDir, ctx({ tableScope: scope })); + const meta = JSON.parse(await readFile(join(stagedDir, 'connection.json'), 'utf8')) as { tableCount: number }; + expect(meta.tableCount).toBe(1); + + const typoScope = resolveEnabledTables({ driver: 'sqlite', enabled_tables: ['nope'] }) ?? 
undefined; + await expect( + adapter.fetch(undefined, join(tempDir, 'staged-zero'), ctx({ tableScope: typoScope })), + ).rejects.toThrow(/matched no objects.*Available objects: customers, orders/s); + }); +}); diff --git a/packages/cli/test/context/ingest/adapters/live-database/manifest.test.ts b/packages/cli/test/context/ingest/adapters/live-database/manifest.test.ts index d32868ec..b3916d7e 100644 --- a/packages/cli/test/context/ingest/adapters/live-database/manifest.test.ts +++ b/packages/cli/test/context/ingest/adapters/live-database/manifest.test.ts @@ -14,7 +14,7 @@ describe('buildLiveDatabaseManifestShards', () => { it('builds shard objects with generated joins and preserved external descriptions', () => { const existingDescriptions = new Map([ [ - 'orders', + 'public.orders', { table: { user: 'Pinned analyst description', db: 'Old db description' }, columns: new Map([['id', { user: 'Pinned id description', db: 'Old id description' }]]), @@ -189,7 +189,7 @@ describe('buildLiveDatabaseManifestShards', () => { it('preserves external usage keys while replacing historic SQL managed keys', () => { const existingUsage = new Map([ [ - 'orders', + 'public.orders', { narrative: 'Old generated usage narrative.', frequencyTier: 'low' as const, diff --git a/packages/cli/test/context/ingest/adapters/live-database/scan-outcome.test.ts b/packages/cli/test/context/ingest/adapters/live-database/scan-outcome.test.ts new file mode 100644 index 00000000..afe4ce9c --- /dev/null +++ b/packages/cli/test/context/ingest/adapters/live-database/scan-outcome.test.ts @@ -0,0 +1,65 @@ +import { describe, expect, it } from 'vitest'; +import { assertLiveDatabaseScanOutcome } from '../../../../../src/context/ingest/adapters/live-database/scan-outcome.js'; +import { tableRefSet } from '../../../../../src/context/scan/table-ref.js'; +import type { KtxSchemaSnapshot, KtxSchemaTable } from '../../../../../src/context/scan/types.js'; + +function table(name: string): KtxSchemaTable { + return { 
catalog: null, db: null, name, kind: 'table', comment: null, estimatedRows: 0, columns: [], foreignKeys: [] }; +} + +function snapshot(overrides: Partial<KtxSchemaSnapshot>): KtxSchemaSnapshot { + return { + connectionId: 'warehouse', + driver: 'sqlite', + extractedAt: '2026-06-14T00:00:00.000Z', + scope: {}, + metadata: {}, + tables: [], + ...overrides, + }; +} + +describe('assertLiveDatabaseScanOutcome', () => { + it('passes when at least one object was ingested, even with skips', () => { + expect(() => + assertLiveDatabaseScanOutcome({ + connectionId: 'warehouse', + scope: undefined, + snapshot: snapshot({ + tables: [table('customers')], + warnings: [{ code: 'object_introspection_failed', message: 'boom', table: 'broken', recoverable: true }], + }), + }), + ).not.toThrow(); + }); + + it('passes for a legitimately empty database (no scope, no objects)', () => { + expect(() => + assertLiveDatabaseScanOutcome({ connectionId: 'warehouse', scope: undefined, snapshot: snapshot({}) }), + ).not.toThrow(); + }); + + it('fails clearly when every introspected object failed', () => { + expect(() => + assertLiveDatabaseScanOutcome({ + connectionId: 'warehouse', + scope: undefined, + snapshot: snapshot({ + warnings: [ + { code: 'object_introspection_failed', message: 'no such table: base', table: 'only_view', recoverable: true }, + ], + }), + }), + ).toThrow(/all 1 introspected object failed.*only_view: no such table: base/s); + }); + + it('fails clearly when a non-empty enabled_tables scope matched nothing, naming available objects', () => { + expect(() => + assertLiveDatabaseScanOutcome({ + connectionId: 'warehouse', + scope: tableRefSet([{ catalog: null, db: null, name: 'typo_table' }]), + snapshot: snapshot({ metadata: { discovered_object_names: ['customers', 'orders'] } }), + }), + ).toThrow(/matched no objects.*typo_table.*Available objects: customers, orders/s); + }); +}); diff --git a/packages/cli/test/context/ingest/local-bundle-runtime.test.ts
b/packages/cli/test/context/ingest/local-bundle-runtime.test.ts index 3ca0c490..6f7c99dd 100644 --- a/packages/cli/test/context/ingest/local-bundle-runtime.test.ts +++ b/packages/cli/test/context/ingest/local-bundle-runtime.test.ts @@ -91,6 +91,7 @@ describe('createLocalBundleIngestRuntime', () => { generateText: vi.fn(), generateObject: vi.fn(), runAgentLoop: vi.fn(async () => ({ stopReason: 'natural' as const })), + subprocessForkSpec: vi.fn(() => null), }; project.config.llm = { provider: { backend: 'claude-code' }, diff --git a/packages/cli/test/context/llm/local-config.test.ts b/packages/cli/test/context/llm/local-config.test.ts index d89cf385..d13ec470 100644 --- a/packages/cli/test/context/llm/local-config.test.ts +++ b/packages/cli/test/context/llm/local-config.test.ts @@ -137,16 +137,19 @@ describe('local ktx LLM config', () => { generateText: vi.fn(), generateObject: vi.fn(), runAgentLoop: vi.fn(), + subprocessForkSpec: vi.fn(() => null), })); const createCodexRuntime = vi.fn(() => ({ generateText: vi.fn(), generateObject: vi.fn(), runAgentLoop: vi.fn(), + subprocessForkSpec: vi.fn(() => null), })); const createAiSdkRuntime = vi.fn(() => ({ generateText: vi.fn(), generateObject: vi.fn(), runAgentLoop: vi.fn(), + subprocessForkSpec: vi.fn(() => null), })); const createKtxLlmProvider = vi.fn(() => ({ getModel: vi.fn(), diff --git a/packages/cli/test/context/llm/subprocess-generate-object.test.ts b/packages/cli/test/context/llm/subprocess-generate-object.test.ts new file mode 100644 index 00000000..64deca54 --- /dev/null +++ b/packages/cli/test/context/llm/subprocess-generate-object.test.ts @@ -0,0 +1,138 @@ +import { type ChildProcess } from 'node:child_process'; +import { mkdtempSync, readFileSync, rmSync } from 'node:fs'; +import { tmpdir } from 'node:os'; +import { join } from 'node:path'; +import { afterEach, beforeEach, describe, expect, it, vi } from 'vitest'; +import { z } from 'zod'; +import { isAbortError } from 
'../../../src/context/core/abort.js'; +import { + KtxSubprocessDeadlineError, + runGenerateObjectInSubprocess, +} from '../../../src/context/llm/subprocess-generate-object.js'; +import type { SubprocessRuntimeForkSpec } from '../../../src/context/llm/runtime-port.js'; +import { HANGING_CHILD, killTestChildren, RESPONDING_CHILD, spawnTestChild } from './subprocess-test-children.test-utils.js'; + +const FORK_SPEC: SubprocessRuntimeForkSpec = { backend: 'codex', projectDir: '/tmp', modelSlots: { default: 'codex' } }; + +function isAlive(pid: number): boolean { + try { + process.kill(pid, 0); + return true; + } catch { + return false; + } +} + +describe('runGenerateObjectInSubprocess', () => { + let children: ChildProcess[]; + let workDir: string; + + function forkFake(code: string, env: Record<string, string> = {}): () => ChildProcess { + return () => spawnTestChild(children, code, env); + } + + beforeEach(() => { + children = []; + workDir = mkdtempSync(join(tmpdir(), 'ktx-subproc-')); + }); + + afterEach(() => { + killTestChildren(children); + rmSync(workDir, { recursive: true, force: true }); + }); + + it('tree-kills a wedged child at the deadline and reaps its grandchild', async () => { + const pidFile = join(workDir, 'gc.pid'); + const start = Date.now(); + const pending = runGenerateObjectInSubprocess({ + forkSpec: FORK_SPEC, + role: 'candidateExtraction', + prompt: 'x', + schema: z.object({ answer: z.string() }), + jsonSchema: { type: 'object' }, + deadlineMs: 300, + spawnChild: forkFake(HANGING_CHILD, { KTX_TEST_GC_PID_FILE: pidFile }), + }); + + await expect(pending).rejects.toBeInstanceOf(KtxSubprocessDeadlineError); + // Settled within the deadline plus a small grace, not left wedged.
+ expect(Date.now() - start).toBeLessThan(3000); + + const child = children[0]!; + await vi.waitFor(() => expect(child.exitCode !== null || child.signalCode !== null).toBe(true), { timeout: 5000 }); + expect(child.signalCode).toBe('SIGKILL'); + + const grandchildPid = Number(readFileSync(pidFile, 'utf8')); + expect(Number.isInteger(grandchildPid)).toBe(true); + await vi.waitFor(() => expect(isAlive(grandchildPid)).toBe(false), { timeout: 5000 }); + }); + + it('tree-kills the same way on an external abort', async () => { + const pidFile = join(workDir, 'gc.pid'); + const controller = new AbortController(); + const pending = runGenerateObjectInSubprocess({ + forkSpec: FORK_SPEC, + role: 'candidateExtraction', + prompt: 'x', + schema: z.object({ answer: z.string() }), + jsonSchema: { type: 'object' }, + deadlineMs: 60_000, + signal: controller.signal, + spawnChild: forkFake(HANGING_CHILD, { KTX_TEST_GC_PID_FILE: pidFile }), + }); + void pending.catch(() => undefined); + + await vi.waitFor(() => expect(() => readFileSync(pidFile, 'utf8')).not.toThrow(), { timeout: 5000 }); + controller.abort(); + + await expect(pending).rejects.toSatisfy(isAbortError); + const child = children[0]!; + await vi.waitFor(() => expect(child.exitCode !== null || child.signalCode !== null).toBe(true), { timeout: 5000 }); + const grandchildPid = Number(readFileSync(pidFile, 'utf8')); + await vi.waitFor(() => expect(isAlive(grandchildPid)).toBe(false), { timeout: 5000 }); + }); + + it('resolves with the schema-validated output on success', async () => { + await expect( + runGenerateObjectInSubprocess({ + forkSpec: FORK_SPEC, + role: 'candidateExtraction', + prompt: 'x', + schema: z.object({ answer: z.string() }), + jsonSchema: { type: 'object' }, + deadlineMs: 5_000, + spawnChild: forkFake(RESPONDING_CHILD), + }), + ).resolves.toEqual({ answer: 'yes' }); + }); + + it('rejects when the child output fails schema validation', async () => { + await expect( + runGenerateObjectInSubprocess({ + 
forkSpec: FORK_SPEC, + role: 'candidateExtraction', + prompt: 'x', + schema: z.object({ answer: z.string() }), + jsonSchema: { type: 'object' }, + deadlineMs: 5_000, + spawnChild: forkFake(RESPONDING_CHILD, { KTX_TEST_RESPONSE: '{"ok":true,"output":{"wrong":1}}' }), + }), + ).rejects.toThrow(); + }); + + it('rejects with the child error message when the child reports failure', async () => { + await expect( + runGenerateObjectInSubprocess({ + forkSpec: FORK_SPEC, + role: 'candidateExtraction', + prompt: 'x', + schema: z.object({ answer: z.string() }), + jsonSchema: { type: 'object' }, + deadlineMs: 5_000, + spawnChild: forkFake(RESPONDING_CHILD, { + KTX_TEST_RESPONSE: '{"ok":false,"message":"backend overloaded"}', + }), + }), + ).rejects.toThrow('backend overloaded'); + }); +}); diff --git a/packages/cli/test/context/llm/subprocess-test-children.test-utils.ts b/packages/cli/test/context/llm/subprocess-test-children.test-utils.ts new file mode 100644 index 00000000..8a4c4fcc --- /dev/null +++ b/packages/cli/test/context/llm/subprocess-test-children.test-utils.ts @@ -0,0 +1,45 @@ +import { spawn, type ChildProcess } from 'node:child_process'; + +// A wedged subprocess-backed call: the child ignores SIGTERM (as a child hung on a +// provider socket does), spawns a grandchild (the SDK's model binary stand-in) that +// also ignores SIGTERM, and never replies. Only a SIGKILL of the whole process group +// reaps it. 
+export const HANGING_CHILD = ` +process.on('SIGTERM', () => {}); +const { spawn } = require('node:child_process'); +const { writeFileSync } = require('node:fs'); +process.on('message', () => { + const gc = spawn(process.execPath, ['-e', 'process.on("SIGTERM",()=>{});setInterval(()=>{},1000000)'], { stdio: 'ignore' }); + gc.unref(); + if (process.env.KTX_TEST_GC_PID_FILE) writeFileSync(process.env.KTX_TEST_GC_PID_FILE, String(gc.pid)); +}); +`; + +export const RESPONDING_CHILD = ` +process.on('message', () => { + const raw = process.env.KTX_TEST_RESPONSE || '{"ok":true,"output":{"answer":"yes"}}'; + process.send(JSON.parse(raw), () => process.exit(0)); +}); +`; + +export function spawnTestChild(registry: ChildProcess[], code: string, env: Record<string, string> = {}): ChildProcess { + const child = spawn(process.execPath, ['-e', code], { + detached: true, + stdio: ['ignore', 'ignore', 'inherit', 'ipc'], + env: { ...process.env, ...env }, + }); + registry.push(child); + return child; +} + +export function killTestChildren(registry: ChildProcess[]): void { + for (const child of registry) { + if (child.pid !== undefined && child.exitCode === null && child.signalCode === null) { + try { + process.kill(-child.pid, 'SIGKILL'); + } catch { + // Already exited. + } + } + } +} diff --git a/packages/cli/test/context/mcp/__snapshots__/mcp-tools-list.json b/packages/cli/test/context/mcp/__snapshots__/mcp-tools-list.json index 8a78009f..6015b006 100644 --- a/packages/cli/test/context/mcp/__snapshots__/mcp-tools-list.json +++ b/packages/cli/test/context/mcp/__snapshots__/mcp-tools-list.json @@ -63,7 +63,7 @@ { "name": "wiki_search", "title": "Wiki Search", - "description": "Search ktx wiki pages for reusable business context. Example: wiki_search({ query: \"revenue recognition\", limit: 5 }).", + "description": "Search ktx wiki pages for reusable business context.
Pass connectionId to scope results to one warehouse (unscoped pages plus pages tagged with that connection) when a concept name collides across databases. Example: wiki_search({ query: \"revenue recognition\", connectionId: \"warehouse\", limit: 5 }).", "inputSchema": { "type": "object", "properties": { @@ -78,6 +78,11 @@ "type": "integer", "minimum": 1, "maximum": 50 + }, + "connectionId": { + "description": "Scope results to one connection: returns unscoped pages plus pages tagged with this connection. Omit to search all pages.", + "type": "string", + "minLength": 1 } }, "required": [ @@ -1478,6 +1483,55 @@ "taskSupport": "forbidden" } }, + { + "name": "sql_dialect_notes", + "title": "SQL Dialect Notes", + "description": "Return the SQL syntax conventions for the dialect of a ktx connection: fully-qualified table-name form, identifier quoting and case-folding, date/time functions, top-N / window-filtering idiom, and JSON access. Call this before writing raw sql_execution SQL against a connection so the SQL matches that engine. Example: sql_dialect_notes({ connectionId: \"warehouse\" }).", + "inputSchema": { + "type": "object", + "properties": { + "connectionId": { + "type": "string", + "minLength": 1, + "description": "Connection id whose engine dialect conventions to return." 
+ } + }, + "required": [ + "connectionId" + ], + "$schema": "http://json-schema.org/draft-07/schema#" + }, + "outputSchema": { + "type": "object", + "properties": { + "connectionId": { + "type": "string" + }, + "dialect": { + "type": "string" + }, + "notes": { + "type": "string" + } + }, + "required": [ + "connectionId", + "dialect", + "notes" + ], + "$schema": "http://json-schema.org/draft-07/schema#", + "additionalProperties": false + }, + "annotations": { + "title": "SQL Dialect Notes", + "readOnlyHint": true, + "idempotentHint": true, + "openWorldHint": false + }, + "execution": { + "taskSupport": "forbidden" + } + }, { "name": "memory_ingest", "title": "Memory Ingest", diff --git a/packages/cli/test/context/mcp/dialect-notes.test.ts b/packages/cli/test/context/mcp/dialect-notes.test.ts new file mode 100644 index 00000000..27e9d922 --- /dev/null +++ b/packages/cli/test/context/mcp/dialect-notes.test.ts @@ -0,0 +1,111 @@ +import { readdirSync } from 'node:fs'; +import { fileURLToPath } from 'node:url'; +import { describe, expect, it } from 'vitest'; +import { KtxExpectedError } from '../../../src/errors.js'; +import { KTX_DATABASE_DRIVER_IDS } from '../../../src/connection-drivers.js'; +import type { KtxProjectConnectionConfig } from '../../../src/context/project/config.js'; +import { sqlAnalysisDialectForDriver } from '../../../src/context/sql-analysis/dialect.js'; +import { DIALECTS_WITH_NOTES, sqlDialectNotes } from '../../../src/context/sql-analysis/dialect-notes.js'; +import { resolveDialectNotesForConnection } from '../../../src/context/mcp/local-project-ports.js'; + +function conn(driver: string): KtxProjectConnectionConfig { + return { driver } as KtxProjectConnectionConfig; +} + +describe('per-dialect SQL notes', () => { + it('covers every dialect reachable from a configured warehouse driver', () => { + // Derived from the connector registry, not a hand-maintained list: a new + // warehouse driver whose resolved dialect lacks authored notes fails here. 
+ for (const driver of KTX_DATABASE_DRIVER_IDS) { + const dialect = sqlAnalysisDialectForDriver(driver); + expect(DIALECTS_WITH_NOTES, `driver "${driver}" resolves to dialect "${dialect}"`).toContain(dialect); + expect(sqlDialectNotes(dialect).length).toBeGreaterThan(0); + } + }); + + it('keeps the authored-dialect list and the ./dialects markdown files in sync', () => { + const dir = fileURLToPath(new URL('../../../src/context/sql-analysis/dialects/', import.meta.url)); + const files = readdirSync(dir) + .filter((name) => name.endsWith('.md')) + .map((name) => name.replace(/\.md$/, '')) + .sort(); + expect(files).toEqual([...DIALECTS_WITH_NOTES].sort()); + }); + + it('does not author notes for unreachable dialects', () => { + // duckdb/databricks appear in the resolver map but no connector produces them. + expect(DIALECTS_WITH_NOTES).not.toContain('duckdb'); + expect(DIALECTS_WITH_NOTES).not.toContain('databricks'); + }); + + it('answers the full rubric for every dialect', () => { + for (const dialect of DIALECTS_WITH_NOTES) { + const notes = sqlDialectNotes(dialect); + expect(notes, `${dialect}: FQTN`).toContain('**FQTN:**'); + expect(notes, `${dialect}: identifiers`).toContain('**Identifiers:**'); + expect(notes, `${dialect}: date/time`).toContain('**Date/time:**'); + expect(notes, `${dialect}: top-N`).toMatch(/\*\*Top-N/); + expect(notes, `${dialect}: series`).toMatch(/\*\*Series/); + expect(notes, `${dialect}: rolling window`).toMatch(/\*\*Rolling/); + expect(notes, `${dialect}: safe cast`).toMatch(/\*\*Safe cast/); + expect(notes, `${dialect}: semi-structured`).toMatch(/\*\*(JSON|Semi-structured)/); + } + }); + + it('gives each engine its own idioms and never leaks another engine-only construct', () => { + // A sqlite analyst gets sqlite date idioms and never Snowflake/BigQuery-only syntax. 
+ expect(sqlDialectNotes('sqlite')).toMatch(/strftime|julianday/); + expect(sqlDialectNotes('sqlite')).not.toContain('VARIANT'); + expect(sqlDialectNotes('sqlite')).not.toContain('_TABLE_SUFFIX'); + + // QUALIFY appears only for the engines that actually support it. + expect(sqlDialectNotes('snowflake')).toContain('QUALIFY'); + expect(sqlDialectNotes('bigquery')).toContain('QUALIFY'); + for (const dialect of ['postgres', 'mysql', 'sqlite', 'clickhouse', 'tsql'] as const) { + expect(sqlDialectNotes(dialect), `${dialect} must not mention QUALIFY`).not.toContain('QUALIFY'); + } + + // Engine-exclusive markers stay in their own dialect. + expect(sqlDialectNotes('snowflake')).toContain('VARIANT'); + expect(sqlDialectNotes('snowflake')).toContain('DATABASE.SCHEMA.TABLE'); + expect(sqlDialectNotes('bigquery')).toContain('_TABLE_SUFFIX'); + expect(sqlDialectNotes('clickhouse')).toContain('LIMIT n BY'); + expect(sqlDialectNotes('tsql')).toContain('TOP (n)'); + }); + + it('contains no benchmark/grader or version-dated content', () => { + for (const dialect of DIALECTS_WITH_NOTES) { + const notes = sqlDialectNotes(dialect); + expect(notes).not.toMatch(/\bspider\b|\bbenchmark\b|\bgold\b|\bgrader\b/i); + expect(notes).not.toMatch(/\bas of v(ersion)?\b/i); + } + }); + + it('falls back to postgres notes for a dialect without its own file', () => { + expect(sqlAnalysisDialectForDriver('some-future-engine')).toBe('postgres'); + // redshift is a valid SqlAnalysisDialect but intentionally unauthored. 
+ expect(sqlDialectNotes('redshift')).toBe(sqlDialectNotes('postgres')); + }); +}); + +describe('resolveDialectNotesForConnection', () => { + it('resolves a warehouse connection to its dialect notes', () => { + expect(resolveDialectNotesForConnection('wh', conn('sqlite'))).toMatchObject({ + connectionId: 'wh', + dialect: 'sqlite', + }); + expect(resolveDialectNotesForConnection('wh', conn('snowflake')).dialect).toBe('snowflake'); + // The sqlserver driver resolves to the tsql dialect (resolver codomain). + expect(resolveDialectNotesForConnection('wh', conn('sqlserver')).dialect).toBe('tsql'); + }); + + it('rejects a non-SQL context source with a clear expected error, not postgres notes', () => { + expect(() => resolveDialectNotesForConnection('mb', conn('metabase'))).toThrow(KtxExpectedError); + expect(() => resolveDialectNotesForConnection('mb', conn('metabase'))).toThrow(/not a SQL warehouse/); + }); + + it('rejects an unconfigured connection', () => { + expect(() => resolveDialectNotesForConnection('missing', undefined)).toThrow(KtxExpectedError); + expect(() => resolveDialectNotesForConnection('missing', undefined)).toThrow(/not configured/); + }); +}); diff --git a/packages/cli/test/context/mcp/local-project-ports.test.ts b/packages/cli/test/context/mcp/local-project-ports.test.ts index d4484775..a20445ff 100644 --- a/packages/cli/test/context/mcp/local-project-ports.test.ts +++ b/packages/cli/test/context/mcp/local-project-ports.test.ts @@ -178,6 +178,7 @@ describe('createLocalProjectMcpContextPorts', () => { expect(Object.keys(ports).sort()).toEqual([ 'connections', + 'dialectNotes', 'dictionarySearch', 'discover', 'entityDetails', @@ -187,6 +188,7 @@ describe('createLocalProjectMcpContextPorts', () => { expect(Object.keys(ports.connections ?? {}).sort()).toEqual(['list']); expect(Object.keys(ports.knowledge ?? {}).sort()).toEqual(['read', 'search']); expect(Object.keys(ports.semanticLayer ?? 
{}).sort()).toEqual(['query', 'readSource']); + expect(Object.keys(ports.dialectNotes ?? {}).sort()).toEqual(['read']); await expect(ports.connections?.list()).resolves.toEqual([ { id: 'warehouse', name: 'warehouse', connectionType: 'POSTGRESQL' }, ]); @@ -803,6 +805,47 @@ describe('createLocalProjectMcpContextPorts', () => { expect(search?.results[0]?.score).toBeGreaterThan(0); }); + it('scopes wiki_search to a connection and validates the connection id', async () => { + const project = await initKtxProject({ projectDir: tempDir }); + project.config.connections.sales_db = { driver: 'sqlite', url: 'file:sales.db' }; + project.config.connections.events_db = { driver: 'sqlite', url: 'file:events.db' }; + const seed = async (key: string, connections: string[]) => { + await project.fileStore.writeFile( + `wiki/global/${key}.md`, + [ + '---', + `summary: Orders for ${key}`, + 'usage_mode: auto', + ...(connections.length > 0 ? ['connections:', ...connections.map((id) => ` - ${id}`)] : []), + '---', + '', + 'Orders are recognized when paid.', + '', + ].join('\n'), + 'ktx', + 'ktx@example.com', + `seed ${key}`, + ); + }; + await seed('orders-sales', ['sales_db']); + await seed('orders-events', ['events_db']); + await seed('orders-global', []); + + const ports = createLocalProjectMcpContextPorts(project, { embeddingService: null }); + + const scoped = await ports.knowledge?.search({ + userId: 'local-user', + query: 'orders paid', + limit: 10, + connectionId: 'sales_db', + }); + expect(scoped?.results.map((result) => result.key).sort()).toEqual(['orders-global', 'orders-sales']); + + await expect( + ports.knowledge?.search({ userId: 'local-user', query: 'orders', limit: 10, connectionId: 'warehouse' }), + ).rejects.toThrow('Unknown connection "warehouse". 
Configured connections: events_db, sales_db.'); + }); + it('reads seeded semantic-layer sources', async () => { const project = await initKtxProject({ projectDir: tempDir }); await seedSlSourceFile(project, { diff --git a/packages/cli/test/context/mcp/logger.test.ts b/packages/cli/test/context/mcp/logger.test.ts new file mode 100644 index 00000000..068de0cc --- /dev/null +++ b/packages/cli/test/context/mcp/logger.test.ts @@ -0,0 +1,99 @@ +import { afterEach, describe, expect, it, vi } from 'vitest'; +import { createMcpLogger, mcpLogLevel, mcpSlowToolMs, serializeMcpError } from '../../../src/context/mcp/logger.js'; + +function capturingIo() { + let buf = ''; + return { + io: { stdout: { write() {} }, stderr: { write(chunk: string) { buf += chunk; } } }, + text: () => buf, + json: () => + buf + .split('\n') + .filter((line) => line.trim().startsWith('{')) + .map((line) => JSON.parse(line) as Record<string, unknown>), + }; +} + +describe('mcpLogLevel', () => { + it('defaults to info when unset', () => { + expect(mcpLogLevel({})).toBe('info'); + }); + + it('accepts a recognized pino level', () => { + expect(mcpLogLevel({ KTX_MCP_LOG_LEVEL: 'debug' })).toBe('debug'); + expect(mcpLogLevel({ KTX_MCP_LOG_LEVEL: 'WARN' })).toBe('warn'); + }); + + it('falls back to info for an unrecognized value', () => { + expect(mcpLogLevel({ KTX_MCP_LOG_LEVEL: 'loud' })).toBe('info'); + }); +}); + +describe('mcpSlowToolMs', () => { + it('defaults to 10000ms', () => { + expect(mcpSlowToolMs({})).toBe(10_000); + }); + + it('parses a numeric override', () => { + expect(mcpSlowToolMs({ KTX_MCP_SLOW_TOOL_MS: '250' })).toBe(250); + }); + + it('ignores a non-numeric or negative value', () => { + expect(mcpSlowToolMs({ KTX_MCP_SLOW_TOOL_MS: 'soon' })).toBe(10_000); + expect(mcpSlowToolMs({ KTX_MCP_SLOW_TOOL_MS: '-5' })).toBe(10_000); + }); +}); + +describe('serializeMcpError', () => { + it('serializes an Error with type, message, and stack', () => { + const out = serializeMcpError(new TypeError('boom')); +
expect(out.type).toBe('TypeError'); + expect(out.message).toBe('boom'); + expect(typeof out.stack).toBe('string'); + }); + + it('reduces a non-error to a message (no synthetic stack)', () => { + expect(serializeMcpError('plain text')).toEqual({ message: 'plain text' }); + }); +}); + +describe('createMcpLogger', () => { + afterEach(() => { + vi.unstubAllEnvs(); + }); + + it('writes structured JSON lines through io.stderr when not a TTY', () => { + const cap = capturingIo(); + const logger = createMcpLogger(cap.io, { isTTY: false }); + logger.info({ tool: 'sql_execution', callId: 'abc' }, 'tool.start'); + + const [line] = cap.json(); + expect(line.msg).toBe('tool.start'); + expect(line.tool).toBe('sql_execution'); + expect(line.callId).toBe('abc'); + expect(typeof line.time).toBe('number'); + expect(line.level).toBe(30); + }); + + it('writes human-readable (non-JSON) output for a TTY', () => { + const cap = capturingIo(); + const logger = createMcpLogger(cap.io, { isTTY: true }); + logger.info({ tool: 'sql_execution' }, 'tool.start'); + + expect(cap.text()).toContain('tool.start'); + // pino-pretty output is not a JSON line. 
+ expect(cap.text().trim().startsWith('{')).toBe(false); + }); + + it('honors KTX_MCP_LOG_LEVEL by suppressing below-threshold lines', () => { + vi.stubEnv('KTX_MCP_LOG_LEVEL', 'warn'); + const cap = capturingIo(); + const logger = createMcpLogger(cap.io, { isTTY: false }); + logger.info({}, 'routine'); + logger.warn({}, 'slow'); + + const messages = cap.json().map((line) => line.msg); + expect(messages).not.toContain('routine'); + expect(messages).toContain('slow'); + }); +}); diff --git a/packages/cli/test/context/mcp/server.test.ts b/packages/cli/test/context/mcp/server.test.ts index 96a4bd55..6a7e756a 100644 --- a/packages/cli/test/context/mcp/server.test.ts +++ b/packages/cli/test/context/mcp/server.test.ts @@ -4,14 +4,17 @@ import { join } from 'node:path'; import { Client } from '@modelcontextprotocol/sdk/client/index.js'; import { InMemoryTransport } from '@modelcontextprotocol/sdk/inMemory.js'; import { afterEach, describe, expect, it, vi } from 'vitest'; +import { KtxQueryError } from '../../../src/errors.js'; import { createLocalProjectMemoryIngest } from '../../../src/context/memory/local-memory.js'; import { detectCaptureSignals } from '../../../src/context/memory/capture-signals.js'; import type { MemoryAgentInput } from '../../../src/context/memory/types.js'; import { parseKtxProjectConfig, serializeKtxProjectConfig } from '../../../src/context/project/config.js'; import { initKtxProject } from '../../../src/context/project/project.js'; import { jsonToolResult } from '../../../src/context/mcp/context-tools.js'; +import { createMcpLogger } from '../../../src/context/mcp/logger.js'; import { createDefaultKtxMcpServer, createKtxMcpServer } from '../../../src/context/mcp/server.js'; import type { + KtxDialectNotesMcpPort, KtxDiscoverDataMcpPort, KtxDictionarySearchMcpPort, KtxEntityDetailsMcpPort, @@ -84,6 +87,7 @@ const retainedToolNames = [ 'memory_ingest_status', 'sl_query', 'sl_read_source', + 'sql_dialect_notes', 'sql_execution', 'wiki_read', 
'wiki_search', @@ -136,6 +140,13 @@ function makeAllContextTools(): KtxMcpContextPorts { rowCount: 1, }), }, + dialectNotes: { + read: vi.fn().mockResolvedValue({ + connectionId: 'warehouse', + dialect: 'postgres', + notes: '**postgres** SQL conventions', + }), + }, memoryIngest: { ingest: vi.fn().mockResolvedValue({ runId: 'run-1' }), status: vi.fn().mockResolvedValue({ @@ -203,6 +214,12 @@ describe('createKtxMcpServer', () => { }, sl_query: { title: 'Semantic Layer Query', readOnlyHint: true, openWorldHint: false }, sql_execution: { title: 'SQL Execution', readOnlyHint: true, openWorldHint: false }, + sql_dialect_notes: { + title: 'SQL Dialect Notes', + readOnlyHint: true, + idempotentHint: true, + openWorldHint: false, + }, memory_ingest: { title: 'Memory Ingest', destructiveHint: true, openWorldHint: false }, memory_ingest_status: { title: 'Memory Ingest Status', readOnlyHint: true, openWorldHint: false }, }; @@ -219,6 +236,22 @@ describe('createKtxMcpServer', () => { } }); + it('routes sql_dialect_notes through the dialect-notes port', async () => { + const fake = makeFakeServer(); + const contextTools = makeAllContextTools(); + createKtxMcpServer({ + server: fake.server, + userContext: { userId: 'mcp-user' }, + contextTools, + }); + + const result = await getTool(fake.tools, 'sql_dialect_notes').handler({ connectionId: 'warehouse' }); + expect(contextTools.dialectNotes!.read).toHaveBeenCalledWith({ connectionId: 'warehouse' }); + expect(result).toMatchObject({ + structuredContent: { connectionId: 'warehouse', dialect: 'postgres' }, + }); + }); + it('exposes annotations and output schemas through the SDK tools/list response', async () => { const result = await listToolsThroughSdk(makeAllContextTools()); const toolNames = result.tools.map((tool) => tool.name).sort(); @@ -1332,3 +1365,179 @@ describe('createKtxMcpServer', () => { } }); }); + +describe('MCP tool-call logging', () => { + afterEach(() => { + vi.unstubAllEnvs(); + vi.restoreAllMocks(); + }); + + 
function loggerCapture() { + let buf = ''; + const io = { stdout: { write() {} }, stderr: { write(chunk: string) { buf += chunk; } } }; + return { + io, + logger: createMcpLogger(io, { isTTY: false }), + text: () => buf, + lines: () => + buf + .split('\n') + .filter((line) => line.trim().startsWith('{')) + .map((line) => JSON.parse(line) as Record<string, unknown>), + }; + } + + it('logs tool.start before the handler runs and a matching tool.end on completion', async () => { + const cap = loggerCapture(); + const fake = makeFakeServer(); + createKtxMcpServer({ + server: fake.server, + userContext: { userId: 'local' }, + logger: cap.logger, + contextTools: { + sqlExecution: { + execute: vi + .fn() + .mockResolvedValue({ headers: ['count'], rows: [[1]], rowCount: 1 }), + }, + }, + }); + + await getTool(fake.tools, 'sql_execution').handler({ connectionId: 'warehouse', sql: 'select 1' }); + + const lines = cap.lines(); + const start = lines.find((line) => line.msg === 'tool.start'); + const end = lines.find((line) => line.msg === 'tool.end'); + expect(start).toMatchObject({ + tool: 'sql_execution', + params: { connectionId: 'warehouse', sql: 'select 1' }, + level: 30, + }); + expect(typeof start?.callId).toBe('string'); + expect(end).toMatchObject({ tool: 'sql_execution', callId: start?.callId, outcome: 'ok', level: 30 }); + expect(typeof end?.durationMs).toBe('number'); + expect(end?.resultSize as number).toBeGreaterThan(0); + }); + + it('leaves a tool.start carrying the SQL with no matching tool.end when a handler never returns', () => { + const cap = loggerCapture(); + const fake = makeFakeServer(); + createKtxMcpServer({ + server: fake.server, + userContext: { userId: 'local' }, + logger: cap.logger, + contextTools: { + sqlExecution: { execute: () => new Promise(() => {}) }, + }, + }); + + void getTool(fake.tools, 'sql_execution').handler({ connectionId: 'warehouse', sql: 'select pg_sleep(99999)' }); + + const lines = cap.lines(); + const start = lines.find((line) => line.msg ===
'tool.start'); + expect(start).toMatchObject({ tool: 'sql_execution', params: { sql: 'select pg_sleep(99999)' } }); + expect(lines.some((line) => line.msg === 'tool.end' && line.callId === start?.callId)).toBe(false); + }); + + it('emits tool.end at warn when a completed call exceeds the slow threshold', async () => { + vi.stubEnv('KTX_MCP_SLOW_TOOL_MS', '0'); + const cap = loggerCapture(); + const fake = makeFakeServer(); + createKtxMcpServer({ + server: fake.server, + userContext: { userId: 'local' }, + logger: cap.logger, + contextTools: { + sqlExecution: { + execute: async () => { + await new Promise((resolve) => setTimeout(resolve, 5)); + return { headers: ['count'], rows: [[1]], rowCount: 1 }; + }, + }, + }, + }); + + await getTool(fake.tools, 'sql_execution').handler({ connectionId: 'warehouse', sql: 'select 1' }); + + const end = cap.lines().find((line) => line.msg === 'tool.end'); + expect(end).toMatchObject({ outcome: 'ok', level: 40 }); + }); + + it('logs a matched tool.start/tool.end(error) pair carrying the deadline message when a query times out', async () => { + const cap = loggerCapture(); + const fake = makeFakeServer(); + createKtxMcpServer({ + server: fake.server, + userContext: { userId: 'local' }, + logger: cap.logger, + contextTools: { + sqlExecution: { + execute: vi.fn().mockRejectedValue(new KtxQueryError('query exceeded 30s')), + }, + }, + }); + + await getTool(fake.tools, 'sql_execution').handler({ + connectionId: 'warehouse', + sql: 'select min(time_id), max(time_id), count(*) from profits', + }); + + const lines = cap.lines(); + const start = lines.find((line) => line.msg === 'tool.start'); + const end = lines.find((line) => line.msg === 'tool.end'); + expect(typeof start?.callId).toBe('string'); + expect(end).toMatchObject({ tool: 'sql_execution', callId: start?.callId, outcome: 'error', level: 50 }); + expect((end?.err as { message?: string }).message).toBe('query exceeded 30s'); + // No unmatched tool.start remains — the matched pair 
closes spec 15's hang gap for this case. + expect(lines.filter((line) => line.msg === 'tool.start')).toHaveLength(1); + expect(lines.filter((line) => line.msg === 'tool.end' && line.callId === start?.callId)).toHaveLength(1); + expect(end?.durationMs as number).toBeGreaterThan(0); + }); + + it('suppresses routine tool traffic at warn level but keeps errored calls', async () => { + vi.stubEnv('KTX_MCP_LOG_LEVEL', 'warn'); + const cap = loggerCapture(); + const fake = makeFakeServer(); + createKtxMcpServer({ + server: fake.server, + userContext: { userId: 'local' }, + logger: cap.logger, + contextTools: { + knowledge: { + search: vi.fn().mockRejectedValue(new Error('wiki index unavailable')), + read: vi.fn().mockResolvedValue(null), + }, + }, + }); + + await getTool(fake.tools, 'wiki_search').handler({ query: 'revenue', limit: 5 }); + + const lines = cap.lines(); + expect(lines.some((line) => line.msg === 'tool.start')).toBe(false); + const end = lines.find((line) => line.msg === 'tool.end'); + expect(end).toMatchObject({ outcome: 'error', level: 50 }); + expect((end?.err as { message?: string }).message).toContain('wiki index unavailable'); + }); + + it('does not log tool calls when no logger is provided', async () => { + const fake = makeFakeServer(); + const io = makeIo(false); + createKtxMcpServer({ + server: fake.server, + userContext: { userId: 'local' }, + io, + contextTools: { + sqlExecution: { + execute: vi + .fn() + .mockResolvedValue({ headers: ['count'], rows: [[1]], rowCount: 1 }), + }, + }, + }); + + await getTool(fake.tools, 'sql_execution').handler({ connectionId: 'warehouse', sql: 'select 1' }); + + expect(io.stderrText()).not.toContain('tool.start'); + expect(io.stderrText()).not.toContain('tool.end'); + }); +}); diff --git a/packages/cli/test/context/project/config.test.ts b/packages/cli/test/context/project/config.test.ts index 47cbee2a..cd4be37e 100644 --- a/packages/cli/test/context/project/config.test.ts +++ 
b/packages/cli/test/context/project/config.test.ts @@ -86,6 +86,7 @@ connections: profileSampleRows: 10000, profileConcurrency: 4, validationConcurrency: 4, + detectionBudgetMs: 600000, }, }, }); @@ -427,6 +428,7 @@ scan: profileConcurrency: 3 validationConcurrency: 2 validationBudget: 0 + detectionBudgetMs: 120000 `); expect(config.scan.relationships).toEqual({ @@ -441,6 +443,7 @@ scan: profileConcurrency: 3, validationConcurrency: 2, validationBudget: 0, + detectionBudgetMs: 120000, }); expect(serializeKtxProjectConfig(config)).toContain('enabled: false'); expect(serializeKtxProjectConfig(config)).toContain('llmProposals: false'); @@ -453,6 +456,25 @@ scan: expect(serializeKtxProjectConfig(config)).toContain('profileConcurrency: 3'); expect(serializeKtxProjectConfig(config)).toContain('validationConcurrency: 2'); expect(serializeKtxProjectConfig(config)).toContain('validationBudget: 0'); + expect(serializeKtxProjectConfig(config)).toContain('detectionBudgetMs: 120000'); + }); + + it('defaults the relationship detection budget to ten minutes', () => { + expect(buildDefaultKtxProjectConfig().scan.relationships.detectionBudgetMs).toBe(600000); + }); + + it('rejects a non-positive or non-integer relationship detection budget', () => { + for (const value of ['0', '-1', '1.5']) { + const yaml = ` +scan: + relationships: + detectionBudgetMs: ${value} +`; + expect(() => parseKtxProjectConfig(yaml)).toThrow(/scan\.relationships\.detectionBudgetMs/); + const validation = validateKtxProjectConfig(yaml); + expect(validation.ok).toBe(false); + expect(validation.issues.map((issue) => issue.path)).toContain('scan.relationships.detectionBudgetMs'); + } }); it('parses the scan relationship validation budget sentinel', () => { diff --git a/packages/cli/test/context/project/setup-config.test.ts b/packages/cli/test/context/project/setup-config.test.ts index f3c2553b..895f2d07 100644 --- a/packages/cli/test/context/project/setup-config.test.ts +++ 
b/packages/cli/test/context/project/setup-config.test.ts @@ -49,10 +49,10 @@ describe('ktx setup config helpers', () => { it('merges setup-local gitignore entries without removing existing lines', () => { expect(mergeKtxSetupGitignoreEntries('cache/\ndb.sqlite\n')).toBe( - ['cache/', 'db.sqlite', 'db.sqlite-*', 'ingest-transcripts/', 'secrets/', 'setup/', 'agents/', ''].join('\n'), + ['cache/', 'db.sqlite', 'db.sqlite-*', 'ingest-transcripts/', 'logs/', 'secrets/', 'setup/', 'agents/', ''].join('\n'), ); expect(mergeKtxSetupGitignoreEntries('cache/\nsecrets/\n')).toBe( - ['cache/', 'secrets/', 'db.sqlite', 'db.sqlite-*', 'ingest-transcripts/', 'setup/', 'agents/', ''].join('\n'), + ['cache/', 'secrets/', 'db.sqlite', 'db.sqlite-*', 'ingest-transcripts/', 'logs/', 'setup/', 'agents/', ''].join('\n'), ); }); }); diff --git a/packages/cli/test/context/scan/description-generation.test.ts b/packages/cli/test/context/scan/description-generation.test.ts index 9925f857..1bfa42a6 100644 --- a/packages/cli/test/context/scan/description-generation.test.ts +++ b/packages/cli/test/context/scan/description-generation.test.ts @@ -1,4 +1,8 @@ -import { describe, expect, it, vi } from 'vitest'; +import { afterEach, beforeEach, describe, expect, it, vi } from 'vitest'; +import { type ChildProcess } from 'node:child_process'; +import { mkdtempSync, rmSync } from 'node:fs'; +import { tmpdir } from 'node:os'; +import { join } from 'node:path'; vi.mock('ai', async (importOriginal) => { const actual = await importOriginal(); @@ -14,6 +18,7 @@ import { KtxDescriptionGenerator, } from '../../../src/context/scan/description-generation.js'; import { createKtxConnectorCapabilities, type KtxScanConnector } from '../../../src/context/scan/types.js'; +import { HANGING_CHILD, killTestChildren, spawnTestChild } from '../llm/subprocess-test-children.test-utils.js'; function createCache(initial: Record = {}): KtxDescriptionCachePort { const data = new Map(Object.entries(initial)); @@ -41,6 +46,7 @@ 
function createLlmProvider(text = 'generated description') { }), generateObject: vi.fn(), runAgentLoop: vi.fn(), + subprocessForkSpec: () => null, } as any; } @@ -57,6 +63,7 @@ function createFailingLlmProvider(message = 'timeout exceeded when trying to con }), generateObject: vi.fn(), runAgentLoop: vi.fn(), + subprocessForkSpec: () => null, } as any; } @@ -492,7 +499,8 @@ describe('KtxDescriptionGenerator', () => { expect(result.tableDescription).toBeNull(); expect(Object.fromEntries(result.columnDescriptions)).toEqual({ status: null }); expect(warnings).toContain('enrichment_failed'); - expect(llmRuntime.generateObject).toHaveBeenCalledTimes(1); + // A transient (non-timeout) failure retries up to the attempt limit (default 3). + expect(llmRuntime.generateObject).toHaveBeenCalledTimes(3); expect(llmRuntime.generateText).not.toHaveBeenCalled(); }); }); @@ -684,6 +692,41 @@ describe('KtxDescriptionGenerator resilience', () => { expect(warnings).toEqual([]); }); + it('propagates a genuine context abort during the batched LLM call instead of degrading to null', async () => { + const controller = new AbortController(); + const llmRuntime = createLlmProvider('unused'); + llmRuntime.generateObject = vi.fn(async () => { + controller.abort(); + throw new Error('The operation was aborted'); + }); + const warnings: string[] = []; + const generator = new KtxDescriptionGenerator({ + llmRuntime, + onWarning: (warning) => warnings.push(warning.code), + settings: { columnMaxWords: 12, tableMaxWords: 18, dataSourceMaxWords: 24 }, + }); + + await expect( + generator.generateBatchedTableDescriptions({ + connectionId: 'conn-1', + connector: createConnector(), + context: { runId: 'run-1', signal: controller.signal }, + dataSourceType: 'POSTGRESQL', + supportsNestedAnalysis: false, + table: { + catalog: null, + db: 'public', + name: 'orders', + rawDescriptions: {}, + columns: [{ name: 'status', type: 'text' }], + }, + }), + ).rejects.toThrow(); + + // A genuine cancellation must not 
be filed as a per-table failure/timeout. + expect(warnings).toEqual([]); + }); + it('generates column descriptions from rawDescriptions when sampleColumn is unavailable', async () => { const samplerWithoutColumn: KtxScanConnector = { ...createConnector(), @@ -782,3 +825,89 @@ describe('KtxDescriptionGenerator resilience', () => { expect(generateText).not.toHaveBeenCalled(); }); }); + +describe('KtxDescriptionGenerator subprocess kill boundary', () => { + const children: ChildProcess[] = []; + let workDir: string; + let priorTimeout: string | undefined; + + beforeEach(() => { + workDir = mkdtempSync(join(tmpdir(), 'ktx-enrich-')); + priorTimeout = process.env.KTX_ENRICH_LLM_TIMEOUT_MS; + process.env.KTX_ENRICH_LLM_TIMEOUT_MS = '300'; + }); + + afterEach(() => { + killTestChildren(children); + children.length = 0; + if (priorTimeout === undefined) { + delete process.env.KTX_ENRICH_LLM_TIMEOUT_MS; + } else { + process.env.KTX_ENRICH_LLM_TIMEOUT_MS = priorTimeout; + } + rmSync(workDir, { recursive: true, force: true }); + }); + + it('skips a wedged subprocess-backed table with enrichment_timeout and settles within deadline+grace', async () => { + const pidFile = join(workDir, 'gc.pid'); + const llmRuntime = createLlmProvider('unused'); + llmRuntime.subprocessForkSpec = () => ({ backend: 'codex', projectDir: '/tmp', modelSlots: { default: 'codex' } }); + const warnings: string[] = []; + const generator = new KtxDescriptionGenerator({ + llmRuntime, + onWarning: (warning) => warnings.push(warning.code), + settings: { columnMaxWords: 12, tableMaxWords: 18, dataSourceMaxWords: 24 }, + spawnSubprocessGenerateChild: () => spawnTestChild(children, HANGING_CHILD, { KTX_TEST_GC_PID_FILE: pidFile }), + }); + + const start = Date.now(); + const result = await generator.generateBatchedTableDescriptions({ + connectionId: 'conn-1', + connector: createConnector(), + context: { runId: 'run-1' }, + dataSourceType: 'POSTGRESQL', + supportsNestedAnalysis: false, + table: { catalog: null, 
db: 'public', name: 'orders', columns: [{ name: 'status', type: 'text' }] }, + }); + + expect(Date.now() - start).toBeLessThan(5000); + expect(result.tableDescription).toBeNull(); + expect(Object.fromEntries(result.columnDescriptions)).toEqual({ status: null }); + expect(warnings).toContain('enrichment_timeout'); + // One wedge = one timeout: the hung table is not retried. + expect(children).toHaveLength(1); + const child = children[0]!; + await vi.waitFor(() => expect(child.exitCode !== null || child.signalCode !== null).toBe(true), { timeout: 5000 }); + }); + + it('runs HTTP-backed enrichment in-process without spawning a child', async () => { + const spawnSpy = vi.fn(() => { + throw new Error('HTTP backend must not spawn a kill-boundary child'); + }); + const llmRuntime = createLlmProvider('unused'); + llmRuntime.subprocessForkSpec = () => null; + llmRuntime.generateObject = vi.fn(async () => ({ + tableDescription: 'Orders fact table', + columns: [{ name: 'status', description: 'Order lifecycle status' }], + })); + const generator = new KtxDescriptionGenerator({ + llmRuntime, + settings: { columnMaxWords: 12, tableMaxWords: 18, dataSourceMaxWords: 24 }, + spawnSubprocessGenerateChild: spawnSpy, + }); + + const result = await generator.generateBatchedTableDescriptions({ + connectionId: 'conn-1', + connector: createConnector(), + context: { runId: 'run-1' }, + dataSourceType: 'POSTGRESQL', + supportsNestedAnalysis: false, + table: { catalog: null, db: 'public', name: 'orders', columns: [{ name: 'status', type: 'text' }] }, + }); + + expect(spawnSpy).not.toHaveBeenCalled(); + expect(llmRuntime.generateObject).toHaveBeenCalledTimes(1); + expect(result.tableDescription).toBe('Orders fact table'); + expect(Object.fromEntries(result.columnDescriptions)).toEqual({ status: 'Order lifecycle status' }); + }); +}); diff --git a/packages/cli/test/context/scan/description-resume.test.ts b/packages/cli/test/context/scan/description-resume.test.ts new file mode 100644 index 
00000000..1380a721 --- /dev/null +++ b/packages/cli/test/context/scan/description-resume.test.ts @@ -0,0 +1,264 @@ +import { mkdtemp, rm } from 'node:fs/promises'; +import { tmpdir } from 'node:os'; +import { join } from 'node:path'; +import { afterEach, beforeEach, describe, expect, it, vi } from 'vitest'; +import YAML from 'yaml'; +import type { KtxLlmRuntimePort } from '../../../src/context/llm/runtime-port.js'; +import { buildDefaultKtxProjectConfig, type KtxScanRelationshipConfig } from '../../../src/context/project/config.js'; +import { initKtxProject, type KtxLocalProject } from '../../../src/context/project/project.js'; +import { + createKtxScanDescriptionResumeStore, + writeLocalScanManifestShards, +} from '../../../src/context/scan/local-enrichment-artifacts.js'; +import { runLocalScanEnrichment, type KtxLocalScanEnrichmentResult } from '../../../src/context/scan/local-enrichment.js'; +import { SqliteLocalScanEnrichmentStateStore } from '../../../src/context/scan/sqlite-local-enrichment-state-store.js'; +import { createKtxConnectorCapabilities, type KtxScanConnector, type KtxSchemaSnapshot } from '../../../src/context/scan/types.js'; + +const PROGRESS_PATH = 'raw-sources/warehouse/live-database/enrichment-progress/descriptions.json'; +const SHARD_PATH = 'semantic-layer/warehouse/_schema/public.yaml'; + +function column(name: string) { + return { + name, + nativeType: 'integer', + normalizedType: 'integer' as const, + dimensionType: 'number' as const, + nullable: false, + primaryKey: name === 'id', + comment: null, + }; +} + +function table(name: string) { + return { + catalog: null, + db: 'public', + name, + kind: 'table' as const, + comment: null, + estimatedRows: 1, + foreignKeys: [], + columns: [column('id'), column('value')], + }; +} + +const snapshot: KtxSchemaSnapshot = { + connectionId: 'warehouse', + driver: 'postgres', + extractedAt: '2026-04-29T12:00:00.000Z', + scope: { schemas: ['public'] }, + metadata: {}, + tables: [table('customers'), 
table('orders'), table('products')], +}; + +function connector(): KtxScanConnector { + return { + id: 'test:warehouse', + driver: 'postgres', + capabilities: createKtxConnectorCapabilities({ tableSampling: true, columnSampling: true }), + introspect: vi.fn(async () => snapshot), + listSchemas: vi.fn(async () => []), + listTables: vi.fn(async () => []), + sampleTable: vi.fn(async () => ({ headers: ['id', 'value'], rows: [[1, 2]], totalRows: 1 })), + sampleColumn: vi.fn(async () => ({ values: ['1', '2'], nullCount: 0, distinctCount: 2 })), + }; +} + +function countingRuntime() { + let calls = 0; + const runtime: KtxLlmRuntimePort = { + generateText: vi.fn(async () => 'AI column description'), + generateObject: vi.fn(async () => { + calls += 1; + return { tableDescription: 'AI table description', columns: [] }; + }) as KtxLlmRuntimePort['generateObject'], + runAgentLoop: vi.fn(), + subprocessForkSpec: () => null, + }; + return { runtime, calls: () => calls }; +} + +function relationshipsDisabled(): KtxScanRelationshipConfig { + return { ...buildDefaultKtxProjectConfig().scan.relationships, enabled: false }; +} + +describe('descriptions stage incremental persistence + resume', () => { + let tempDir: string; + let project: KtxLocalProject; + + async function runEnrichment(runId: string): Promise<{ result: KtxLocalScanEnrichmentResult; calls: number }> { + const llm = countingRuntime(); + const result = await runLocalScanEnrichment({ + connectionId: 'warehouse', + mode: 'enriched', + connector: connector(), + snapshot, + context: { runId }, + providers: { llmRuntime: llm.runtime, embedding: null }, + descriptionResumeStore: createKtxScanDescriptionResumeStore({ + project, + connectionId: 'warehouse', + syncId: 'sync-1', + driver: 'postgres', + }), + syncId: 'sync-1', + relationshipSettings: relationshipsDisabled(), + }); + return { result, calls: llm.calls() }; + } + + async function readProgress(): Promise<{ inputHash: string; descriptions: Array<{ table: { name: string 
} }> }> { + return JSON.parse((await project.fileStore.readFile(PROGRESS_PATH)).content); + } + + async function writeProgress(record: unknown): Promise<void> { + await project.fileStore.writeFile(PROGRESS_PATH, `${JSON.stringify(record, null, 2)}\n`, 'ktx', 'ktx@example.com', 'edit'); + } + + beforeEach(async () => { + tempDir = await mkdtemp(join(tmpdir(), 'ktx-desc-resume-')); + project = await initKtxProject({ projectDir: join(tempDir, 'project') }); + }); + + afterEach(async () => { + await rm(tempDir, { recursive: true, force: true }); + }); + + it('flushes durable descriptions + ai manifest descriptions on a fresh run', async () => { + const { calls } = await runEnrichment('run-1'); + expect(calls).toBe(3); + + const progress = await readProgress(); + expect(progress.descriptions.map((entry) => entry.table.name).sort()).toEqual(['customers', 'orders', 'products']); + + const shard = YAML.parse((await project.fileStore.readFile(SHARD_PATH)).content) as { + tables: Record; + }; + expect(shard.tables.customers?.descriptions?.ai).toBe('AI table description'); + expect(shard.tables.products?.descriptions?.ai).toBe('AI table description'); + }); + + it('re-issues no LLM calls when every table is already enriched (matching inputHash)', async () => { + await runEnrichment('run-1'); + const { result, calls } = await runEnrichment('run-2'); + + expect(calls).toBe(0); + expect(result.descriptionUpdates).toHaveLength(3); + expect(result.descriptionUpdates.every((update) => update.tableDescription === 'AI table description')).toBe(true); + }); + + it('re-enriches only the tables missing from the durable record', async () => { + await runEnrichment('run-1'); + const progress = await readProgress(); + progress.descriptions = progress.descriptions.filter((entry) => entry.table.name !== 'orders'); + await writeProgress(progress); + + const { result, calls } = await runEnrichment('run-2'); + + expect(calls).toBe(1); + expect(result.descriptionUpdates.map((update) => 
update.table.name).sort()).toEqual([ + 'customers', + 'orders', + 'products', + ]); + }); + + it('recomputes the whole stage when the durable record inputHash differs', async () => { + await runEnrichment('run-1'); + const progress = await readProgress(); + await writeProgress({ ...progress, inputHash: 'stale-input-hash' }); + + const { calls } = await runEnrichment('run-2'); + expect(calls).toBe(3); + }); + + it('persists the other tables and completes the stage when one table fails', async () => { + const stateStore = new SqliteLocalScanEnrichmentStateStore({ dbPath: join(tempDir, 'state.sqlite') }); + let calls = 0; + const runtime: KtxLlmRuntimePort = { + generateText: vi.fn(async () => 'AI column description'), + generateObject: vi.fn(async (input: { prompt: string }) => { + calls += 1; + if (input.prompt.includes('orders')) { + throw new Error('backend overloaded'); + } + return { tableDescription: 'AI table description', columns: [] }; + }) as KtxLlmRuntimePort['generateObject'], + runAgentLoop: vi.fn(), + subprocessForkSpec: () => null, + }; + + const result = await runLocalScanEnrichment({ + connectionId: 'warehouse', + mode: 'enriched', + connector: connector(), + snapshot, + context: { runId: 'run-skip' }, + providers: { llmRuntime: runtime, embedding: null }, + descriptionResumeStore: createKtxScanDescriptionResumeStore({ + project, + connectionId: 'warehouse', + syncId: 'sync-1', + driver: 'postgres', + }), + stateStore, + syncId: 'sync-1', + relationshipSettings: relationshipsDisabled(), + }); + + // orders retries to the attempt limit (3) then fails; customers + products succeed once each. + expect(calls).toBe(5); + // The failed table is a single missing description, not the whole stage's loss. 
+ const byName = new Map(result.descriptionUpdates.map((update) => [update.table.name, update])); + expect(byName.get('orders')?.tableDescription).toBeNull(); + expect(byName.get('customers')?.tableDescription).toBe('AI table description'); + expect(byName.get('products')?.tableDescription).toBe('AI table description'); + + // The stage completed (a completed row exists, not zero). + const stages = await stateStore.listRunStages('run-skip'); + expect(stages.some((stage) => stage.stage === 'descriptions' && stage.status === 'completed')).toBe(true); + + // The good tables are durable: progress record + ai: in the manifest; the failed one is absent. + const progress = await readProgress(); + expect(progress.descriptions.map((entry) => entry.table.name).sort()).toEqual(['customers', 'products']); + const shard = YAML.parse((await project.fileStore.readFile(SHARD_PATH)).content) as { + tables: Record; + }; + expect(shard.tables.customers?.descriptions?.ai).toBe('AI table description'); + expect(shard.tables.orders?.descriptions?.ai).toBeUndefined(); + }); + + it('rewrites only the manifest shards that gained a changed table', async () => { + const multiDb: KtxSchemaSnapshot = { + ...snapshot, + tables: [ + { ...table('customers'), db: 'sales' }, + { ...table('orders'), db: 'ops' }, + ], + }; + await writeLocalScanManifestShards({ + project, + connectionId: 'warehouse', + syncId: 'sync-1', + driver: 'postgres', + snapshot: multiDb, + dryRun: false, + }); + + const flushed = await writeLocalScanManifestShards({ + project, + connectionId: 'warehouse', + syncId: 'sync-1', + driver: 'postgres', + snapshot: multiDb, + dryRun: false, + descriptionUpdates: [ + { table: { catalog: null, db: 'sales', name: 'customers' }, tableDescription: 'desc', columnDescriptions: {} }, + ], + onlyChangedTableNames: new Set(['customers']), + }); + + expect(flushed.manifestShards).toHaveLength(1); + expect(flushed.manifestShards[0]).toContain('sales'); + }); +}); diff --git 
a/packages/cli/test/context/scan/enabled-tables.test.ts b/packages/cli/test/context/scan/enabled-tables.test.ts new file mode 100644 index 00000000..0db08c93 --- /dev/null +++ b/packages/cli/test/context/scan/enabled-tables.test.ts @@ -0,0 +1,24 @@ +import { describe, expect, it } from 'vitest'; +import { resolveEnabledTables } from '../../../src/context/scan/enabled-tables.js'; +import { tableRefKey } from '../../../src/context/scan/table-ref.js'; + +describe('resolveEnabledTables', () => { + it('returns null when enabled_tables is absent or empty', () => { + expect(resolveEnabledTables(undefined)).toBeNull(); + expect(resolveEnabledTables({ driver: 'sqlite' })).toBeNull(); + expect(resolveEnabledTables({ driver: 'sqlite', enabled_tables: [] })).toBeNull(); + }); + + it('treats sqlite "main." as equivalent to the bare ""', () => { + const qualified = resolveEnabledTables({ driver: 'sqlite', enabled_tables: ['main.customers'] }); + const bare = resolveEnabledTables({ driver: 'sqlite', enabled_tables: ['customers'] }); + const expected = tableRefKey({ catalog: null, db: null, name: 'customers' }); + expect([...(qualified ?? [])]).toEqual([expected]); + expect([...(bare ?? [])]).toEqual([expected]); + }); + + it('keeps the schema qualifier for non-sqlite drivers', () => { + const scope = resolveEnabledTables({ driver: 'postgres', enabled_tables: ['public.customers'] }); + expect([...(scope ?? 
[])]).toEqual([tableRefKey({ catalog: null, db: 'public', name: 'customers' })]); + }); +}); diff --git a/packages/cli/test/context/scan/enrichment-state.test.ts b/packages/cli/test/context/scan/enrichment-state.test.ts index 24b4bae3..d2c37b39 100644 --- a/packages/cli/test/context/scan/enrichment-state.test.ts +++ b/packages/cli/test/context/scan/enrichment-state.test.ts @@ -1,15 +1,26 @@ import { mkdtemp, rm } from 'node:fs/promises'; import { tmpdir } from 'node:os'; import { join } from 'node:path'; +import Database from 'better-sqlite3'; import { afterEach, beforeEach, describe, expect, it } from 'vitest'; import { completedKtxScanEnrichmentStateSummary, - computeKtxScanEnrichmentInputHash, + computeKtxDescriptionsStageHash, + computeKtxEmbeddingsStageHash, + computeKtxRelationshipsStageHash, + computeKtxScanDescriptionDigest, + type KtxScanEmbeddingIdentity, + type KtxScanLlmIdentity, summarizeKtxScanEnrichmentState, } from '../../../src/context/scan/enrichment-state.js'; import { SqliteLocalScanEnrichmentStateStore } from '../../../src/context/scan/sqlite-local-enrichment-state-store.js'; +import { buildDefaultKtxProjectConfig } from '../../../src/context/project/config.js'; import type { KtxSchemaSnapshot } from '../../../src/context/scan/types.js'; +const llmIdentity: KtxScanLlmIdentity = { model: 'opus', baseUrlConfigured: false }; +const embeddingIdentity: KtxScanEmbeddingIdentity = { model: 'minilm', dimensions: 384, batchSize: 64 }; +const relationshipSettings = buildDefaultKtxProjectConfig().scan.relationships; + const snapshot: KtxSchemaSnapshot = { connectionId: 'warehouse', driver: 'postgres', @@ -53,28 +64,19 @@ describe('scan enrichment state', () => { await rm(tempDir, { recursive: true, force: true }); }); - it('computes stable input hashes without depending on object key order', () => { - const first = computeKtxScanEnrichmentInputHash({ - snapshot, - mode: 'enriched', - detectRelationships: true, - providerIdentity: { provider: 
'local-heuristic', llmModel: 'a' }, - }); - const second = computeKtxScanEnrichmentInputHash({ + it('computes stable per-stage hashes without depending on object key order', () => { + const first = computeKtxDescriptionsStageHash({ snapshot, llmIdentity }); + const second = computeKtxDescriptionsStageHash({ snapshot: { ...snapshot, metadata: {} }, - mode: 'enriched', - detectRelationships: true, - providerIdentity: { llmModel: 'a', provider: 'local-heuristic' }, + llmIdentity: { baseUrlConfigured: false, model: 'opus' }, }); const firstTable = snapshot.tables[0]; if (!firstTable) { throw new Error('Expected test snapshot table'); } - const changed = computeKtxScanEnrichmentInputHash({ + const changed = computeKtxDescriptionsStageHash({ snapshot: { ...snapshot, tables: [{ ...firstTable, name: 'orders_v2' }] }, - mode: 'enriched', - detectRelationships: true, - providerIdentity: { provider: 'local-heuristic', llmModel: 'a' }, + llmIdentity, }); expect(first).toMatch(/^[a-f0-9]{64}$/); @@ -82,13 +84,48 @@ describe('scan enrichment state', () => { expect(changed).not.toBe(first); }); + it('isolates per-stage invalidation: one input changes only its own stage', () => { + const descriptionDigest = computeKtxScanDescriptionDigest(['orders.id (integer)']); + const descriptions = computeKtxDescriptionsStageHash({ snapshot, llmIdentity }); + const embeddings = computeKtxEmbeddingsStageHash({ snapshot, embeddingIdentity, descriptionDigest }); + const relationships = computeKtxRelationshipsStageHash({ snapshot, relationshipSettings, llmIdentity }); + + // Switching the description LLM re-keys descriptions + relationships (both + // depend on llmIdentity) but NOT embeddings. 
+ const otherLlm: KtxScanLlmIdentity = { model: 'sonnet', baseUrlConfigured: false }; + expect(computeKtxDescriptionsStageHash({ snapshot, llmIdentity: otherLlm })).not.toBe(descriptions); + expect(computeKtxRelationshipsStageHash({ snapshot, relationshipSettings, llmIdentity: otherLlm })).not.toBe( + relationships, + ); + expect(computeKtxEmbeddingsStageHash({ snapshot, embeddingIdentity, descriptionDigest })).toBe(embeddings); + + // Swapping the embeddings model re-keys only embeddings. + const otherEmbedding: KtxScanEmbeddingIdentity = { model: 'mpnet', dimensions: 768, batchSize: 64 }; + expect(computeKtxEmbeddingsStageHash({ snapshot, embeddingIdentity: otherEmbedding, descriptionDigest })).not.toBe( + embeddings, + ); + expect(computeKtxDescriptionsStageHash({ snapshot, llmIdentity })).toBe(descriptions); + expect(computeKtxRelationshipsStageHash({ snapshot, relationshipSettings, llmIdentity })).toBe(relationships); + + // A description-content change (new digest) re-keys only embeddings; + // relationships are deliberately decoupled from description content (D5). + const otherDigest = computeKtxScanDescriptionDigest(['orders.id (integer). A primary key.']); + expect(computeKtxEmbeddingsStageHash({ snapshot, embeddingIdentity, descriptionDigest: otherDigest })).not.toBe( + embeddings, + ); + expect(computeKtxRelationshipsStageHash({ snapshot, relationshipSettings, llmIdentity })).toBe(relationships); + + // Flipping llmProposals re-keys only relationships. 
+ const otherRelationships = { ...relationshipSettings, llmProposals: !relationshipSettings.llmProposals }; + expect( + computeKtxRelationshipsStageHash({ snapshot, relationshipSettings: otherRelationships, llmIdentity }), + ).not.toBe(relationships); + expect(computeKtxDescriptionsStageHash({ snapshot, llmIdentity })).toBe(descriptions); + expect(computeKtxEmbeddingsStageHash({ snapshot, embeddingIdentity, descriptionDigest })).toBe(embeddings); + }); + it('persists completed stages and ignores stale hashes', async () => { - const inputHash = computeKtxScanEnrichmentInputHash({ - snapshot, - mode: 'enriched', - detectRelationships: true, - providerIdentity: { provider: 'local-heuristic' }, - }); + const inputHash = computeKtxDescriptionsStageHash({ snapshot, llmIdentity }); await store.saveCompletedStage({ runId: 'scan-run-1', @@ -103,7 +140,7 @@ describe('scan enrichment state', () => { await expect( store.findCompletedStage({ - runId: 'scan-run-1', + connectionId: 'warehouse', stage: 'descriptions', inputHash, }), @@ -116,13 +153,51 @@ describe('scan enrichment state', () => { await expect( store.findCompletedStage({ - runId: 'scan-run-1', + connectionId: 'warehouse', stage: 'descriptions', inputHash: 'different-hash', }), ).resolves.toBeNull(); }); + it('resolves a completed stage across a fresh run id by content identity', async () => { + const inputHash = computeKtxDescriptionsStageHash({ snapshot, llmIdentity }); + + await store.saveCompletedStage({ + runId: 'scan-run-first', + connectionId: 'warehouse', + syncId: 'sync-first', + mode: 'enriched', + stage: 'descriptions', + inputHash, + output: [{ table: { catalog: null, db: 'public', name: 'orders' }, tableDescription: 'first' }], + updatedAt: '2026-04-29T12:00:00.000Z', + }); + // A later run with the SAME content identity overwrites in place (the + // primary key no longer includes run_id), and the lookup resolves it + // without ever knowing the run id that produced it. 
+ await store.saveCompletedStage({ + runId: 'scan-run-second', + connectionId: 'warehouse', + syncId: 'sync-second', + mode: 'enriched', + stage: 'descriptions', + inputHash, + output: [{ table: { catalog: null, db: 'public', name: 'orders' }, tableDescription: 'second' }], + updatedAt: '2026-04-29T12:05:00.000Z', + }); + + const resolved = await store.findCompletedStage({ + connectionId: 'warehouse', + stage: 'descriptions', + inputHash, + }); + expect(resolved?.runId).toBe('scan-run-second'); + expect(resolved?.output).toEqual([ + { table: { catalog: null, db: 'public', name: 'orders' }, tableDescription: 'second' }, + ]); + }); + it('records failed stages without making them reusable', async () => { await store.saveFailedStage({ runId: 'scan-run-2', @@ -137,7 +212,7 @@ describe('scan enrichment state', () => { await expect( store.findCompletedStage({ - runId: 'scan-run-2', + connectionId: 'warehouse', stage: 'embeddings', inputHash: 'hash-2', }), @@ -153,6 +228,47 @@ describe('scan enrichment state', () => { ]); }); + it('recreates the resume cache when an older primary key shape is found', async () => { + const dbPath = join(tempDir, 'legacy.sqlite'); + const legacy = new Database(dbPath); + legacy.exec(` + CREATE TABLE local_scan_enrichment_stages ( + run_id TEXT NOT NULL, + stage TEXT NOT NULL, + input_hash TEXT NOT NULL, + connection_id TEXT NOT NULL, + sync_id TEXT NOT NULL, + mode TEXT NOT NULL, + status TEXT NOT NULL, + output_json TEXT, + error_message TEXT, + updated_at TEXT NOT NULL, + PRIMARY KEY (run_id, stage) + ); + INSERT INTO local_scan_enrichment_stages + VALUES ('old-run', 'descriptions', 'hash', 'warehouse', 'sync', 'enriched', 'completed', 'null', NULL, '2026-01-01T00:00:00.000Z'); + `); + legacy.close(); + + const recreated = new SqliteLocalScanEnrichmentStateStore({ dbPath }); + // The legacy row is dropped with the old table; the new key shape is in + // force, so a fresh save + lookup round-trips cleanly. 
+ await recreated.saveCompletedStage({ + runId: 'new-run', + connectionId: 'warehouse', + syncId: 'sync', + mode: 'enriched', + stage: 'descriptions', + inputHash: 'hash', + output: ['fresh'], + updatedAt: '2026-02-01T00:00:00.000Z', + }); + await expect( + recreated.findCompletedStage({ connectionId: 'warehouse', stage: 'descriptions', inputHash: 'hash' }), + ).resolves.toMatchObject({ runId: 'new-run', output: ['fresh'] }); + await expect(recreated.listRunStages('old-run')).resolves.toEqual([]); + }); + it('summarizes resumed, completed, and failed stages for reports', () => { expect( summarizeKtxScanEnrichmentState({ diff --git a/packages/cli/test/context/scan/local-enrichment-artifacts.test.ts b/packages/cli/test/context/scan/local-enrichment-artifacts.test.ts index 638bafb2..f7fe42cb 100644 --- a/packages/cli/test/context/scan/local-enrichment-artifacts.test.ts +++ b/packages/cli/test/context/scan/local-enrichment-artifacts.test.ts @@ -5,7 +5,11 @@ import { afterEach, beforeEach, describe, expect, it } from 'vitest'; import YAML from 'yaml'; import { initKtxProject, type KtxLocalProject } from '../../../src/context/project/project.js'; import type { KtxLocalScanEnrichmentResult } from '../../../src/context/scan/local-enrichment.js'; -import { writeLocalScanEnrichmentArtifacts, writeLocalScanManifestShards } from '../../../src/context/scan/local-enrichment-artifacts.js'; +import { + loadOnDiskDescriptionUpdates, + writeLocalScanEnrichmentArtifacts, + writeLocalScanManifestShards, +} from '../../../src/context/scan/local-enrichment-artifacts.js'; import type { KtxSchemaSnapshot } from '../../../src/context/scan/types.js'; const snapshot: KtxSchemaSnapshot = { @@ -220,6 +224,7 @@ function enrichment(): KtxLocalScanEnrichmentResult { }, ], compositeRelationships: null, + relationshipPartial: null, }; } @@ -238,6 +243,86 @@ describe('writeLocalScanEnrichmentArtifacts', () => { await rm(tempDir, { recursive: true, force: true }); }); + it('scopes manifest 
descriptions by full table identity across same-named tables in different schemas', async () => { + const multiSchemaSnapshot: KtxSchemaSnapshot = { + connectionId: 'warehouse', + driver: 'postgres', + extractedAt: '2026-04-29T12:00:00.000Z', + scope: { schemas: ['analytics', 'staging'] }, + metadata: {}, + tables: ['analytics', 'staging'].map((schema) => ({ + catalog: null, + db: schema, + name: 'orders', + kind: 'table', + comment: null, + estimatedRows: 1, + foreignKeys: [], + columns: [ + { + name: 'id', + nativeType: 'integer', + normalizedType: 'integer', + dimensionType: 'number', + nullable: false, + primaryKey: true, + comment: null, + }, + ], + })), + }; + const descriptionUpdates = [ + { + table: { catalog: null, db: 'analytics', name: 'orders' }, + tableDescription: 'Curated analytics orders', + columnDescriptions: { id: 'Analytics order id' }, + }, + { + table: { catalog: null, db: 'staging', name: 'orders' }, + tableDescription: 'Raw staging orders', + columnDescriptions: { id: 'Staging order id' }, + }, + ]; + + await writeLocalScanManifestShards({ + project, + connectionId: 'warehouse', + syncId: 'sync-multi', + driver: 'postgres', + snapshot: multiSchemaSnapshot, + descriptionUpdates, + dryRun: false, + }); + + type Shard = { + tables: Record< + string, + { descriptions?: Record; columns: Array<{ name: string; descriptions?: Record }> } + >; + }; + const analyticsShard = YAML.parse( + await readFile(join(project.projectDir, 'semantic-layer/warehouse/_schema/analytics.yaml'), 'utf-8'), + ) as Shard; + const stagingShard = YAML.parse( + await readFile(join(project.projectDir, 'semantic-layer/warehouse/_schema/staging.yaml'), 'utf-8'), + ) as Shard; + + expect(analyticsShard.tables.orders?.descriptions?.ai).toBe('Curated analytics orders'); + expect(stagingShard.tables.orders?.descriptions?.ai).toBe('Raw staging orders'); + expect(analyticsShard.tables.orders?.columns[0]?.descriptions?.ai).toBe('Analytics order id'); + 
expect(stagingShard.tables.orders?.columns[0]?.descriptions?.ai).toBe('Staging order id'); + + // The on-disk reconstruction (used by selective `--stages` runs that skip the + // descriptions stage) must also resolve per identity, not collapse names. + const reconstructed = await loadOnDiskDescriptionUpdates(project, 'warehouse', multiSchemaSnapshot); + const analytics = reconstructed.find((update) => update.table.db === 'analytics'); + const staging = reconstructed.find((update) => update.table.db === 'staging'); + expect(analytics?.tableDescription).toBe('Curated analytics orders'); + expect(staging?.tableDescription).toBe('Raw staging orders'); + expect(analytics?.columnDescriptions.id).toBe('Analytics order id'); + expect(staging?.columnDescriptions.id).toBe('Staging order id'); + }); + it('writes enrichment artifacts and manifest shards while preserving external descriptions', async () => { await project.fileStore.writeFile( 'semantic-layer/warehouse/_schema/public.yaml', @@ -291,6 +376,7 @@ describe('writeLocalScanEnrichmentArtifacts', () => { profileSampleRows: 500, profileConcurrency: 3, validationConcurrency: 2, + detectionBudgetMs: 600000, }, }); @@ -476,6 +562,7 @@ describe('writeLocalScanEnrichmentArtifacts', () => { profileSampleRows: 10000, profileConcurrency: 4, validationConcurrency: 4, + detectionBudgetMs: 600000, }, dryRun: false, }); @@ -746,6 +833,7 @@ describe('writeLocalScanEnrichmentArtifacts', () => { profileSampleRows: 10000, profileConcurrency: 4, validationConcurrency: 4, + detectionBudgetMs: 600000, }, dryRun: false, }); diff --git a/packages/cli/test/context/scan/local-enrichment.test.ts b/packages/cli/test/context/scan/local-enrichment.test.ts index dd2d6133..2db86ac3 100644 --- a/packages/cli/test/context/scan/local-enrichment.test.ts +++ b/packages/cli/test/context/scan/local-enrichment.test.ts @@ -1,6 +1,15 @@ +import { mkdtemp, readFile, rm } from 'node:fs/promises'; +import { tmpdir } from 'node:os'; +import { join } from 
'node:path'; import Database from 'better-sqlite3'; import { describe, expect, it, vi } from 'vitest'; +import YAML from 'yaml'; import { buildDefaultKtxProjectConfig } from '../../../src/context/project/config.js'; +import { initKtxProject } from '../../../src/context/project/project.js'; +import { + loadOnDiskDescriptionUpdates, + writeLocalScanEnrichmentArtifacts, +} from '../../../src/context/scan/local-enrichment-artifacts.js'; import type { KtxScanEnrichmentCompletedStage, KtxScanEnrichmentFailedStage, @@ -201,15 +210,24 @@ function noDeclaredRelationshipSnapshot(): KtxSchemaSnapshot { function memoryEnrichmentStateStore(): KtxScanEnrichmentStateStore { const records = new Map(); - const key = (input: Pick) => `${input.runId}:${input.stage}`; + const key = (input: Pick) => + `${input.connectionId}:${input.stage}:${input.inputHash}`; return { async findCompletedStage(input: KtxScanEnrichmentStageLookup) { const record = records.get(key(input)); - if (!record || record.status !== 'completed' || record.inputHash !== input.inputHash) { + if (!record || record.status !== 'completed') { return null; } return record as KtxScanEnrichmentCompletedStage; }, + async findLatestCompletedStage(input) { + const matches = [...records.values()].filter( + (record): record is KtxScanEnrichmentCompletedStage => + record.status === 'completed' && record.connectionId === input.connectionId && record.stage === input.stage, + ); + matches.sort((left, right) => (left.updatedAt < right.updatedAt ? 1 : -1)); + return matches[0] ?? 
null; + }, async saveCompletedStage(input) { records.set(key(input), { ...input, @@ -246,6 +264,57 @@ describe('local scan enrichment', () => { }); }); + it('scopes descriptions by full table identity across same-named tables in different schemas', () => { + const multiSchemaSnapshot: KtxSchemaSnapshot = { + connectionId: 'warehouse', + driver: 'postgres', + extractedAt: '2026-04-29T12:00:00.000Z', + scope: { schemas: ['analytics', 'staging'] }, + metadata: {}, + tables: ['analytics', 'staging'].map((schema) => ({ + catalog: null, + db: schema, + name: 'orders', + kind: 'table', + comment: null, + estimatedRows: 1, + foreignKeys: [], + columns: [ + { + name: 'id', + nativeType: 'integer', + normalizedType: 'integer', + dimensionType: 'number', + nullable: false, + primaryKey: true, + comment: null, + }, + ], + })), + }; + const descriptions = [ + { + table: { catalog: null, db: 'analytics', name: 'orders' }, + tableDescription: 'Curated analytics orders', + columnDescriptions: { id: 'Analytics order id' }, + }, + { + table: { catalog: null, db: 'staging', name: 'orders' }, + tableDescription: 'Raw staging orders', + columnDescriptions: { id: 'Staging order id' }, + }, + ]; + + const schema = snapshotToKtxEnrichedSchema(multiSchemaSnapshot, new Map(), descriptions); + + const analytics = schema.tables.find((table) => table.id === 'analytics.orders'); + const staging = schema.tables.find((table) => table.id === 'staging.orders'); + expect(analytics?.descriptions.ai).toBe('Curated analytics orders'); + expect(staging?.descriptions.ai).toBe('Raw staging orders'); + expect(analytics?.columns[0]?.descriptions.ai).toBe('Analytics order id'); + expect(staging?.columns[0]?.descriptions.ai).toBe('Staging order id'); + }); + it('maps snapshot foreign keys into formal schema relationships', () => { const source = noDeclaredRelationshipSnapshot(); const snapshotWithForeignKey = { @@ -617,8 +686,8 @@ describe('local scan enrichment', () => { expect(events).toEqual( 
expect.arrayContaining([ - expect.objectContaining({ message: 'Generating descriptions 1/2 tables', transient: true }), - expect.objectContaining({ message: 'Generating descriptions 2/2 tables', transient: true }), + expect.objectContaining({ message: 'Generating descriptions 1/2 (customers, 1 cols)', transient: true }), + expect.objectContaining({ message: 'Generating descriptions 2/2 (orders, 2 cols)', transient: true }), expect.objectContaining({ message: 'Building embeddings 1/1 batches', transient: true }), expect.objectContaining({ message: 'Detecting relationships' }), ]), @@ -711,7 +780,7 @@ describe('local scan enrichment', () => { expect(embedBatch.mock.calls.map(([texts]) => texts).map((texts) => texts.length)).toEqual([2, 2, 1]); }); - it('reuses completed description and embedding stages for the same run id and snapshot hash', async () => { + it('reuses completed description and embedding stages across a fresh run id by content identity', async () => { const stateStore = memoryEnrichmentStateStore(); const scanConnector = connector(); const providers = { @@ -728,21 +797,25 @@ describe('local scan enrichment', () => { providers, stateStore, syncId: 'sync-resume-1', - providerIdentity: { provider: 'fake', embeddingDimensions: 6 }, + llmIdentity: { model: 'fake', baseUrlConfigured: false }, + embeddingIdentity: { model: 'fake-embed', dimensions: 6, batchSize: 64 }, }); const generateObject = vi.spyOn(providers.llmRuntime, 'generateObject'); const embedBatch = vi.spyOn(providers.embedding, 'embedBatch'); + // A re-run mints a brand-new runId/syncId (as a real interrupted ingest + // would); resume must still hit the cache via (connectionId, stage, inputHash). 
const second = await runLocalScanEnrichment({ connectionId: 'warehouse', mode: 'enriched', detectRelationships: true, connector: scanConnector, - context: { runId: 'scan-run-resume-1' }, + context: { runId: 'scan-run-resume-2' }, providers, stateStore, - syncId: 'sync-resume-1', - providerIdentity: { provider: 'fake', embeddingDimensions: 6 }, + syncId: 'sync-resume-2', + llmIdentity: { model: 'fake', baseUrlConfigured: false }, + embeddingIdentity: { model: 'fake-embed', dimensions: 6, batchSize: 64 }, }); expect(first.state.completedStages).toEqual(['descriptions', 'embeddings', 'relationships']); @@ -756,6 +829,159 @@ describe('local scan enrichment', () => { expect(second.relationships).toEqual(first.relationships); }); + it('marks a budget-truncated relationship stage partial, persists it, and re-runs only when the budget is raised', async () => { + const executor = new InMemorySqliteExecutor(); + try { + executor.db.exec(` + CREATE TABLE accounts (id INTEGER NOT NULL); + CREATE TABLE orders (id INTEGER NOT NULL, account_id INTEGER NOT NULL); + INSERT INTO accounts (id) VALUES (1), (2); + INSERT INTO orders (id, account_id) VALUES (10, 1), (11, 1), (12, 2); + `); + const scanConnector = { + ...connector(), + driver: 'sqlite' as const, + capabilities: createKtxConnectorCapabilities({ readOnlySql: true, columnStats: true }), + introspect: vi.fn(async () => noDeclaredRelationshipSnapshot()), + executeReadOnly: executor.executeReadOnly.bind(executor), + }; + const stateStore = memoryEnrichmentStateStore(); + const base = Date.parse('2026-06-01T00:00:00.000Z'); + let calls = 0; + // A clock that jumps a second per read against a 1ms budget trips at the + // first table-profile boundary. 
+ const advancingNow = () => new Date(base + calls++ * 1000); + const tightSettings = { + ...buildDefaultKtxProjectConfig().scan.relationships, + detectionBudgetMs: 1, + }; + + const first = await runLocalScanEnrichment({ + connectionId: 'warehouse', + mode: 'relationships', + detectRelationships: true, + connector: scanConnector, + context: { runId: 'budget-run-1' }, + providers: null, + stateStore, + syncId: 'sync-budget-1', + relationshipSettings: tightSettings, + now: advancingNow, + }); + + expect(first.relationshipPartial).toEqual({ reason: 'budget' }); + expect(first.warnings.map((warning) => warning.code)).toContain('relationship_detection_partial'); + expect(first.state.completedStages).toContain('relationships'); + + // A re-run with a fresh runId resumes the saved partial from cache. + const second = await runLocalScanEnrichment({ + connectionId: 'warehouse', + mode: 'relationships', + detectRelationships: true, + connector: scanConnector, + context: { runId: 'budget-run-2' }, + providers: null, + stateStore, + syncId: 'sync-budget-2', + relationshipSettings: tightSettings, + }); + expect(second.state.resumedStages).toContain('relationships'); + + // Raising the budget changes the content identity, forcing a fuller run. 
+ const third = await runLocalScanEnrichment({ + connectionId: 'warehouse', + mode: 'relationships', + detectRelationships: true, + connector: scanConnector, + context: { runId: 'budget-run-3' }, + providers: null, + stateStore, + syncId: 'sync-budget-3', + relationshipSettings: { ...tightSettings, detectionBudgetMs: 600_000 }, + }); + expect(third.state.resumedStages).not.toContain('relationships'); + expect(third.relationshipPartial).toBeNull(); + } finally { + executor.close(); + } + }); + + it('checkpoints descriptions and embeddings before the relationship stage queries the database', async () => { + const executor = new InMemorySqliteExecutor(); + try { + executor.db.exec(` + CREATE TABLE accounts (id INTEGER NOT NULL); + CREATE TABLE orders (id INTEGER NOT NULL, account_id INTEGER NOT NULL); + INSERT INTO accounts (id) VALUES (1), (2); + INSERT INTO orders (id, account_id) VALUES (10, 1), (11, 1), (12, 2); + `); + const checkpoints: Array>> = []; + let sawRelationshipQuery = false; + let relationshipQueryRanAfterCheckpoint = true; + const scanConnector = { + ...connector(), + driver: 'sqlite' as const, + capabilities: createKtxConnectorCapabilities({ readOnlySql: true, columnStats: true }), + introspect: vi.fn(async () => noDeclaredRelationshipSnapshot()), + executeReadOnly: (input: KtxReadOnlyQueryInput, ctx: KtxScanContext) => { + sawRelationshipQuery = true; + if (checkpoints.length === 0) { + relationshipQueryRanAfterCheckpoint = false; + } + return executor.executeReadOnly(input, ctx); + }, + }; + + const result = await runLocalScanEnrichment({ + connectionId: 'warehouse', + mode: 'enriched', + detectRelationships: true, + connector: scanConnector, + context: { runId: 'checkpoint-order' }, + providers: { + ...createDeterministicLocalScanEnrichmentProviders(), + embedding: fakeScanEmbedding({ dimensions: 6 }), + }, + onCheckpoint: async (checkpoint) => { + checkpoints.push(checkpoint); + }, + }); + + expect(checkpoints).toHaveLength(1); + const 
checkpoint = checkpoints[0]; + if (!checkpoint) { + throw new Error('Expected a checkpoint'); + } + expect(checkpoint.summary.tableDescriptions).toBe('completed'); + expect(checkpoint.summary.embeddings).toBe('completed'); + expect(checkpoint.descriptionUpdates.length).toBeGreaterThan(0); + expect(checkpoint.embeddingUpdates.length).toBeGreaterThan(0); + // The relationship-specific outputs are deliberately absent at checkpoint time. + expect(checkpoint.relationshipUpdate).toBeNull(); + expect(checkpoint.relationshipProfile).toBeNull(); + expect(sawRelationshipQuery).toBe(true); + expect(relationshipQueryRanAfterCheckpoint).toBe(true); + // The final result still carries the relationship outputs. + expect(result.relationshipProfile).not.toBeNull(); + } finally { + executor.close(); + } + }); + + it('does not checkpoint when relationship detection is skipped', async () => { + const onCheckpoint = vi.fn(async () => {}); + await runLocalScanEnrichment({ + connectionId: 'warehouse', + mode: 'enriched', + connector: connector(), + context: { runId: 'no-checkpoint' }, + providers: createDeterministicLocalScanEnrichmentProviders(), + relationshipSettings: { ...buildDefaultKtxProjectConfig().scan.relationships, enabled: false }, + onCheckpoint, + }); + expect(onCheckpoint).not.toHaveBeenCalled(); + }); + it('does not reuse completed stages when the snapshot changes', async () => { const stateStore = memoryEnrichmentStateStore(); const providers = { @@ -773,7 +999,8 @@ describe('local scan enrichment', () => { providers, stateStore, syncId: 'sync-resume-hash', - providerIdentity: { provider: 'fake', embeddingDimensions: 6 }, + llmIdentity: { model: 'fake', baseUrlConfigured: false }, + embeddingIdentity: { model: 'fake-embed', dimensions: 6, batchSize: 64 }, }); const firstTable = snapshot.tables[0]; @@ -798,7 +1025,8 @@ describe('local scan enrichment', () => { providers, stateStore, syncId: 'sync-resume-hash', - providerIdentity: { provider: 'fake', embeddingDimensions: 6 
}, + llmIdentity: { model: 'fake', baseUrlConfigured: false }, + embeddingIdentity: { model: 'fake-embed', dimensions: 6, batchSize: 64 }, }); expect(result.state.resumedStages).toEqual([]); @@ -868,4 +1096,653 @@ describe('local scan enrichment', () => { } }); + it('merges ai descriptions into the enriched relationship schema', () => { + const schema = snapshotToKtxEnrichedSchema(snapshot, new Map(), [ + { + table: { catalog: null, db: 'public', name: 'orders' }, + tableDescription: 'All customer orders', + columnDescriptions: { customer_id: 'FK to the owning customer' }, + }, + ]); + const orders = schema.tables.find((table) => table.ref.name === 'orders'); + expect(orders?.descriptions).toMatchObject({ db: 'Customer orders', ai: 'All customer orders' }); + expect(orders?.columns.find((column) => column.name === 'customer_id')?.descriptions).toMatchObject({ + db: 'Customer id', + ai: 'FK to the owning customer', + }); + }); + + it('force-reruns a named stage past the completed-row short-circuit and leaves unselected stages untouched', async () => { + const stateStore = memoryEnrichmentStateStore(); + const scanConnector = connector(); + const providers = { + ...createDeterministicLocalScanEnrichmentProviders(), + embedding: fakeScanEmbedding({ dimensions: 6 }), + }; + const identity = { + llmIdentity: { model: 'fake', baseUrlConfigured: false }, + embeddingIdentity: { model: 'fake-embed', dimensions: 6, batchSize: 64 }, + }; + + await runLocalScanEnrichment({ + connectionId: 'warehouse', + mode: 'enriched', + detectRelationships: true, + connector: scanConnector, + context: { runId: 'force-1' }, + providers, + stateStore, + syncId: 'force-s1', + ...identity, + }); + + const generateObject = vi.spyOn(providers.llmRuntime, 'generateObject'); + const embedBatch = vi.spyOn(providers.embedding, 'embedBatch'); + + const rerun = await runLocalScanEnrichment({ + connectionId: 'warehouse', + mode: 'enriched', + detectRelationships: true, + connector: scanConnector, + 
context: { runId: 'force-2' }, + providers, + stateStore, + syncId: 'force-s2', + stages: ['descriptions'], + ...identity, + }); + + // Only descriptions ran, and it recomputed (not resumed) despite a matching + // completed row; embeddings + relationships were left untouched. + expect(rerun.state.completedStages).toEqual(['descriptions']); + expect(rerun.state.resumedStages).toEqual([]); + expect(generateObject).toHaveBeenCalled(); + expect(embedBatch).not.toHaveBeenCalled(); + }); + + it('naming every stage forces a full recompute rather than a no-op resume', async () => { + const stateStore = memoryEnrichmentStateStore(); + const scanConnector = connector(); + const providers = { + ...createDeterministicLocalScanEnrichmentProviders(), + embedding: fakeScanEmbedding({ dimensions: 6 }), + }; + const identity = { + llmIdentity: { model: 'fake', baseUrlConfigured: false }, + embeddingIdentity: { model: 'fake-embed', dimensions: 6, batchSize: 64 }, + }; + + await runLocalScanEnrichment({ + connectionId: 'warehouse', + mode: 'enriched', + detectRelationships: true, + connector: scanConnector, + context: { runId: 'full-1' }, + providers, + stateStore, + syncId: 'full-s1', + ...identity, + }); + + const generateObject = vi.spyOn(providers.llmRuntime, 'generateObject'); + const embedBatch = vi.spyOn(providers.embedding, 'embedBatch'); + + const rerun = await runLocalScanEnrichment({ + connectionId: 'warehouse', + mode: 'enriched', + detectRelationships: true, + connector: scanConnector, + context: { runId: 'full-2' }, + providers, + stateStore, + syncId: 'full-s2', + stages: ['descriptions', 'embeddings', 'relationships'], + ...identity, + }); + + expect(rerun.state.resumedStages).toEqual([]); + expect(rerun.state.completedStages).toEqual(['descriptions', 'embeddings', 'relationships']); + expect(generateObject).toHaveBeenCalled(); + expect(embedBatch).toHaveBeenCalled(); + }); + + it('isolates per-stage invalidation: changing the embedding identity re-runs only 
embeddings', async () => { + const stateStore = memoryEnrichmentStateStore(); + const scanConnector = connector(); + const providers = { + ...createDeterministicLocalScanEnrichmentProviders(), + embedding: fakeScanEmbedding({ dimensions: 6 }), + }; + const llmIdentity = { model: 'fake', baseUrlConfigured: false }; + + await runLocalScanEnrichment({ + connectionId: 'warehouse', + mode: 'enriched', + detectRelationships: true, + connector: scanConnector, + context: { runId: 'iso-1' }, + providers, + stateStore, + syncId: 'iso-s1', + llmIdentity, + embeddingIdentity: { model: 'embed-v1', dimensions: 6, batchSize: 64 }, + }); + + const generateObject = vi.spyOn(providers.llmRuntime, 'generateObject'); + const embedBatch = vi.spyOn(providers.embedding, 'embedBatch'); + + const rerun = await runLocalScanEnrichment({ + connectionId: 'warehouse', + mode: 'enriched', + detectRelationships: true, + connector: scanConnector, + context: { runId: 'iso-2' }, + providers, + stateStore, + syncId: 'iso-s2', + llmIdentity, + embeddingIdentity: { model: 'embed-v2', dimensions: 6, batchSize: 64 }, + }); + + // Only the embeddings hash moved: descriptions + relationships resume from + // cache, embeddings recompute. No LLM description/proposal calls fire. + expect(rerun.state.resumedStages).toEqual(['descriptions', 'relationships']); + expect(rerun.state.completedStages).toEqual(['descriptions', 'embeddings', 'relationships']); + expect(generateObject).not.toHaveBeenCalled(); + expect(embedBatch).toHaveBeenCalled(); + }); + + it('warns when a selected stage cannot run because its prerequisite is missing', async () => { + const result = await runLocalScanEnrichment({ + connectionId: 'warehouse', + mode: 'enriched', + detectRelationships: false, + connector: connector(), + context: { runId: 'prereq-1' }, + // No embedding provider configured. 
+ providers: createDeterministicLocalScanEnrichmentProviders(), + stages: ['embeddings'], + llmIdentity: { model: 'fake', baseUrlConfigured: false }, + }); + + expect(result.summary.embeddings).toBe('skipped'); + expect(result.warnings).toContainEqual( + expect.objectContaining({ code: 'enrichment_stage_skipped', metadata: { stage: 'embeddings' } }), + ); + }); + + it('feeds on-disk descriptions into the llmProposals prompt on a relationships-only run', async () => { + const executor = new InMemorySqliteExecutor(); + try { + executor.db.exec(` + CREATE TABLE accounts (id INTEGER NOT NULL); + CREATE TABLE orders (id INTEGER NOT NULL, account_id INTEGER NOT NULL); + INSERT INTO accounts (id) VALUES (1), (2); + INSERT INTO orders (id, account_id) VALUES (10, 1), (11, 1), (12, 2); + `); + const scanConnector = { + ...connector(), + driver: 'sqlite' as const, + capabilities: createKtxConnectorCapabilities({ readOnlySql: true, columnStats: true }), + introspect: vi.fn(async () => noDeclaredRelationshipSnapshot()), + executeReadOnly: executor.executeReadOnly.bind(executor), + }; + const providers = createDeterministicLocalScanEnrichmentProviders(); + const generateObject = vi.spyOn(providers.llmRuntime, 'generateObject'); + const onDiskDescriptions: Array<{ + table: { catalog: null; db: null; name: string }; + tableDescription: string | null; + columnDescriptions: Record; + }> = [ + { + table: { catalog: null, db: null, name: 'orders' }, + tableDescription: 'Customer purchase orders', + columnDescriptions: { id: 'Order identifier', account_id: 'The owning account reference' }, + }, + { + table: { catalog: null, db: null, name: 'accounts' }, + tableDescription: 'Account records', + columnDescriptions: { id: 'Account identifier' }, + }, + ]; + + await runLocalScanEnrichment({ + connectionId: 'warehouse', + mode: 'enriched', + detectRelationships: true, + connector: scanConnector, + context: { runId: 'rel-only-hydration' }, + providers, + stages: ['relationships'], + 
llmIdentity: { model: 'fake', baseUrlConfigured: false }, + loadPriorDescriptions: async () => onDiskDescriptions, + }); + + // The relationship-proposal prompt (the only generateObject calls on a + // relationships-only run) carries the on-disk descriptions, not just names. + const prompts = generateObject.mock.calls.map((call) => String((call[0] as { prompt: string }).prompt)); + expect(prompts.length).toBeGreaterThan(0); + expect(prompts.some((prompt) => prompt.includes('The owning account reference'))).toBe(true); + } finally { + executor.close(); + } + }); + + it('resume record still skips already-enriched tables when a forced descriptions rerun re-enters compute', async () => { + const stateStore = memoryEnrichmentStateStore(); + const scanConnector = connector(); + const providers = createDeterministicLocalScanEnrichmentProviders(); + const identity = { llmIdentity: { model: 'fake', baseUrlConfigured: false } }; + const resumeStore = { + load: vi.fn(async () => [ + { + table: { catalog: null, db: 'public', name: 'customers' }, + tableDescription: 'Recovered customers description', + columnDescriptions: { id: 'Recovered id' }, + }, + ]), + flush: vi.fn(async () => {}), + }; + + // Populate a completed descriptions row so a non-forced run would short-circuit. 
+ await runLocalScanEnrichment({ + connectionId: 'warehouse', + mode: 'enriched', + detectRelationships: false, + connector: scanConnector, + context: { runId: 'resume-force-1' }, + providers, + stateStore, + syncId: 'resume-force-s1', + ...identity, + }); + + const generateObject = vi.spyOn(providers.llmRuntime, 'generateObject'); + const rerun = await runLocalScanEnrichment({ + connectionId: 'warehouse', + mode: 'enriched', + detectRelationships: false, + connector: scanConnector, + context: { runId: 'resume-force-2' }, + providers, + stateStore, + syncId: 'resume-force-s2', + stages: ['descriptions'], + descriptionResumeStore: resumeStore, + ...identity, + }); + + // Forced compute re-entered, consulted the resume record, recovered + // 'customers', and only re-issued the LLM for the un-recovered 'orders'. + expect(resumeStore.load).toHaveBeenCalled(); + expect(generateObject).toHaveBeenCalledTimes(1); + expect(rerun.descriptionUpdates.find((update) => update.table.name === 'customers')?.tableDescription).toBe( + 'Recovered customers description', + ); + expect(rerun.state.resumedStages).toEqual([]); + }); + + it('resumes per table identity, re-enriching a same-named table in another schema', async () => { + const multiSchemaSnapshot: KtxSchemaSnapshot = { + connectionId: 'warehouse', + driver: 'postgres', + extractedAt: '2026-04-29T12:00:00.000Z', + scope: { schemas: ['analytics', 'staging'] }, + metadata: {}, + tables: ['analytics', 'staging'].map((schema) => ({ + catalog: null, + db: schema, + name: 'orders', + kind: 'table', + comment: null, + estimatedRows: 1, + foreignKeys: [], + columns: [ + { + name: 'id', + nativeType: 'integer', + normalizedType: 'integer', + dimensionType: 'number', + nullable: false, + primaryKey: true, + comment: null, + }, + ], + })), + }; + const scanConnector = connector(); + const providers = createDeterministicLocalScanEnrichmentProviders(); + const generateObject = vi.spyOn(providers.llmRuntime, 'generateObject'); + // Only 
the analytics.orders description was flushed before the interruption. + const resumeStore = { + load: vi.fn(async () => [ + { + table: { catalog: null, db: 'analytics', name: 'orders' }, + tableDescription: 'Recovered analytics orders', + columnDescriptions: { id: 'Recovered analytics id' }, + }, + ]), + flush: vi.fn(async () => {}), + }; + + const result = await runLocalScanEnrichment({ + connectionId: 'warehouse', + mode: 'enriched', + detectRelationships: false, + connector: scanConnector, + snapshot: multiSchemaSnapshot, + context: { runId: 'resume-identity' }, + providers, + descriptionResumeStore: resumeStore, + relationshipSettings: { ...buildDefaultKtxProjectConfig().scan.relationships, enabled: false }, + }); + + // staging.orders is not recovered (different identity), so it is re-enriched + // exactly once; analytics.orders keeps its recovered description. + expect(generateObject).toHaveBeenCalledTimes(1); + const analytics = result.descriptionUpdates.find((update) => update.table.db === 'analytics'); + const staging = result.descriptionUpdates.find((update) => update.table.db === 'staging'); + expect(analytics?.tableDescription).toBe('Recovered analytics orders'); + expect(staging?.tableDescription).not.toBe('Recovered analytics orders'); + expect(staging?.tableDescription).toBeTruthy(); + }); + + it('flags an unselected stage stale when its inputs changed, names the cascade, and clears after re-running it', async () => { + const stateStore = memoryEnrichmentStateStore(); + const scanConnector = connector(); + const providers = { + ...createDeterministicLocalScanEnrichmentProviders(), + embedding: fakeScanEmbedding({ dimensions: 6 }), + }; + const llmIdentity = { model: 'fake', baseUrlConfigured: false }; + const embeddingV1 = { model: 'embed-v1', dimensions: 6, batchSize: 64 }; + const embeddingV2 = { model: 'embed-v2', dimensions: 6, batchSize: 64 }; + + // Full run captures embeddings + relationships keyed on the v1 embedding model. 
+ const full = await runLocalScanEnrichment({ + connectionId: 'warehouse', + mode: 'enriched', + detectRelationships: true, + connector: scanConnector, + context: { runId: 'stale-1' }, + providers, + stateStore, + syncId: 'stale-s1', + llmIdentity, + embeddingIdentity: embeddingV1, + }); + // Stand in for the persisted _schema so embeddings-only runs see the same + // descriptions the descriptions stage produces (deterministic content). + const loadPriorDescriptions = async () => full.descriptionUpdates; + + // The embedding model changed in config, but the operator re-ran only descriptions. + const reDescribe = await runLocalScanEnrichment({ + connectionId: 'warehouse', + mode: 'enriched', + detectRelationships: true, + connector: scanConnector, + context: { runId: 'stale-2' }, + providers, + stateStore, + syncId: 'stale-s2', + stages: ['descriptions'], + loadPriorDescriptions, + llmIdentity, + embeddingIdentity: embeddingV2, + }); + const stale = reDescribe.warnings.filter((warning) => warning.code === 'enrichment_stage_stale'); + expect(stale.map((warning) => warning.metadata?.stage)).toEqual(['embeddings']); + expect(stale[0]?.message).toContain('--stages embeddings'); + + // Re-embedding on v2 stores the fresh embeddings hash, clearing the staleness. 
+ await runLocalScanEnrichment({ + connectionId: 'warehouse', + mode: 'enriched', + detectRelationships: true, + connector: scanConnector, + context: { runId: 'stale-3' }, + providers, + stateStore, + syncId: 'stale-s3', + stages: ['embeddings'], + loadPriorDescriptions, + llmIdentity, + embeddingIdentity: embeddingV2, + }); + const afterReembed = await runLocalScanEnrichment({ + connectionId: 'warehouse', + mode: 'enriched', + detectRelationships: true, + connector: scanConnector, + context: { runId: 'stale-4' }, + providers, + stateStore, + syncId: 'stale-s4', + stages: ['descriptions'], + loadPriorDescriptions, + llmIdentity, + embeddingIdentity: embeddingV2, + }); + expect(afterReembed.warnings.filter((warning) => warning.code === 'enrichment_stage_stale')).toEqual([]); + }); + + const enrichedFixtureSnapshot = (): KtxSchemaSnapshot => ({ + connectionId: 'warehouse', + driver: 'sqlite', + extractedAt: '2026-05-07T00:00:00.000Z', + scope: {}, + metadata: {}, + tables: [ + { + catalog: null, + db: null, + name: 'accounts', + kind: 'table', + comment: 'DB accounts', + estimatedRows: 2, + foreignKeys: [], + columns: [ + { + name: 'id', + nativeType: 'INTEGER', + normalizedType: 'integer', + dimensionType: 'number', + nullable: false, + primaryKey: false, + comment: 'DB accounts id', + }, + ], + }, + { + catalog: null, + db: null, + name: 'orders', + kind: 'table', + comment: 'DB orders', + estimatedRows: 3, + foreignKeys: [], + columns: [ + { + name: 'id', + nativeType: 'INTEGER', + normalizedType: 'integer', + dimensionType: 'number', + nullable: false, + primaryKey: false, + comment: 'DB orders id', + }, + { + name: 'account_id', + nativeType: 'INTEGER', + normalizedType: 'integer', + dimensionType: 'number', + nullable: false, + primaryKey: false, + comment: 'DB account ref', + }, + ], + }, + ], + }); + + const countKeyOccurrences = (text: string, key: string): number => + (text.match(new RegExp(`\\b${key}:`, 'g')) ?? 
[]).length; + + // Regression (spec 21 defect, 2026-06-24): a --stages subset that omits a stage + // must not delete that stage's on-disk artifacts from the written _schema. + it('a --stages relationships run preserves on-disk descriptions while adding joins', async () => { + const tempDir = await mkdtemp(join(tmpdir(), 'ktx-stage-preserve-rel-')); + const executor = new InMemorySqliteExecutor(); + try { + executor.db.exec(` + CREATE TABLE accounts (id INTEGER NOT NULL); + CREATE TABLE orders (id INTEGER NOT NULL, account_id INTEGER NOT NULL); + INSERT INTO accounts (id) VALUES (1), (2); + INSERT INTO orders (id, account_id) VALUES (10, 1), (11, 1), (12, 2); + `); + const project = await initKtxProject({ projectDir: join(tempDir, 'project') }); + const shardPath = 'semantic-layer/warehouse/_schema/public.yaml'; + // Enriched fixture: full ai + db descriptions, zero joins. + await project.fileStore.writeFile( + shardPath, + YAML.stringify( + { + tables: { + accounts: { + table: 'accounts', + descriptions: { ai: 'AI accounts table', db: 'DB accounts' }, + columns: [{ name: 'id', type: 'number', descriptions: { ai: 'AI accounts id', db: 'DB accounts id' } }], + }, + orders: { + table: 'orders', + descriptions: { ai: 'AI orders table', db: 'DB orders' }, + columns: [ + { name: 'id', type: 'number', descriptions: { ai: 'AI orders id', db: 'DB orders id' } }, + { name: 'account_id', type: 'number', descriptions: { ai: 'AI account ref', db: 'DB account ref' } }, + ], + }, + }, + }, + { indent: 2, lineWidth: 0 }, + ), + 'ktx', + 'ktx@example.com', + 'Seed enriched fixture', + ); + const before = await readFile(join(project.projectDir, shardPath), 'utf-8'); + const aiBefore = countKeyOccurrences(before, 'ai'); + const dbBefore = countKeyOccurrences(before, 'db'); + expect(aiBefore).toBeGreaterThan(0); + + const scanConnector = { + ...connector(), + driver: 'sqlite' as const, + capabilities: createKtxConnectorCapabilities({ readOnlySql: true, columnStats: true }), + 
introspect: vi.fn(async () => enrichedFixtureSnapshot()), + executeReadOnly: executor.executeReadOnly.bind(executor), + }; + const result = await runLocalScanEnrichment({ + connectionId: 'warehouse', + mode: 'enriched', + detectRelationships: true, + connector: scanConnector, + context: { runId: 'preserve-rel-1' }, + providers: createDeterministicLocalScanEnrichmentProviders(), + stages: ['relationships'], + syncId: 'sync-preserve-rel', + loadPriorDescriptions: (snap) => loadOnDiskDescriptionUpdates(project, 'warehouse', snap), + }); + await writeLocalScanEnrichmentArtifacts({ + project, + connectionId: 'warehouse', + syncId: 'sync-preserve-rel', + driver: 'sqlite', + enrichment: result, + dryRun: false, + }); + + const after = await readFile(join(project.projectDir, shardPath), 'utf-8'); + // Every prior ai:/db: description survived the relationships-only run... + expect(countKeyOccurrences(after, 'ai')).toBe(aiBefore); + expect(countKeyOccurrences(after, 'db')).toBe(dbBefore); + expect(after).toContain('AI orders table'); + expect(after).toContain('AI account ref'); + // ...and the relationships stage actually added joins (it was 0 before). + expect(result.relationships.accepted).toBeGreaterThan(0); + const shard = YAML.parse(after) as { tables: Record<string, { joins?: unknown[] }> }; + expect(Object.values(shard.tables).some((table) => (table.joins ?? []).length > 0)).toBe(true); + } finally { + executor.close(); + await rm(tempDir, { recursive: true, force: true }); + } + }); + + it('a --stages descriptions run preserves on-disk joins while refreshing descriptions', async () => { + const tempDir = await mkdtemp(join(tmpdir(), 'ktx-stage-preserve-desc-')); + try { + const project = await initKtxProject({ projectDir: join(tempDir, 'project') }); + const shardPath = 'semantic-layer/warehouse/_schema/public.yaml'; + // Fixture: an inferred join present, descriptions absent. 
+ await project.fileStore.writeFile( + shardPath, + YAML.stringify( + { + tables: { + accounts: { table: 'accounts', columns: [{ name: 'id', type: 'number' }] }, + orders: { + table: 'orders', + columns: [ + { name: 'id', type: 'number' }, + { name: 'account_id', type: 'number' }, + ], + joins: [ + { to: 'accounts', on: 'orders.account_id = accounts.id', relationship: 'many_to_one', source: 'inferred' }, + ], + }, + }, + }, + { indent: 2, lineWidth: 0 }, + ), + 'ktx', + 'ktx@example.com', + 'Seed joins fixture', + ); + + const scanConnector = { + ...connector(), + driver: 'sqlite' as const, + introspect: vi.fn(async () => enrichedFixtureSnapshot()), + }; + const result = await runLocalScanEnrichment({ + connectionId: 'warehouse', + mode: 'enriched', + detectRelationships: true, + connector: scanConnector, + context: { runId: 'preserve-desc-1' }, + providers: createDeterministicLocalScanEnrichmentProviders(), + stages: ['descriptions'], + syncId: 'sync-preserve-desc', + loadPriorDescriptions: (snap) => loadOnDiskDescriptionUpdates(project, 'warehouse', snap), + }); + await writeLocalScanEnrichmentArtifacts({ + project, + connectionId: 'warehouse', + syncId: 'sync-preserve-desc', + driver: 'sqlite', + enrichment: result, + dryRun: false, + }); + + const after = await readFile(join(project.projectDir, shardPath), 'utf-8'); + const shard = YAML.parse(after) as { + tables: Record<string, { joins?: Array<{ to: string; source?: string }> }>; + }; + // The inferred join survived the descriptions-only run... + expect(shard.tables.orders?.joins?.some((join) => join.to === 'accounts' && join.source === 'inferred')).toBe(true); + // ...and the descriptions stage (re)wrote ai descriptions. 
+ expect(countKeyOccurrences(after, 'ai')).toBeGreaterThan(0); + } finally { + await rm(tempDir, { recursive: true, force: true }); + } + }); }); diff --git a/packages/cli/test/context/scan/local-scan.test.ts b/packages/cli/test/context/scan/local-scan.test.ts index 1021a139..5bc60922 100644 --- a/packages/cli/test/context/scan/local-scan.test.ts +++ b/packages/cli/test/context/scan/local-scan.test.ts @@ -96,6 +96,7 @@ function deterministicLlmRuntime(): KtxLlmRuntimePort { generateText: vi.fn(async (input) => `Deterministic description for ${input.prompt.slice(0, 64).trim() || 'data source'}`), generateObject: vi.fn(async () => ({ pkCandidates: [], fkCandidates: [] }) as never), runAgentLoop: vi.fn(), + subprocessForkSpec: () => null, }; } @@ -1672,6 +1673,111 @@ describe('local scan', () => { expect(persistedReport).toContain('embedding service timed out'); }); + it('keeps AI descriptions in the queryable _schema when the relationship stage fails after enrichment', async () => { + // Durability: the paid descriptions are checkpointed into the queryable + // manifest before relationship detection runs, so a relationship-stage + // failure degrades to "no joins", never "no descriptions". 
+ project.config.scan.enrichment = { mode: 'deterministic' }; + const connector = { + id: 'test:warehouse', + driver: 'postgres' as const, + capabilities: { + structuralIntrospection: true as const, + tableSampling: true, + columnSampling: true, + columnStats: true, + readOnlySql: true, + nestedAnalysis: false, + eventStreamDiscovery: false, + formalForeignKeys: false, + estimatedRowCounts: true, + }, + ...connectorScopeListing, + async introspect() { + return { + connectionId: 'warehouse', + driver: 'postgres' as const, + extractedAt: '2026-04-29T09:00:00.000Z', + scope: { schemas: ['public'] }, + metadata: {}, + tables: [ + { + catalog: null, + db: 'public', + name: 'customers', + kind: 'table' as const, + comment: 'Customer accounts', + estimatedRows: 100, + foreignKeys: [], + columns: [ + { + name: 'id', + nativeType: 'integer', + normalizedType: 'integer', + dimensionType: 'number' as const, + nullable: false, + primaryKey: true, + comment: 'Customer id', + }, + ], + }, + { + catalog: null, + db: 'public', + name: 'orders', + kind: 'table' as const, + comment: 'Customer orders', + estimatedRows: 1000, + foreignKeys: [], + columns: [ + { + name: 'customer_id', + nativeType: 'integer', + normalizedType: 'integer', + dimensionType: 'number' as const, + nullable: false, + primaryKey: false, + comment: 'Owning customer', + }, + ], + }, + ], + }; + }, + async sampleTable() { + return { headers: ['id'], rows: [[1]], totalRows: 1 }; + }, + async sampleColumn() { + return { values: ['1'], nullCount: 0, distinctCount: 1 }; + }, + // Profiling succeeds; the coverage probe in the relationship stage throws, + // standing in for a relationship-stage interruption after enrichment. 
+ async executeReadOnly(input: KtxReadOnlyQueryInput) { + return relationshipSqlResult(input, { throwOnCoverage: true }); + }, + }; + + const result = await runLocalScan({ + project, + adapters: [fetchOnlyAdapter({ snapshot: await connector.introspect() })], + connectionId: 'warehouse', + mode: 'enriched', + detectRelationships: true, + connector, + jobId: 'scan-checkpoint-durability-1', + now: () => new Date('2026-04-29T09:20:00.000Z'), + }); + + expect(result.report.warnings.map((warning) => warning.code)).toContain('enrichment_failed'); + + const manifestRaw = await readFile( + join(project.projectDir, 'semantic-layer/warehouse/_schema/public.yaml'), + 'utf-8', + ); + expect(manifestRaw).toContain('ai: |-'); + expect(manifestRaw).toContain('Deterministic description'); + }); + it('resumes completed local enrichment stages when an enriched scan run is retried', async () => { let embeddingAttempts = 0; const connector = { @@ -1928,6 +2034,147 @@ describe('local scan', () => { 'raw-sources/warehouse/live-database/2026-04-29-160000-scan-run-sqlserver/scan-report.json', ); }); + + // Regression (spec 21 defect, 2026-06-24): the structural manifest write that runs + // BEFORE enrichment must not let a `--stages` subset delete the prior on-disk + // descriptions. This goes through the full runLocalScan path (the unit-level + // enrichment test could not catch the structural-pre-write ordering). 
+ it('a --stages relationships scan preserves on-disk descriptions while adding joins', async () => { + const snapshot: KtxSchemaSnapshot = { + connectionId: 'warehouse', + driver: 'postgres', + extractedAt: '2026-05-07T09:00:00.000Z', + scope: {}, + metadata: {}, + tables: [ + { + catalog: null, + db: null, + name: 'accounts', + kind: 'table', + comment: null, + estimatedRows: 2, + foreignKeys: [], + columns: [ + { + name: 'id', + nativeType: 'integer', + normalizedType: 'integer', + dimensionType: 'number', + nullable: false, + primaryKey: false, + comment: null, + }, + ], + }, + { + catalog: null, + db: null, + name: 'orders', + kind: 'table', + comment: null, + estimatedRows: 3, + foreignKeys: [], + columns: [ + { + name: 'id', + nativeType: 'integer', + normalizedType: 'integer', + dimensionType: 'number', + nullable: false, + primaryKey: false, + comment: null, + }, + { + name: 'account_id', + nativeType: 'integer', + normalizedType: 'integer', + dimensionType: 'number', + nullable: false, + primaryKey: false, + comment: null, + }, + ], + }, + ], + }; + // Enriched fixture already on disk: ai descriptions, zero joins. + await project.fileStore.writeFile( + 'semantic-layer/warehouse/_schema/public.yaml', + YAML.stringify( + { + tables: { + accounts: { + table: 'accounts', + descriptions: { ai: 'AI accounts table' }, + columns: [{ name: 'id', type: 'number', descriptions: { ai: 'AI accounts id' } }], + }, + orders: { + table: 'orders', + descriptions: { ai: 'AI orders table' }, + columns: [ + { name: 'id', type: 'number', descriptions: { ai: 'AI orders id' } }, + { name: 'account_id', type: 'number', descriptions: { ai: 'AI account ref' } }, + ], + }, + }, + }, + { indent: 2, lineWidth: 0 }, + ), + 'ktx', + 'ktx@example.com', + 'Seed enriched fixture', + ); + const shardPath = 'semantic-layer/warehouse/_schema/public.yaml'; + const aiBefore = ((await project.fileStore.readFile(shardPath)).content.match(/\bai:/g) ?? 
[]).length; + expect(aiBefore).toBe(5); + + const connector: KtxScanConnector = { + id: 'test:warehouse', + driver: 'postgres', + capabilities: { + structuralIntrospection: true, + tableSampling: false, + columnSampling: false, + columnStats: true, + readOnlySql: true, + nestedAnalysis: false, + eventStreamDiscovery: false, + formalForeignKeys: false, + estimatedRowCounts: true, + }, + ...connectorScopeListing, + introspect: vi.fn(async () => snapshot), + async executeReadOnly(input: KtxReadOnlyQueryInput) { + return relationshipSqlResult(input); + }, + }; + + const result = await runLocalScan({ + project, + adapters: [fetchOnlyAdapter({ snapshot })], + connectionId: 'warehouse', + mode: 'enriched', + detectRelationships: true, + stages: ['relationships'], + connector, + enrichmentProviders: { llmRuntime: deterministicLlmRuntime() }, + jobId: 'scan-stages-relationships-preserve', + now: () => new Date('2026-05-07T09:30:00.000Z'), + }); + + // The relationships stage actually ran and accepted a join... + expect(result.report.relationships.accepted).toBe(1); + const after = (await project.fileStore.readFile(shardPath)).content; + // ...and every prior ai description survived the structural + enrichment writes. + expect((after.match(/\bai:/g) ?? 
+ []).length).toBe(aiBefore); + expect(after).toContain('AI orders table'); + expect(after).toContain('AI account ref'); + const manifest = YAML.parse(after) as { + tables: Record<string, { joins?: Array<{ to: string }> }>; + }; + expect(manifest.tables.orders?.joins?.some((join) => join.to === 'accounts')).toBe(true); + }); }); describe('resolveEnabledTables', () => { diff --git a/packages/cli/test/context/scan/object-introspection.test.ts b/packages/cli/test/context/scan/object-introspection.test.ts new file mode 100644 index 00000000..fa01aeba --- /dev/null +++ b/packages/cli/test/context/scan/object-introspection.test.ts @@ -0,0 +1,47 @@ +import { describe, expect, it } from 'vitest'; +import { tryIntrospectObject } from '../../../src/context/scan/object-introspection.js'; + +describe('tryIntrospectObject', () => { + it('returns the read value when introspection succeeds', async () => { + await expect(tryIntrospectObject({ object: 'customers' }, () => ({ name: 'customers' }))).resolves.toEqual({ + ok: true, + table: { name: 'customers' }, + }); + }); + + it('skips with a recoverable warning when the object read throws', async () => { + const outcome = await tryIntrospectObject({ object: 'broken_view', db: 'main' }, () => { + throw new Error('no such column: ehp.start_date'); + }); + + expect(outcome).toEqual({ + ok: false, + warning: { + code: 'object_introspection_failed', + message: 'no such column: ehp.start_date', + table: 'broken_view', + recoverable: true, + metadata: { object: 'main.broken_view', db: 'main' }, + }, + }); + }); + + it('rethrows native programming faults instead of masking them as object skips', async () => { + await expect( + tryIntrospectObject({ object: 'customers' }, () => { + throw new TypeError('cannot read properties of undefined'); + }), + ).rejects.toBeInstanceOf(TypeError); + }); + + it('builds a fully-qualified object label for warehouse objects', async () => { + const outcome = await tryIntrospectObject({ object: 'orders', db: 'sales', catalog: 'warehouse' }, () => { + 
throw new Error('permission denied'); + }); + expect(outcome.ok).toBe(false); + if (!outcome.ok) { + expect(outcome.warning.table).toBe('orders'); + expect(outcome.warning.metadata).toEqual({ object: 'warehouse.sales.orders', db: 'sales', catalog: 'warehouse' }); + } + }); +}); diff --git a/packages/cli/test/context/scan/relationship-detection-budget.test.ts b/packages/cli/test/context/scan/relationship-detection-budget.test.ts new file mode 100644 index 00000000..af90d69c --- /dev/null +++ b/packages/cli/test/context/scan/relationship-detection-budget.test.ts @@ -0,0 +1,72 @@ +import { describe, expect, it } from 'vitest'; +import { + createKtxRelationshipDetectionBudget, + mapWithBudget, +} from '../../../src/context/scan/relationship-detection-budget.js'; + +describe('relationship detection budget', () => { + it('reports no stop while inside the wall-clock budget', () => { + let clock = 1000; + const budget = createKtxRelationshipDetectionBudget({ budgetMs: 500, now: () => clock }); + expect(budget.check()).toBeNull(); + clock = 1400; + expect(budget.check()).toBeNull(); + expect(budget.stopReason()).toBeNull(); + }); + + it('trips on budget exhaustion and records it stickily', () => { + let clock = 0; + const budget = createKtxRelationshipDetectionBudget({ budgetMs: 100, now: () => clock }); + clock = 150; + expect(budget.check()).toBe('budget'); + // Even after a notional clock rewind the recorded reason persists. 
+ clock = 10; + expect(budget.stopReason()).toBe('budget'); + }); + + it('prefers abort over budget when the signal fires', () => { + const controller = new AbortController(); + let clock = 0; + const budget = createKtxRelationshipDetectionBudget({ + budgetMs: 1_000, + signal: controller.signal, + now: () => clock, + }); + expect(budget.check()).toBeNull(); + controller.abort(); + expect(budget.check()).toBe('aborted'); + expect(budget.stopReason()).toBe('aborted'); + }); + + it('maps every item and stays unmarked when the budget is never exhausted', async () => { + const budget = createKtxRelationshipDetectionBudget({ budgetMs: 1_000, now: () => 0 }); + const { results, processedCount } = await mapWithBudget({ + inputs: [1, 2, 3, 4], + concurrency: 2, + budget, + mapOne: async (value) => value * 10, + }); + expect(processedCount).toBe(4); + expect(results).toEqual([10, 20, 30, 40]); + expect(budget.stopReason()).toBeNull(); + }); + + it('stops claiming new items once the budget trips and leaves the rest undefined', async () => { + let clock = 0; + const budget = createKtxRelationshipDetectionBudget({ budgetMs: 25, now: () => clock }); + const started: number[] = []; + const { results, processedCount } = await mapWithBudget({ + inputs: [0, 1, 2, 3, 4], + concurrency: 1, + budget, + onStart: (index) => { + started.push(index); + clock += 10; // each unit advances the clock; the budget elapses partway through + }, + mapOne: async (value) => value, + }); + expect(processedCount).toBeLessThan(5); + expect(results.slice(processedCount).every((value) => value === undefined)).toBe(true); + expect(budget.stopReason()).toBe('budget'); + }); +}); diff --git a/packages/cli/test/context/scan/relationship-diagnostics.test.ts b/packages/cli/test/context/scan/relationship-diagnostics.test.ts index 8bad3b4f..786c1d23 100644 --- a/packages/cli/test/context/scan/relationship-diagnostics.test.ts +++ b/packages/cli/test/context/scan/relationship-diagnostics.test.ts @@ -315,6 +315,26 
@@ describe('relationship diagnostics artifacts', () => { expect(diagnostics.summary).toEqual({ accepted: 0, review: 0, rejected: 0, skipped: 0 }); expect(diagnostics.noAcceptedReason).toBe('no candidate pairs passed type compatibility'); expect(diagnostics.candidateCountsBySource).toEqual({}); + expect(diagnostics.partial).toBe(false); + expect(diagnostics.partialReason).toBeNull(); + }); + + it('marks the diagnostics partial with its stop reason when relationship detection was truncated', () => { + const artifacts = buildKtxRelationshipArtifacts({ connectionId: 'warehouse' }); + const diagnostics = buildKtxRelationshipDiagnostics({ + connectionId: 'warehouse', + generatedAt: '2026-05-07T12:00:00.000Z', + artifacts, + profile: emptyKtxRelationshipProfileArtifact({ + connectionId: 'warehouse', + driver: 'sqlite', + reason: 'relationship_profiling_not_run', + }), + partial: { reason: 'budget' }, + }); + + expect(diagnostics.partial).toBe(true); + expect(diagnostics.partialReason).toBe('budget'); }); it('records composite relationship endpoints in relationship artifacts', () => { diff --git a/packages/cli/test/context/scan/relationship-discovery.test.ts b/packages/cli/test/context/scan/relationship-discovery.test.ts index cebb2969..2fb3b91d 100644 --- a/packages/cli/test/context/scan/relationship-discovery.test.ts +++ b/packages/cli/test/context/scan/relationship-discovery.test.ts @@ -224,6 +224,7 @@ function llmRuntime(output: unknown): KtxLlmRuntimePort { generateText: vi.fn(), generateObject: vi.fn(async () => output) as KtxLlmRuntimePort['generateObject'], runAgentLoop: vi.fn(), + subprocessForkSpec: () => null, }; } @@ -338,6 +339,126 @@ describe('production relationship discovery', () => { }); }); + it('emits per-table profiling and per-candidate validation progress', async () => { + executor = new InMemorySqliteExecutor(); + executor.db.exec(` + CREATE TABLE accounts (id INTEGER NOT NULL, name TEXT NOT NULL); + CREATE TABLE orders (id INTEGER NOT NULL, 
account_id INTEGER NOT NULL); + INSERT INTO accounts (id, name) VALUES (1, 'Acme'), (2, 'Globex'); + INSERT INTO orders (id, account_id) VALUES (10, 1), (11, 1), (12, 2); + `); + const messages: string[] = []; + const progress = { + async update(_progress: number, message?: string) { + if (message) { + messages.push(message); + } + }, + startPhase() { + return progress; + }, + }; + + const result = await discoverKtxRelationships({ + connectionId: 'warehouse', + dialect: getSqlDialectForDriver('sqlite'), + connector: connector(executor), + schema: snapshotToKtxEnrichedSchema(snapshot()), + context: { runId: 'relationship-progress' }, + settings: relationshipSettings(), + progress, + }); + + expect(result.partial).toBeNull(); + expect(messages).toContain('Profiling table 1/2'); + expect(messages).toContain('Profiling table 2/2'); + expect(messages.some((message) => message.startsWith('Validating candidate '))).toBe(true); + }); + + it('returns a partial result when the wall-clock budget is exhausted, without throwing', async () => { + executor = new InMemorySqliteExecutor(); + executor.db.exec(` + CREATE TABLE accounts (id INTEGER NOT NULL, name TEXT NOT NULL); + CREATE TABLE orders (id INTEGER NOT NULL, account_id INTEGER NOT NULL); + INSERT INTO accounts (id, name) VALUES (1, 'Acme'), (2, 'Globex'); + INSERT INTO orders (id, account_id) VALUES (10, 1), (11, 1), (12, 2); + `); + // A clock that jumps a full second per read against a 1ms budget exhausts + // the budget at the very first unit boundary. 
+ let calls = 0; + const now = () => calls++ * 1000; + + const result = await discoverKtxRelationships({ + connectionId: 'warehouse', + dialect: getSqlDialectForDriver('sqlite'), + connector: connector(executor), + schema: snapshotToKtxEnrichedSchema(snapshot()), + context: { runId: 'relationship-budget' }, + settings: { ...relationshipSettings(), detectionBudgetMs: 1 }, + now, + }); + + expect(result.partial).toEqual({ reason: 'budget' }); + expect(result.relationships.accepted).toBe(0); + }); + + it('does not start the LLM relationship proposal once the budget is exhausted', async () => { + executor = new InMemorySqliteExecutor(); + executor.db.exec(` + CREATE TABLE accounts (id INTEGER NOT NULL, name TEXT NOT NULL); + CREATE TABLE orders (id INTEGER NOT NULL, account_id INTEGER NOT NULL); + INSERT INTO accounts (id, name) VALUES (1, 'Acme'), (2, 'Globex'); + INSERT INTO orders (id, account_id) VALUES (10, 1), (11, 1), (12, 2); + `); + let calls = 0; + const now = () => calls++ * 1000; + const generateObject = vi.fn(async () => ({ pkCandidates: [], fkCandidates: [] })); + const runtime: KtxLlmRuntimePort = { + generateText: vi.fn(), + generateObject: generateObject as KtxLlmRuntimePort['generateObject'], + runAgentLoop: vi.fn(), + subprocessForkSpec: () => null, + }; + + const result = await discoverKtxRelationships({ + connectionId: 'warehouse', + dialect: getSqlDialectForDriver('sqlite'), + connector: connector(executor), + schema: snapshotToKtxEnrichedSchema(snapshot()), + context: { runId: 'relationship-budget-llm' }, + settings: { ...relationshipSettings(), detectionBudgetMs: 1 }, + llmRuntime: runtime, + now, + }); + + expect(result.partial).toEqual({ reason: 'budget' }); + expect(result.llmRelationshipValidation).toBe('skipped'); + expect(generateObject).not.toHaveBeenCalled(); + }); + + it('returns a partial result when the scan signal is already aborted', async () => { + executor = new InMemorySqliteExecutor(); + executor.db.exec(` + CREATE TABLE 
accounts (id INTEGER NOT NULL, name TEXT NOT NULL); + CREATE TABLE orders (id INTEGER NOT NULL, account_id INTEGER NOT NULL); + INSERT INTO accounts (id, name) VALUES (1, 'Acme'), (2, 'Globex'); + INSERT INTO orders (id, account_id) VALUES (10, 1), (11, 1), (12, 2); + `); + + const result = await discoverKtxRelationships({ + connectionId: 'warehouse', + dialect: getSqlDialectForDriver('sqlite'), + connector: connector(executor), + schema: snapshotToKtxEnrichedSchema(snapshot()), + context: { runId: 'relationship-aborted', signal: AbortSignal.abort() }, + settings: relationshipSettings(), + }); + + expect(result.partial).toEqual({ reason: 'aborted' }); + // A stop-before-completion must not be reported as completed statistical validation. + expect(result.statisticalValidation).toBe('skipped'); + }); + it('accepts a profile-driven natural-key relationship without declared metadata', async () => { executor = new InMemorySqliteExecutor(); executor.db.exec(` diff --git a/packages/cli/test/context/scan/relationship-llm-proposal.test.ts b/packages/cli/test/context/scan/relationship-llm-proposal.test.ts index 3c4cb5f0..12ec0940 100644 --- a/packages/cli/test/context/scan/relationship-llm-proposal.test.ts +++ b/packages/cli/test/context/scan/relationship-llm-proposal.test.ts @@ -9,6 +9,7 @@ function llmRuntime(output?: unknown): KtxLlmRuntimePort { generateText: vi.fn(), generateObject: vi.fn(async () => output) as KtxLlmRuntimePort['generateObject'], runAgentLoop: vi.fn(), + subprocessForkSpec: () => null, }; } @@ -202,6 +203,7 @@ describe('relationship LLM proposals', () => { throw new Error('model unavailable'); }), runAgentLoop: vi.fn(), + subprocessForkSpec: () => null, }, }); expect(failed).toMatchObject({ candidates: [], llmCalls: 1, summary: 'failed' }); diff --git a/packages/cli/test/context/scan/relationship-validation.test.ts b/packages/cli/test/context/scan/relationship-validation.test.ts index 98826aed..2f3f3fe3 100644 --- 
a/packages/cli/test/context/scan/relationship-validation.test.ts +++ b/packages/cli/test/context/scan/relationship-validation.test.ts @@ -1,5 +1,6 @@ import Database from 'better-sqlite3'; import { afterEach, describe, expect, it } from 'vitest'; +import { KtxQueryError } from '../../../src/errors.js'; import { getSqlDialectForDriver } from '../../../src/context/connections/dialects.js'; import type { KtxEnrichedColumn, KtxEnrichedSchema, KtxEnrichedTable } from '../../../src/context/scan/enrichment-types.js'; import { generateKtxRelationshipDiscoveryCandidates } from '../../../src/context/scan/relationship-candidates.js'; @@ -139,6 +140,54 @@ describe('relationship validation', () => { expect(validated[0]?.score).toBeGreaterThanOrEqual(0.85); }); + it('sends a candidate to review (not source-fatal) when its validation query times out', async () => { + executor = new InMemorySqliteExecutor(); + executor.db.exec(` + CREATE TABLE accounts (id INTEGER, name TEXT); + CREATE TABLE users (id INTEGER, account_id INTEGER); + CREATE TABLE invoices (id INTEGER, account_id INTEGER); + INSERT INTO accounts (id, name) VALUES (1, 'Acme'), (2, 'Globex'), (3, 'Initech'); + INSERT INTO users (id, account_id) VALUES (10, 1), (11, 2), (12, 3); + INSERT INTO invoices (id, account_id) VALUES (20, 1), (21, 2), (22, 999); + `); + const testSchema = schema(); + const profiles = await profileKtxRelationshipSchema({ + connectionId: 'warehouse', + driver: 'sqlite', + dialect: getSqlDialectForDriver('sqlite'), + schema: testSchema, + executor, + ctx: { runId: 'validate-test' }, + }); + const candidates = generateKtxRelationshipDiscoveryCandidates(testSchema).filter( + (candidate) => candidate.from.table.name === 'users', + ); + + const warnings: string[] = []; + const timingOutExecutor = { + executeReadOnly: () => Promise.reject(new KtxQueryError('query exceeded 30s')), + }; + const validated = await validateKtxRelationshipDiscoveryCandidates({ + connectionId: 'warehouse', + dialect: 
getSqlDialectForDriver('sqlite'), + candidates, + profiles, + executor: timingOutExecutor, + ctx: { + runId: 'validate-test', + logger: { debug() {}, info() {}, warn: (message) => warnings.push(message), error() {} }, + }, + tableCount: testSchema.tables.length, + }); + + expect(validated).toHaveLength(1); + expect(validated[0]).toMatchObject({ + status: 'review', + validation: { reasons: ['validation_query_failed'] }, + }); + expect(warnings.some((message) => message.includes('query exceeded 30s'))).toBe(true); + }); + it('rejects a candidate with missing parent values and records the deterministic reason', async () => { executor = new InMemorySqliteExecutor(); executor.db.exec(` diff --git a/packages/cli/test/context/wiki/local-knowledge.test.ts b/packages/cli/test/context/wiki/local-knowledge.test.ts index cda5ca1a..8f6114b3 100644 --- a/packages/cli/test/context/wiki/local-knowledge.test.ts +++ b/packages/cli/test/context/wiki/local-knowledge.test.ts @@ -6,10 +6,12 @@ import { initKtxProject, type KtxLocalProject } from '../../../src/context/proje import { listLocalKnowledgePageKeys, listLocalKnowledgePages, + listReferencedConnectionIds, readLocalKnowledgePage, searchLocalKnowledgePages, writeLocalKnowledgePage, } from '../../../src/context/wiki/local-knowledge.js'; +import { SqliteKnowledgeIndex } from '../../../src/context/wiki/sqlite-knowledge-index.js'; class FakeEmbeddingPort { readonly maxBatchSize = 16; @@ -284,6 +286,203 @@ describe('local knowledge helpers', () => { expect(raw.content).toContain(['fingerprints:', ' - fp_paid_orders'].join('\n')); }); + it('round-trips a connections list through write, read, and list', async () => { + await writeLocalKnowledgePage(project, { + key: 'orders-sales-db', + scope: 'GLOBAL', + summary: 'Orders concept for the sales database', + content: 'In sales_db, orders are recognized when paid.', + connections: ['sales_db'], + }); + + const raw = await project.fileStore.readFile('wiki/global/orders-sales-db.md'); + 
expect(raw.content).toContain(['connections:', ' - sales_db'].join('\n')); + + await expect(readLocalKnowledgePage(project, { key: 'orders-sales-db', userId: 'local' })).resolves.toMatchObject({ + key: 'orders-sales-db', + connections: ['sales_db'], + }); + }); + + it('normalizes a single connections string to a list at parse time', async () => { + await project.fileStore.writeFile( + 'wiki/global/single-scoped.md', + '---\nsummary: Single connection as scalar\nusage_mode: auto\nconnections: events_db\n---\n\nBody\n', + 'Test', + 'test@example.com', + 'Write scalar connections page', + ); + + await expect(readLocalKnowledgePage(project, { key: 'single-scoped', userId: 'local' })).resolves.toMatchObject({ + key: 'single-scoped', + connections: ['events_db'], + }); + }); + + it('treats an absent connections field as unscoped (empty list)', async () => { + await writeLocalKnowledgePage(project, { + key: 'fiscal-year', + scope: 'GLOBAL', + summary: 'Org-wide fiscal year', + content: 'Fiscal year starts in February.', + }); + + await expect(readLocalKnowledgePage(project, { key: 'fiscal-year', userId: 'local' })).resolves.toMatchObject({ + key: 'fiscal-year', + connections: [], + }); + }); + + it('scopes search to unscoped pages plus pages listing the requested connection', async () => { + await writeLocalKnowledgePage(project, { + key: 'orders-sales-db', + scope: 'GLOBAL', + summary: 'Sales DB orders', + content: 'Orders are paid in the sales database.', + connections: ['sales_db'], + }); + await writeLocalKnowledgePage(project, { + key: 'orders-events-db', + scope: 'GLOBAL', + summary: 'Events DB orders', + content: 'Orders are paid in the events database.', + connections: ['events_db'], + }); + await writeLocalKnowledgePage(project, { + key: 'orders-global', + scope: 'GLOBAL', + summary: 'Org-wide orders note', + content: 'Orders are paid everywhere in the org.', + }); + + const scoped = await searchLocalKnowledgePages(project, { + query: 'orders paid', + userId: 
'local', + connectionId: 'sales_db', + }); + const keys = scoped.map((result) => result.key).sort(); + expect(keys).toEqual(['orders-global', 'orders-sales-db']); + expect(keys).not.toContain('orders-events-db'); + + const unfiltered = await searchLocalKnowledgePages(project, { query: 'orders paid', userId: 'local' }); + expect(unfiltered.map((result) => result.key).sort()).toEqual([ + 'orders-events-db', + 'orders-global', + 'orders-sales-db', + ]); + }); + + it('keeps other-connection pages and embeddings in the sqlite index after a scoped search', async () => { + const embedding = new FakeEmbeddingPort(); + await writeLocalKnowledgePage(project, { + key: 'orders-sales-db', + scope: 'GLOBAL', + summary: 'Sales DB orders', + content: 'Orders are paid in the sales database.', + connections: ['sales_db'], + }); + await writeLocalKnowledgePage(project, { + key: 'orders-events-db', + scope: 'GLOBAL', + summary: 'Events DB orders', + content: 'Orders are paid in the events database.', + connections: ['events_db'], + }); + + const scoped = await searchLocalKnowledgePages(project, { + query: 'orders paid', + userId: 'local', + connectionId: 'sales_db', + embeddingService: embedding, + }); + expect(scoped.map((result) => result.key)).toEqual(['orders-sales-db']); + + // A connection-scoped search must not prune the other connection's page (or + // its cached embedding) from the shared persistent index. 
+ const index = new SqliteKnowledgeIndex({ dbPath: join(project.projectDir, '.ktx', 'db.sqlite') }); + const indexed = index.getExistingPages(); + expect([...indexed.keys()].sort()).toEqual([ + 'wiki/global/orders-events-db.md', + 'wiki/global/orders-sales-db.md', + ]); + expect(indexed.get('wiki/global/orders-events-db.md')?.embedding).not.toBeNull(); + }); + + it('filters search per connection across lexical and token lanes when embeddings are disabled', async () => { + await writeLocalKnowledgePage(project, { + key: 'rfm-events-db', + scope: 'GLOBAL', + summary: 'RFM definition for events_db', + content: 'RFM segmentation rules for the events database.', + connections: ['events_db'], + }); + await writeLocalKnowledgePage(project, { + key: 'rfm-sales-db', + scope: 'GLOBAL', + summary: 'RFM definition for sales_db', + content: 'RFM segmentation rules for the sales database.', + connections: ['sales_db'], + }); + + const lexical = await searchLocalKnowledgePages(project, { + query: 'rfm segmentation', + userId: 'local', + connectionId: 'events_db', + }); + expect(lexical.map((result) => result.key)).toEqual(['rfm-events-db']); + + const token = await searchLocalKnowledgePages(project, { + query: 'segmentation---', + userId: 'local', + connectionId: 'events_db', + }); + expect(token.map((result) => result.key)).toEqual(['rfm-events-db']); + }); + + it('filters list output by connection while keeping unscoped pages', async () => { + await writeLocalKnowledgePage(project, { + key: 'orders-sales-db', + scope: 'GLOBAL', + summary: 'Sales DB orders', + content: 'Sales orders.', + connections: ['sales_db'], + }); + await writeLocalKnowledgePage(project, { + key: 'orders-events-db', + scope: 'GLOBAL', + summary: 'Events DB orders', + content: 'Events orders.', + connections: ['events_db'], + }); + await writeLocalKnowledgePage(project, { + key: 'orders-global', + scope: 'GLOBAL', + summary: 'Org-wide orders', + content: 'Global orders.', + }); + + const scoped = await 
listLocalKnowledgePages(project, { userId: 'local', connectionId: 'sales_db' }); + expect(scoped.map((page) => page.key).sort()).toEqual(['orders-global', 'orders-sales-db']); + }); + + it('keeps a page referencing an unconfigured connection searchable and readable', async () => { + await writeLocalKnowledgePage(project, { + key: 'rfm-removed-db', + scope: 'GLOBAL', + summary: 'RFM for a since-removed database', + content: 'RFM rules.', + connections: ['removed_db'], + }); + + await expect(readLocalKnowledgePage(project, { key: 'rfm-removed-db', userId: 'local' })).resolves.toMatchObject({ + key: 'rfm-removed-db', + connections: ['removed_db'], + }); + const search = await searchLocalKnowledgePages(project, { query: 'rfm rules', userId: 'local' }); + expect(search.map((result) => result.key)).toContain('rfm-removed-db'); + await expect(listReferencedConnectionIds(project, { userId: 'local' })).resolves.toEqual(['removed_db']); + }); + it('falls back to Markdown scanning when the config does not select sqlite-fts5', async () => { project.config.storage.search = 'postgres-hybrid'; await writeLocalKnowledgePage(project, { diff --git a/packages/cli/test/context/wiki/sqlite-knowledge-index.test.ts b/packages/cli/test/context/wiki/sqlite-knowledge-index.test.ts index 5a3b0dc1..55441595 100644 --- a/packages/cli/test/context/wiki/sqlite-knowledge-index.test.ts +++ b/packages/cli/test/context/wiki/sqlite-knowledge-index.test.ts @@ -142,6 +142,49 @@ describe('SqliteKnowledgeIndex', () => { ]); }); + it('restricts lexical candidates to the allowlist', () => { + const index = new SqliteKnowledgeIndex({ dbPath }); + index.sync([ + page({ path: 'wiki/global/revenue.md', key: 'revenue' }), + page({ path: 'wiki/global/support.md', key: 'support', content: 'Orders are paid by the support team.' 
}), + ]); + + expect( + index + .searchLexicalCandidates({ queryText: 'paid', limit: 10, allowedPaths: ['wiki/global/support.md'] }) + .map((row) => row.path), + ).toEqual(['wiki/global/support.md']); + }); + + it('applies the allowlist before the semantic limit so an in-scope match survives', () => { + const index = new SqliteKnowledgeIndex({ dbPath }); + index.sync([ + page({ path: 'wiki/global/noise-a.md', key: 'noise-a', embedding: [1, 0] }), + page({ path: 'wiki/global/noise-b.md', key: 'noise-b', embedding: [1, 0] }), + page({ path: 'wiki/global/target.md', key: 'target', embedding: [1, 0] }), + ]); + + // All three tie on similarity; a limit of 1 over the full corpus drops the target. + expect(index.searchSemanticCandidates({ queryEmbedding: [1, 0], limit: 1 }).map((row) => row.path)).toEqual([ + 'wiki/global/noise-a.md', + ]); + + // Scoped to the target, the limit applies after the allowlist, so it survives. + expect( + index + .searchSemanticCandidates({ queryEmbedding: [1, 0], limit: 1, allowedPaths: ['wiki/global/target.md'] }) + .map((row) => row.path), + ).toEqual(['wiki/global/target.md']); + }); + + it('treats an empty allowlist as no page in scope', () => { + const index = new SqliteKnowledgeIndex({ dbPath }); + index.sync([page({ embedding: [1, 0] })]); + + expect(index.searchLexicalCandidates({ queryText: 'paid order', limit: 10, allowedPaths: [] })).toEqual([]); + expect(index.searchSemanticCandidates({ queryEmbedding: [1, 0], limit: 10, allowedPaths: [] })).toEqual([]); + }); + it('returns an empty result for blank or punctuation-only queries', () => { const index = new SqliteKnowledgeIndex({ dbPath }); index.rebuild([page()]); diff --git a/packages/cli/test/context/wiki/tools/wiki-write.tool.test.ts b/packages/cli/test/context/wiki/tools/wiki-write.tool.test.ts index deadd716..402b1fbf 100644 --- a/packages/cli/test/context/wiki/tools/wiki-write.tool.test.ts +++ b/packages/cli/test/context/wiki/tools/wiki-write.tool.test.ts @@ -263,6 +263,108 
@@ describe('WikiWriteTool', () => { }); }); + it('sets connections on a new page and normalizes a single string to a list', async () => { + const { tool, wikiService } = makeTool(); + + await tool.call( + { key: 'orders-sales-db', summary: 'Sales orders', content: '# Orders', connections: 'sales_db' } as any, + baseContext, + ); + + expect(wikiService.writePage.mock.calls[0][3]).toMatchObject({ connections: ['sales_db'] }); + }); + + it('applies REPLACE semantics for connections on update', async () => { + const existing = { + pageKey: 'orders', + frontmatter: { summary: 'Orders', usage_mode: 'auto' as const, sort_order: 0, connections: ['sales_db'] }, + content: 'body', + }; + // omit ⇒ keep existing connections + { + const { tool, wikiService } = makeTool({ wikiService: { readPage: vi.fn().mockResolvedValue(existing) } }); + await tool.call({ key: 'orders', summary: 'Orders', content: 'new body' } as any, baseContext); + expect(wikiService.writePage.mock.calls[0][3]).toMatchObject({ connections: ['sales_db'] }); + } + // [] ⇒ clear to unscoped + { + const { tool, wikiService } = makeTool({ wikiService: { readPage: vi.fn().mockResolvedValue(existing) } }); + await tool.call({ key: 'orders', summary: 'Orders', content: 'new body', connections: [] } as any, baseContext); + expect(wikiService.writePage.mock.calls[0][3]).toMatchObject({ connections: [] }); + } + // [ids] ⇒ set (broaden within overlap is allowed) + { + const { tool, wikiService } = makeTool({ wikiService: { readPage: vi.fn().mockResolvedValue(existing) } }); + await tool.call( + { key: 'orders', summary: 'Orders', content: 'new body', connections: ['sales_db', 'events_db'] } as any, + baseContext, + ); + expect(wikiService.writePage.mock.calls[0][3]).toMatchObject({ connections: ['sales_db', 'events_db'] }); + } + }); + + it('blocks a connection-scoped write whose key collides with a disjoint-connection page', async () => { + const { tool, wikiService } = makeTool({ + wikiService: { + readPage: 
vi.fn().mockResolvedValue({ + pageKey: 'orders', + frontmatter: { summary: 'Events orders', usage_mode: 'auto', sort_order: 0, connections: ['events_db'] }, + content: 'events body', + }), + }, + }); + + const result = await tool.call( + { key: 'orders', summary: 'Sales orders', content: 'sales body', connections: ['sales_db'] } as any, + baseContext, + ); + + expect(result.structured).toEqual({ success: false, key: 'orders' }); + expect(result.markdown).toContain('already exists scoped to a different connection'); + expect(result.markdown).toContain('orders_sales_db'); + expect(wikiService.writePage).not.toHaveBeenCalled(); + }); + + it('allows narrowing a connection-scoped page within its own scope', async () => { + const { tool, wikiService } = makeTool({ + wikiService: { + readPage: vi.fn().mockResolvedValue({ + pageKey: 'orders', + frontmatter: { summary: 'Orders', usage_mode: 'auto', sort_order: 0, connections: ['sales_db', 'events_db'] }, + content: 'body', + }), + }, + }); + + const result = await tool.call( + { key: 'orders', summary: 'Orders', content: 'body', connections: ['sales_db'] } as any, + baseContext, + ); + + expect(result.structured).toMatchObject({ success: true, action: 'updated' }); + expect(wikiService.writePage.mock.calls[0][3]).toMatchObject({ connections: ['sales_db'] }); + }); + + it('allows scoping a previously unscoped page (existing connections empty)', async () => { + const { tool, wikiService } = makeTool({ + wikiService: { + readPage: vi.fn().mockResolvedValue({ + pageKey: 'orders', + frontmatter: { summary: 'Orders', usage_mode: 'auto', sort_order: 0 }, + content: 'body', + }), + }, + }); + + const result = await tool.call( + { key: 'orders', summary: 'Orders', content: 'body', connections: ['sales_db'] } as any, + baseContext, + ); + + expect(result.structured).toMatchObject({ success: true, action: 'updated' }); + expect(wikiService.writePage.mock.calls[0][3]).toMatchObject({ connections: ['sales_db'] }); + }); + it('rejects 
frontmatter refs that target missing wiki pages', async () => { const { tool, wikiService } = makeTool({ wikiService: { diff --git a/packages/cli/test/index.test.ts b/packages/cli/test/index.test.ts index b1901084..939a1dd3 100644 --- a/packages/cli/test/index.test.ts +++ b/packages/cli/test/index.test.ts @@ -989,6 +989,33 @@ describe('runKtxCli', () => { expect(testIo.stderr()).toMatch(/--text\/--file does not accept a positional connection id/); }); + it('threads --verbatim into the text ingest args', async () => { + const textIngest = vi.fn(async () => 0); + const testIo = makeIo(); + + await expect( + runKtxCli(['--project-dir', tempDir, 'ingest', '--file', 'doc.md', '--verbatim', '--json'], testIo.io, { + textIngest, + }), + ).resolves.toBe(0); + + expect(textIngest).toHaveBeenCalledWith(expect.objectContaining({ files: ['doc.md'], verbatim: true }), testIo.io); + }); + + it('rejects --verbatim without --text or --file', async () => { + const textIngest = vi.fn(async () => 0); + const publicIngest = vi.fn(async () => 0); + const testIo = makeIo(); + + await expect( + runKtxCli(['--project-dir', tempDir, 'ingest', '--verbatim'], testIo.io, { textIngest, publicIngest }), + ).resolves.toBe(1); + + expect(textIngest).not.toHaveBeenCalled(); + expect(publicIngest).not.toHaveBeenCalled(); + expect(testIo.stderr()).toMatch(/requires --text or --file/); + }); + it('treats bare ingest as ingest --all', async () => { const publicIngest = vi.fn().mockResolvedValue(0); const testIo = makeIo(); diff --git a/packages/cli/test/knowledge.test.ts b/packages/cli/test/knowledge.test.ts index 94e4bb63..7c97cc4c 100644 --- a/packages/cli/test/knowledge.test.ts +++ b/packages/cli/test/knowledge.test.ts @@ -3,8 +3,9 @@ import { tmpdir } from 'node:os'; import { join } from 'node:path'; import { stripVTControlCharacters } from 'node:util'; import { initKtxProject, loadKtxProject } from '../src/context/project/project.js'; +import { serializeKtxProjectConfig } from 
'../src/context/project/config.js'; import type { KtxEmbeddingPort } from '../src/context/core/embedding.js'; -import { writeLocalKnowledgePage } from '../src/context/wiki/local-knowledge.js'; +import { searchLocalKnowledgePages, writeLocalKnowledgePage } from '../src/context/wiki/local-knowledge.js'; import { afterEach, beforeEach, describe, expect, it, vi } from 'vitest'; import { runKtxKnowledge } from '../src/knowledge.js'; @@ -98,6 +99,118 @@ describe('runKtxKnowledge', () => { expect(searchIo.stdout()).toContain('metrics-revenue'); }); + it('scopes wiki list/search by --connection and rejects unknown ids', async () => { + const projectDir = join(tempDir, 'connection-project'); + await initKtxProject({ projectDir }); + const project = await loadKtxProject({ projectDir }); + project.config.connections.sales_db = { driver: 'sqlite', url: 'file:sales.db' }; + project.config.connections.events_db = { driver: 'sqlite', url: 'file:events.db' }; + await project.fileStore.writeFile( + 'ktx.yaml', + serializeKtxProjectConfig(project.config), + 'ktx', + 'ktx@example.com', + 'configure connections', + ); + await writeLocalKnowledgePage(project, { + key: 'orders-sales', + scope: 'GLOBAL', + summary: 'Sales orders', + content: 'Orders are paid in sales.', + connections: ['sales_db'], + }); + await writeLocalKnowledgePage(project, { + key: 'orders-events', + scope: 'GLOBAL', + summary: 'Events orders', + content: 'Orders are paid in events.', + connections: ['events_db'], + }); + await writeLocalKnowledgePage(project, { + key: 'orders-global', + scope: 'GLOBAL', + summary: 'Org-wide orders', + content: 'Orders are paid everywhere.', + }); + + const listIo = makeIo(); + await expect( + runKtxKnowledge( + { command: 'list', projectDir, userId: 'local', connectionId: 'sales_db', cliVersion: '0.0.0-test' }, + listIo.io, + ), + ).resolves.toBe(0); + expect(listIo.stdout()).toContain('orders-sales'); + expect(listIo.stdout()).toContain('orders-global'); + 
expect(listIo.stdout()).not.toContain('orders-events'); + + const searchIo = makeIo(); + await expect( + runKtxKnowledge( + { + command: 'search', + projectDir, + query: 'orders paid', + userId: 'local', + connectionId: 'events_db', + cliVersion: '0.0.0-test', + }, + searchIo.io, + ), + ).resolves.toBe(0); + expect(searchIo.stdout()).toContain('orders-events'); + expect(searchIo.stdout()).toContain('orders-global'); + expect(searchIo.stdout()).not.toContain('orders-sales'); + + const badIo = makeIo(); + await expect( + runKtxKnowledge( + { command: 'search', projectDir, query: 'orders', userId: 'local', connectionId: 'warehouse', cliVersion: '0.0.0-test' }, + badIo.io, + ), + ).resolves.toBe(1); + expect(badIo.stderr()).toContain('Unknown connection "warehouse". Configured connections: events_db, sales_db.'); + }); + + it('keeps a connection-scoped page that ranks below the lane candidate pool limit', async () => { + const projectDir = join(tempDir, 'scoped-pool-project'); + await initKtxProject({ projectDir }); + const project = await loadKtxProject({ projectDir }); + + // The lane candidate pool floor is 25; seed >25 other-connection pages so the + // single target-connection page only survives if scope is applied before the + // lane limit, not after. + for (let i = 0; i < 30; i++) { + await writeLocalKnowledgePage(project, { + key: `noise-${String(i).padStart(2, '0')}`, + scope: 'GLOBAL', + summary: 'Revenue', + content: 'Revenue is paid order value.', + connections: ['noise_db'], + }); + } + // Path sorts after every noise page, so a slice-before-filter lane drops it. + await writeLocalKnowledgePage(project, { + key: 'zzz-target', + scope: 'GLOBAL', + summary: 'Revenue', + content: 'Revenue is paid order value.', + connections: ['target_db'], + }); + + // "arr" matches the target only semantically (FakeEmbeddingPort), never by + // literal token, so the token lane cannot mask a dropped semantic hit. 
+ const results = await searchLocalKnowledgePages(project, { + query: 'arr', + userId: 'local', + connectionId: 'target_db', + embeddingService: new FakeEmbeddingPort(), + limit: 5, + }); + + expect(results.map((result) => result.key)).toContain('zzz-target'); + }); + it('reads a wiki page as raw markdown with frontmatter', async () => { const projectDir = join(tempDir, 'read-project'); await initKtxProject({ projectDir }); diff --git a/packages/cli/test/local-scan-connectors.test.ts b/packages/cli/test/local-scan-connectors.test.ts index 1dadb6c4..827b4ba1 100644 --- a/packages/cli/test/local-scan-connectors.test.ts +++ b/packages/cli/test/local-scan-connectors.test.ts @@ -69,7 +69,7 @@ describe('createKtxCliScanConnector', () => { ' driver: bigquery', ' dataset_id: analytics', ' max_bytes_billed: "987654321"', - ' job_timeout_ms: 30000', + ' query_timeout_ms: 30000', '', ].join('\n'), 'utf-8', @@ -85,7 +85,7 @@ describe('createKtxCliScanConnector', () => { connectionId: 'warehouse', connection: expect.objectContaining({ max_bytes_billed: '987654321', - job_timeout_ms: 30000, + query_timeout_ms: 30000, }), }), ]); diff --git a/packages/cli/test/mcp-http-server.test.ts b/packages/cli/test/mcp-http-server.test.ts index 86c82326..24045cf5 100644 --- a/packages/cli/test/mcp-http-server.test.ts +++ b/packages/cli/test/mcp-http-server.test.ts @@ -194,6 +194,32 @@ function createTestMcpServer() { }; } +function capturingIo() { + let buf = ''; + return { + io: { stdout: { write() {} }, stderr: { write(chunk: string) { buf += chunk; } } }, + text: () => buf, + json: () => + buf + .split('\n') + .filter((line) => line.trim().startsWith('{')) + .map((line) => JSON.parse(line) as Record<string, unknown>), + }; +} + +function initializeBody() { + return { + jsonrpc: '2.0' as const, + id: 1, + method: 'initialize', + params: { + protocolVersion: '2025-06-18', + capabilities: {}, + clientInfo: { name: 'vitest', version: '0.0.0' }, + }, + }; +} + describe('runKtxMcpHttpServer', () => { it('serves 
/health with project metadata', async () => { const handle = await runKtxMcpHttpServer({ @@ -208,11 +234,14 @@ describe('runKtxMcpHttpServer', () => { const port = (handle.server.address() as AddressInfo).port; const response = await get(port, '/health'); expect(response.status).toBe(200); - expect(JSON.parse(response.body)).toEqual({ + const body = JSON.parse(response.body); + expect(body).toMatchObject({ status: 'ok', projectDir: '/tmp/ktx-project', port, }); + expect(typeof body.uptimeMs).toBe('number'); + expect(body.uptimeMs).toBeGreaterThanOrEqual(0); } finally { await handle.close(); } @@ -271,4 +300,55 @@ describe('runKtxMcpHttpServer', () => { await handle.close(); } }); + + it('logs session open and close with the session id', async () => { + const cap = capturingIo(); + const handle = await runKtxMcpHttpServer({ + projectDir: '/tmp/ktx-project', + host: '127.0.0.1', + port: 0, + allowedHosts: [], + allowedOrigins: [], + createMcpServer: createTestMcpServer(), + io: cap.io, + }); + let sessionId: string | undefined; + try { + const port = (handle.server.address() as AddressInfo).port; + const response = await postJson(port, '/mcp', initializeBody()); + sessionId = response.headers['mcp-session-id'] as string; + expect(sessionId).toBeTruthy(); + } finally { + await handle.close(); + } + + const lines = cap.json(); + expect(lines.find((line) => line.msg === 'session.open')?.sessionId).toBe(sessionId); + expect(lines.some((line) => line.msg === 'session.close' && line.sessionId === sessionId)).toBe(true); + }); + + it('never writes the bearer token to the log (headers are not logged)', async () => { + const cap = capturingIo(); + const token = 'super-secret-token-value'; // pragma: allowlist secret + const handle = await runKtxMcpHttpServer({ + projectDir: '/tmp/ktx-project', + host: '127.0.0.1', + port: 0, + token, + allowedHosts: [], + allowedOrigins: [], + createMcpServer: createTestMcpServer(), + io: cap.io, + }); + try { + const port = 
(handle.server.address() as AddressInfo).port; + const response = await postJson(port, '/mcp', initializeBody(), { authorization: `Bearer ${token}` }); + expect(response.status).toBe(200); + } finally { + await handle.close(); + } + + expect(cap.json().some((line) => line.msg === 'session.open')).toBe(true); + expect(cap.text()).not.toContain(token); + }); }); diff --git a/packages/cli/test/mcp-server-factory.test.ts b/packages/cli/test/mcp-server-factory.test.ts index eb378bcf..bfef8291 100644 --- a/packages/cli/test/mcp-server-factory.test.ts +++ b/packages/cli/test/mcp-server-factory.test.ts @@ -147,14 +147,21 @@ describe('createKtxMcpServerFactory', () => { ); expect(factory()).toEqual({ kind: 'mcp-server' }); - expect(createDefaultKtxMcpServer).toHaveBeenCalledWith( - expect.objectContaining({ - contextTools: expect.objectContaining({ - context_tool: { name: 'context_tool' }, - memoryIngest: mocks.memoryIngest, - }), - }), - ); + // memoryIngest is wrapped to validate an explicit connectionId before delegating, + // so it is no longer the raw service object — assert it delegates instead. + const contextTools = (vi.mocked(createDefaultKtxMcpServer).mock.calls[0]![0].contextTools ?? 
{}) as Record< + string, + unknown + >; + expect(contextTools.context_tool).toEqual({ name: 'context_tool' }); + const memoryIngestPort = contextTools.memoryIngest as + | { ingest: (input: unknown) => unknown; status: (runId: string) => unknown } + | undefined; + expect(memoryIngestPort).toBeDefined(); + await memoryIngestPort?.ingest({ userId: 'local', chatId: 'c', userMessage: 'm', assistantMessage: 'a' }); + expect(mocks.memoryIngest.ingest).toHaveBeenCalled(); + await memoryIngestPort?.status('run-1'); + expect(mocks.memoryIngest.status).toHaveBeenCalledWith('run-1'); }); it('uses null embedding ports when no configured provider is available', async () => { diff --git a/packages/cli/test/mcp-stdio-server.test.ts b/packages/cli/test/mcp-stdio-server.test.ts new file mode 100644 index 00000000..9675043a --- /dev/null +++ b/packages/cli/test/mcp-stdio-server.test.ts @@ -0,0 +1,53 @@ +import { PassThrough } from 'node:stream'; +import { McpServer } from '@modelcontextprotocol/sdk/server/mcp.js'; +import { describe, expect, it } from 'vitest'; +import { runKtxMcpStdioServer } from '../src/mcp-stdio-server.js'; + +function capturingIo() { + let buf = ''; + return { + io: { stdout: { write() {} }, stderr: { write(chunk: string) { buf += chunk; } } }, + json: () => + buf + .split('\n') + .filter((line) => line.trim().startsWith('{')) + .map((line) => JSON.parse(line) as Record<string, unknown>), + }; +} + +function createTestMcpServer() { + return () => { + const server = new McpServer({ name: 'ktx-test', version: '0.0.0-test' }); + server.registerTool('ping', { inputSchema: {} }, async () => ({ + content: [{ type: 'text', text: 'pong' }], + })); + return server; + }; +} + +describe('runKtxMcpStdioServer logging', () => { + it('routes a transport error through the logger as transport.error and marks the session open', async () => { + const cap = capturingIo(); + const stdin = new PassThrough(); + const stdout = new PassThrough(); + + const run = runKtxMcpStdioServer({ + projectDir: 
'/tmp/ktx-project', + createMcpServer: createTestMcpServer(), + io: cap.io, + stdin, + stdout, + }); + + // A malformed JSON-RPC line makes the SDK stdio transport surface onerror. + stdin.write('this is not json-rpc\n'); + + await expect(run).rejects.toBeDefined(); + + const lines = cap.json(); + expect(lines.some((line) => line.msg === 'session.open')).toBe(true); + const transportError = lines.find((line) => line.msg === 'transport.error'); + expect(transportError).toBeDefined(); + expect(transportError?.err).toBeDefined(); + }); +}); diff --git a/packages/cli/test/skills/analytics-skill-content.test.ts b/packages/cli/test/skills/analytics-skill-content.test.ts new file mode 100644 index 00000000..4eeb0e6a --- /dev/null +++ b/packages/cli/test/skills/analytics-skill-content.test.ts @@ -0,0 +1,146 @@ +import { readFileSync } from 'node:fs'; +import { fileURLToPath } from 'node:url'; +import { describe, expect, it } from 'vitest'; +import { SkillsRegistryService } from '../../src/context/skills/skills-registry.service.js'; + +const skillPath = fileURLToPath(new URL('../../src/skills/analytics/SKILL.md', import.meta.url)); +const skill = readFileSync(skillPath, 'utf-8'); + +describe('analytics SKILL.md SQL craft', () => { + it('keeps the frontmatter parseable as ktx-analytics', () => { + const service = new SkillsRegistryService({ skillsDir: '/tmp' }); + expect(service.parseFrontmatter(skill).name).toBe('ktx-analytics'); + }); + + it('groups the craft under the five sub-headings', () => { + expect(skill).toContain(''); + expect(skill).toContain(''); + expect(skill).toContain('**Schema discovery before writing SQL**'); + expect(skill).toContain('**Composition**'); + expect(skill).toContain('**Ordering & aggregation determinism**'); + expect(skill).toContain('**Numeric precision**'); + expect(skill).toContain('**Answer completeness / interpretation**'); + }); + + it('represents every craft behavior', () => { + const phrases = [ + 'Sample before you compose', // 
inspect representative rows + 'Cast to the real type before comparing', // string-vs-number compares + 'Build incrementally', // one CTE at a time + 'Avoid fan-out joins', // grain / pre-aggregate + 'the danger is cumulative', // multi-hop fan-out generalization + 'Verify the grain holds across each join', // affirmative grain-verification habit + 'Make the ordering deterministic', // window tie-breaker + 'Filter after the window, not before', // window-then-filter + 'Round only at the end', // precision + truncation + 'Macro vs micro average', // AVG(group) vs SUM/SUM + 'Top / highest / most / lowest', // winning row(s) only + 'For each X / per X / by X', // one row per X + 'Complete the panel', // full-domain spine for "each/every/all" panels + 'Default by additivity', // COALESCE 0 for additive, NULL otherwise + 'Keep the inputs to a derived value', // inputs alongside ratio + 'Project BOTH identity and label', // entity identifier + 'Diagnose empty results', // relax filters one at a time + 'Cumulative / running total', // explicit unbounded-preceding frame (spec 11) + 'Rolling window over calendar time', // calendar range, not row count (spec 11) + 'minimum periods', // emit NULL until the window is full (spec 11) + 'Period-over-period', // LAG + guarded growth ratio (spec 11) + 'Parse text-encoded numerics before doing math on them', // detect text-encoded numbers (spec 12) + 'Strip, scale, and cast in one early CTE', // parse/scale early (spec 12) + 'Confirm the parse covered every value', // failure-detecting cast coverage (spec 12) + 'Answer every requested output', // multi-part/multi-output umbrella over identity+inputs (spec 14) + 'Final completeness check', // re-read the question, confirm the projection covers all four facets (spec 14) + "Don't over-project", // match the request exactly, no padding columns (spec 14) + ]; + for (const phrase of phrases) { + expect(skill).toContain(phrase); + } + }); + + it('ships six dialect-agnostic worked examples: 
window-then-filter, multi-hop fan-out, panel-completeness spine, cumulative running total, text-encoded-numeric parse-and-scale, multi-part output completeness', () => { + const sqlFences = skill.match(/```sql/g) ?? []; + expect(sqlFences).toHaveLength(6); + // window-then-filter (spec 07) + expect(skill).toContain('WITH ranked AS'); + expect(skill).toContain('ROW_NUMBER() OVER'); + expect(skill).toContain('WHERE seq = 1'); + // multi-hop fan-out, pre-aggregated right side + count-only escape hatch (spec 09) + expect(skill).toContain('WITH returned_orders AS'); + expect(skill).toContain('COUNT(DISTINCT o.order_id)'); + // panel completeness: distinct-dimension spine -> LEFT JOIN -> COALESCE (spec 10) + expect(skill).toContain('SELECT DISTINCT region_id FROM regions'); + expect(skill).toContain('LEFT JOIN'); + expect(skill).toMatch(/COALESCE\(/); + // cumulative running total: explicit unbounded-preceding frame + complete tie-breaker (spec 11) + expect(skill).toContain('ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW'); + expect(skill).toContain('ORDER BY txn_date, txn_id'); + // text-encoded numeric: strip with chained REPLACE -> CASE suffix scale -> CAST (spec 12) + expect(skill).toContain('WITH parsed AS'); + expect(skill).toContain('REPLACE('); + expect(skill).toMatch(/AS DECIMAL\(/); + expect(skill).toContain("LIKE '%K' THEN 1000"); + // multi-part output completeness: a column per clause + entity identity, at grain (spec 14) + expect(skill).toContain('region_monthly'); + expect(skill).toContain('MAX(rm.monthly_orders)'); + expect(skill).toContain('MIN(rm.monthly_orders)'); + expect(skill).toContain('MAX(rm.monthly_orders) - MIN(rm.monthly_orders)'); + expect(skill).toContain('r.region_id, r.region_name'); + }); + + it('leaves the existing interactive guidance intact', () => { + expect(skill).toContain(''); + expect(skill).toContain(''); + expect(skill).toContain(''); + expect(skill).toContain('Always run `discover_data` before writing SQL.'); + 
expect(skill).toContain('Treat a `dictionary_search` miss as non-authoritative.'); + expect(skill).toContain('ARR is reported in cents'); + }); + + it('points to the dialect-notes tool without inlining dialect syntax (spec 08)', () => { + // Engine-specific syntax lives behind the sql_dialect_notes MCP tool; the flat + // skill only names the tool (the dialect-clean assertion above still holds). + expect(skill).toContain('sql_dialect_notes'); + }); + + it('stays dialect-agnostic and free of any benchmark/grader reference', () => { + const banned = [ + /\bQUALIFY\b/i, + /strftime/i, + /julianday/i, + /generate_series/i, // postgres-only series generator — belongs in dialect notes, not the skill + /GENERATE_DATE_ARRAY/i, // bigquery-only series generator — belongs in dialect notes, not the skill + /\bRANGE\b[\s\S]{0,40}\bINTERVAL\b/i, // inline dialect range-interval frame — belongs in dialect notes, not the skill + /\bSAFE_CAST\b/i, // bigquery failure-detecting cast — belongs in dialect notes, not the skill + /\bTRY_CAST\b/i, // snowflake/tsql failure-detecting cast — belongs in dialect notes, not the skill + /\bTRY_TO_NUMBER\b/i, // snowflake failure-detecting cast — belongs in dialect notes, not the skill + /\bREGEXP_REPLACE\b/i, // dialect regex strip — the portable strip is chained REPLACE + /toFloat64OrNull/i, // clickhouse failure-detecting cast — belongs in dialect notes, not the skill + /\bGLOB\b/i, // sqlite numeric-pattern guard — belongs in dialect notes, not the skill + /\bspider\b/i, + /\bbenchmark\b/i, + /\bgold\b/i, + /\bgrader\b/i, + ]; + for (const pattern of banned) { + expect(skill).not.toMatch(pattern); + } + // no BigQuery/Snowflake-style backtick-quoted three-part FQTN + expect(skill).not.toMatch(/`[A-Za-z_]\w*\.[A-Za-z_]\w*\.[A-Za-z_]\w*`/); + }); + + it('never anchors relative time to the data maximum date', () => { + // Phrase-level guard (not a raw MAX() grep — MAX() is a legitimate aggregate): + // no single line ties "recent"/"past N " 
to a MAX(...) over the data. + const relativeTime = /(recent|past\s+\w+\s+(day|week|month|year)s?)/i; + const maxCall = /\bMAX\s*\(/i; + for (const line of skill.split('\n')) { + if (maxCall.test(line)) { + expect(line).not.toMatch(relativeTime); + } + } + }); + + it('stays comfortably within the skill size budget', () => { + expect(skill.split('\n').length).toBeLessThan(500); + }); +}); diff --git a/packages/cli/test/status-project.test.ts b/packages/cli/test/status-project.test.ts index 30313897..6dbcae39 100644 --- a/packages/cli/test/status-project.test.ts +++ b/packages/cli/test/status-project.test.ts @@ -10,6 +10,11 @@ import { buildProjectStatus, renderProjectStatus, } from '../src/status-project.js'; +import { initKtxProject, loadKtxProject } from '../src/context/project/project.js'; +import { serializeKtxProjectConfig } from '../src/context/project/config.js'; +import { writeLocalKnowledgePage } from '../src/context/wiki/local-knowledge.js'; + +const stubClaudeCodeAuthProbeForFileBacked = async () => ({ ok: true as const }); function projectWithConfig(config: KtxProjectConfig): KtxLocalProject { return { @@ -646,8 +651,8 @@ describe('buildLocalStatsStatus', () => { expect(stats.unavailable).toBeUndefined(); expect(stats.ingest.totalCompletedRuns).toBe(3); expect(stats.ingest.perConnection).toEqual([ - { connectionId: 'analytics', adapter: 'live-database', lastCompletedAt: '2026-05-10T10:00:00Z' }, - { connectionId: 'docs', adapter: 'notion', lastCompletedAt: '2026-05-01T10:00:00Z' }, + { connectionId: 'analytics', adapter: 'live-database', lastCompletedAt: '2026-05-10T10:00:00Z', skippedObjects: [] }, + { connectionId: 'docs', adapter: 'notion', lastCompletedAt: '2026-05-01T10:00:00Z', skippedObjects: [] }, ]); expect(stats.wikiPages).toEqual([ { scope: 'GLOBAL', count: 2, embeddedCount: 1 }, @@ -691,6 +696,47 @@ describe('buildLocalStatsStatus', () => { expect(stats.wikiPages).toEqual([]); expect(stats.semanticLayer).toEqual([]); }); + + it('surfaces 
skipped objects from the latest report body', async () => { + await mkdir(join(tempDir, '.ktx'), { recursive: true }); + const dbPath = join(tempDir, '.ktx', 'db.sqlite'); + const db = new Database(dbPath); + const body = JSON.stringify({ + fetch: { + status: 'partial', + retryRecommended: false, + warnings: [], + skipped: [ + { rawPath: '', entityType: 'database_object', entityId: 'emp_hire_periods_with_name', severity: 'warning', statusCode: null, message: 'no such column: ehp.start_date', retryRecommended: false }, + ], + }, + }); + db.exec(` + CREATE TABLE local_ingest_reports ( + run_id TEXT PRIMARY KEY, + adapter TEXT NOT NULL, + connection_id TEXT NOT NULL, + status TEXT NOT NULL, + completed_at TEXT NOT NULL, + raw_content_hashes_json TEXT NOT NULL, + body_json TEXT NOT NULL + ); + `); + db.prepare( + `INSERT INTO local_ingest_reports VALUES ('r1', 'live-database', 'oracle_sql', 'done', '2026-06-13T10:00:00Z', '{}', ?)`, + ).run(body); + db.close(); + + const stats = await buildLocalStatsStatus(projectIn(tempDir)); + expect(stats.ingest.perConnection).toEqual([ + { + connectionId: 'oracle_sql', + adapter: 'live-database', + lastCompletedAt: '2026-06-13T10:00:00Z', + skippedObjects: [{ name: 'emp_hire_periods_with_name', reason: 'no such column: ehp.start_date' }], + }, + ]); + }); }); describe('renderProjectStatus Local data', () => { @@ -701,7 +747,12 @@ describe('renderProjectStatus Local data', () => { ingest: { totalCompletedRuns: 3, perConnection: [ - { connectionId: 'analytics', adapter: 'live-database', lastCompletedAt: new Date(Date.now() - 60 * 60 * 1000).toISOString() }, + { + connectionId: 'analytics', + adapter: 'live-database', + lastCompletedAt: new Date(Date.now() - 60 * 60 * 1000).toISOString(), + skippedObjects: [], + }, ], }, wikiPages: [ @@ -727,6 +778,7 @@ describe('renderProjectStatus Local data', () => { expect(rendered).toContain('Wiki'); expect(rendered).not.toContain('Knowledge'); expect(rendered).toContain('3 completed runs'); + 
expect(rendered).not.toContain('skipped —'); expect(rendered).toContain('GLOBAL=2 (2 embedded)'); expect(rendered).toContain('PROJECT=1 (0 embedded)'); expect(rendered).toContain('12 sources (10 embedded) · 200 dictionary values'); @@ -736,6 +788,29 @@ describe('renderProjectStatus Local data', () => { expect(rendered).not.toMatch(/semantic-layer=\d+ yaml/); }); + it('renders a per-connection skipped-objects line when the latest ingest skipped objects', async () => { + const project = projectWithConfig(baseProjectConfig()); + const status = await buildProjectStatus(project, { claudeCodeAuthProbe: stubClaudeCodeAuthProbe }); + status.localStats = { + ingest: { + totalCompletedRuns: 1, + perConnection: [ + { + connectionId: 'oracle_sql', + adapter: 'live-database', + lastCompletedAt: new Date(Date.now() - 60 * 60 * 1000).toISOString(), + skippedObjects: [{ name: 'emp_hire_periods_with_name', reason: 'no such column: ehp.start_date' }], + }, + ], + }, + wikiPages: [], + semanticLayer: [], + projectDir: { dbSqliteBytes: 4096, ktxCacheBytes: 0, rawSources: { fileCount: 0, bytes: 0 } }, + }; + const rendered = renderProjectStatus(status, { useColor: false }); + expect(rendered).toContain('1 object skipped — emp_hire_periods_with_name: no such column: ehp.start_date'); + }); + it('renders unavailable note when DB is missing', async () => { const project = projectWithConfig(baseProjectConfig()); const status = await buildProjectStatus(project, { claudeCodeAuthProbe: stubClaudeCodeAuthProbe }); @@ -755,3 +830,67 @@ describe('renderProjectStatus Local data', () => { expect(rendered).toContain('no .ktx/db.sqlite yet'); }); }); + +describe('buildProjectStatus connection-scoped wiki warnings', () => { + let tempDir: string; + + beforeEach(async () => { + tempDir = await mkdtemp(join(tmpdir(), 'ktx-status-connections-')); + }); + + afterEach(async () => { + await rm(tempDir, { recursive: true, force: true }); + }); + + async function projectWithConnections(ids: string[]): 
Promise<KtxLocalProject> { + const projectDir = join(tempDir, 'project'); + await initKtxProject({ projectDir }); + const project = await loadKtxProject({ projectDir }); + project.config.llm = { ...project.config.llm, provider: { backend: 'claude-code' }, models: { default: 'sonnet' } }; + for (const id of ids) { + project.config.connections[id] = { driver: 'sqlite', url: `file:${id}.db` }; + } + await project.fileStore.writeFile( + 'ktx.yaml', + serializeKtxProjectConfig(project.config), + 'ktx', + 'ktx@example.com', + 'configure connections', + ); + return loadKtxProject({ projectDir }); + } + + it('warns when a wiki page references a connection id absent from ktx.yaml', async () => { + const project = await projectWithConnections(['sales_db']); + await writeLocalKnowledgePage(project, { + key: 'orders-removed', + scope: 'GLOBAL', + summary: 'Orders for a removed db', + content: 'Orders.', + connections: ['removed_db'], + }); + + const status = await buildProjectStatus(project, { claudeCodeAuthProbe: stubClaudeCodeAuthProbeForFileBacked }); + expect(status.warnings).toEqual( + expect.arrayContaining([ + expect.objectContaining({ + message: expect.stringContaining('reference connection id(s) not in ktx.yaml: removed_db'), + }), + ]), + ); + }); + + it('does not warn when all referenced connection ids are configured', async () => { + const project = await projectWithConnections(['sales_db']); + await writeLocalKnowledgePage(project, { + key: 'orders-sales', + scope: 'GLOBAL', + summary: 'Sales orders', + content: 'Orders.', + connections: ['sales_db'], + }); + + const status = await buildProjectStatus(project, { claudeCodeAuthProbe: stubClaudeCodeAuthProbeForFileBacked }); + expect(status.warnings.some((warning) => warning.message.includes('not in ktx.yaml'))).toBe(false); + }); +}); diff --git a/packages/cli/test/telemetry/project-snapshot.test.ts b/packages/cli/test/telemetry/project-snapshot.test.ts index 82dd88f1..264247de 100644 --- 
a/packages/cli/test/telemetry/project-snapshot.test.ts +++ b/packages/cli/test/telemetry/project-snapshot.test.ts @@ -61,6 +61,7 @@ describe('buildProjectStackSnapshotFields', () => { profileSampleRows: 10000, profileConcurrency: 4, validationConcurrency: 4, + detectionBudgetMs: 600000, }, }, storage: { diff --git a/packages/cli/test/text-ingest.test.ts b/packages/cli/test/text-ingest.test.ts index 5e599814..747dc7e1 100644 --- a/packages/cli/test/text-ingest.test.ts +++ b/packages/cli/test/text-ingest.test.ts @@ -2,6 +2,26 @@ import { describe, expect, it, vi } from 'vitest'; import type { MemoryIngestStatus } from '../src/context/memory/memory-runs.js'; import type { KtxLocalProject } from '../src/context/project/project.js'; import { runKtxTextIngest, type TextMemoryIngestPort } from '../src/text-ingest.js'; +import type { VerbatimIngestItem, VerbatimIngestorPort } from '../src/verbatim-ingest.js'; + +function fakeVerbatim( + options: { calls?: VerbatimIngestItem[]; throwOn?: (item: VerbatimIngestItem) => boolean } = {}, +): VerbatimIngestorPort { + return { + ingest: vi.fn(async (item: VerbatimIngestItem) => { + options.calls?.push(item); + if (options.throwOn?.(item)) { + throw new Error(`verbatim write failed for ${item.origin.kind}`); + } + return { + pageKey: item.origin.kind === 'file' && item.origin.path ? 'haversine' : 'page', + outcome: 'written' as const, + connections: item.connectionId ? 
[item.connectionId] : [], + commitHash: null, + }; + }), + }; +} function makeIo(options: { isTTY?: boolean } = {}) { let stdout = ''; @@ -336,4 +356,102 @@ describe('runKtxTextIngest', () => { ).resolves.toBe(1); expect(emptyIo.stderr()).toContain('Text item "text-1" is empty'); }); + + it('routes verbatim file items to the verbatim ingestor instead of the memory agent', async () => { + const io = makeIo(); + const calls: VerbatimIngestItem[] = []; + const verbatim = fakeVerbatim({ calls }); + const createMemoryIngest = vi.fn(() => fakeIngest()); + + await expect( + runKtxTextIngest( + { + projectDir: '/tmp/project', + texts: [], + files: ['/tmp/docs/haversine.md'], + userId: 'local-cli', + json: true, + failFast: false, + verbatim: true, + }, + io.io, + { + loadProject: vi.fn(async () => fakeProject()), + createMemoryIngest, + createVerbatimIngestor: vi.fn(() => verbatim), + readFile: vi.fn(async (path) => `file:${path}`), + now: () => 1, + }, + ), + ).resolves.toBe(0); + + expect(createMemoryIngest).not.toHaveBeenCalled(); + expect(verbatim.ingest).toHaveBeenCalledTimes(1); + expect(calls[0]?.origin).toEqual({ kind: 'file', path: '/tmp/docs/haversine.md' }); + expect(calls[0]?.content).toBe('file:/tmp/docs/haversine.md'); + expect(JSON.parse(io.stdout())).toMatchObject({ + status: 'done', + results: [{ status: 'done', captured: { wiki: ['haversine'] } }], + }); + }); + + it('routes verbatim inline text with a text origin and forwards the connection id', async () => { + const io = makeIo(); + const calls: VerbatimIngestItem[] = []; + const verbatim = fakeVerbatim({ calls }); + + await expect( + runKtxTextIngest( + { + projectDir: '/tmp/project', + texts: ['# Title\n\nbody'], + files: [], + connectionId: 'db1', + userId: 'local-cli', + json: true, + failFast: false, + verbatim: true, + }, + io.io, + { + loadProject: vi.fn(async () => fakeProject()), + createVerbatimIngestor: vi.fn(() => verbatim), + now: () => 1, + }, + ), + ).resolves.toBe(0); + + 
expect(calls[0]?.origin).toEqual({ kind: 'text' }); + expect(calls[0]?.content).toBe('# Title\n\nbody'); + expect(calls[0]?.connectionId).toBe('db1'); + }); + + it('fails the run when a verbatim item throws and honors fail-fast', async () => { + const io = makeIo(); + const calls: VerbatimIngestItem[] = []; + const verbatim = fakeVerbatim({ calls, throwOn: () => true }); + + await expect( + runKtxTextIngest( + { + projectDir: '/tmp/project', + texts: [], + files: ['/tmp/a.md', '/tmp/b.md'], + userId: 'local-cli', + json: true, + failFast: true, + verbatim: true, + }, + io.io, + { + loadProject: vi.fn(async () => fakeProject()), + createVerbatimIngestor: vi.fn(() => verbatim), + readFile: vi.fn(async (path) => `file:${path}`), + now: () => 1, + }, + ), + ).resolves.toBe(1); + + expect(verbatim.ingest).toHaveBeenCalledTimes(1); + }); }); diff --git a/packages/cli/test/verbatim-ingest.test.ts b/packages/cli/test/verbatim-ingest.test.ts new file mode 100644 index 00000000..b46bcce1 --- /dev/null +++ b/packages/cli/test/verbatim-ingest.test.ts @@ -0,0 +1,375 @@ +import { createHash } from 'node:crypto'; +import { mkdtemp, readFile, rm } from 'node:fs/promises'; +import { tmpdir } from 'node:os'; +import { join } from 'node:path'; +import { afterEach, beforeEach, describe, expect, it } from 'vitest'; +import type { KtxEmbeddingPort } from '../src/context/core/embedding.js'; +import type { KtxLlmRuntimePort } from '../src/context/llm/runtime-port.js'; +import { initKtxProject, loadKtxProject, type KtxLocalProject } from '../src/context/project/project.js'; +import { readLocalKnowledgePage, searchLocalKnowledgePages } from '../src/context/wiki/local-knowledge.js'; +import { + buildVerbatimFrontmatter, + createLocalProjectVerbatimIngestor, + deriveDegradedSummary, + deriveVerbatimPageKey, + splitInputDocument, +} from '../src/verbatim-ingest.js'; + +describe('splitInputDocument', () => { + it('splits leading YAML frontmatter from the body', () => { + const result = 
splitInputDocument('---\nsummary: In doc\neffective_date: 2024-01-01\n---\n\nBody here\n'); + expect(result.frontmatter).toEqual({ summary: 'In doc', effective_date: '2024-01-01' }); + expect(result.body).toBe('Body here'); + }); + + it('treats a document without frontmatter as an empty-frontmatter body', () => { + const result = splitInputDocument('# Title\n\ncontent\n'); + expect(result.frontmatter).toEqual({}); + expect(result.body).toBe('# Title\n\ncontent'); + }); +}); + +describe('deriveVerbatimPageKey', () => { + it('derives a file key from the basename without extension', () => { + expect(deriveVerbatimPageKey({ kind: 'file', path: '/docs/haversine-formula.md' }, 'irrelevant')).toBe( + 'haversine-formula', + ); + }); + + it('slugifies a messy file basename', () => { + expect(deriveVerbatimPageKey({ kind: 'file', path: '/docs/RFM Buckets.md' }, 'irrelevant')).toBe('RFM-Buckets'); + }); + + it('derives an inline-text key from a leading Markdown heading', () => { + expect(deriveVerbatimPageKey({ kind: 'text' }, '# Haversine Formula\n\ndetails')).toBe('Haversine-Formula'); + }); + + it('rejects inline text with no leading heading', () => { + expect(() => deriveVerbatimPageKey({ kind: 'text' }, 'no heading here')).toThrow(/heading|--file/); + }); + + it('derives a stdin key from a leading heading like inline text', () => { + expect(deriveVerbatimPageKey({ kind: 'stdin' }, '## RFM Buckets\n\nrows')).toBe('RFM-Buckets'); + }); +}); + +describe('deriveDegradedSummary', () => { + it('uses the leading heading text when present', () => { + expect(deriveDegradedSummary('# Haversine Formula\n\nThe formula computes distance.')).toBe('Haversine Formula'); + }); + + it('falls back to the first non-empty sentence when there is no heading', () => { + expect(deriveDegradedSummary('The haversine formula computes great-circle distance. 
More text.')).toBe( + 'The haversine formula computes great-circle distance.', + ); + }); +}); + +describe('buildVerbatimFrontmatter', () => { + it('gap-fills absent fields with generated metadata and defaults usage_mode to auto', () => { + const fm = buildVerbatimFrontmatter({ + inputFrontmatter: {}, + summary: 'generated summary', + tags: ['finance'], + slRefs: ['orders'], + }); + expect(fm.summary).toBe('generated summary'); + expect(fm.tags).toEqual(['finance']); + expect(fm.sl_refs).toEqual(['orders']); + expect(fm.usage_mode).toBe('auto'); + }); + + it('preserves an explicit input summary instead of the generated one', () => { + const fm = buildVerbatimFrontmatter({ + inputFrontmatter: { summary: 'authoritative summary' }, + summary: 'generated summary', + tags: ['x'], + slRefs: [], + }); + expect(fm.summary).toBe('authoritative summary'); + }); + + it('passes through unknown frontmatter fields verbatim', () => { + const fm = buildVerbatimFrontmatter({ + inputFrontmatter: { effective_date: '2024-01-01', version: 3, owner: 'data-team' }, + summary: 'generated summary', + tags: [], + slRefs: [], + }); + expect(fm.effective_date).toBe('2024-01-01'); + expect(fm.version).toBe(3); + expect(fm.owner).toBe('data-team'); + }); + + it('keeps an explicit usage_mode', () => { + const fm = buildVerbatimFrontmatter({ + inputFrontmatter: { usage_mode: 'always' }, + summary: 'generated summary', + tags: [], + slRefs: [], + }); + expect(fm.usage_mode).toBe('always'); + }); + + it('sets connections from the flag when the input declares none', () => { + const fm = buildVerbatimFrontmatter({ + inputFrontmatter: {}, + summary: 's', + tags: [], + slRefs: [], + connectionId: 'db1', + }); + expect(fm.connections).toEqual(['db1']); + }); + + it('keeps input connections when the flag matches', () => { + const fm = buildVerbatimFrontmatter({ + inputFrontmatter: { connections: ['db1'] }, + summary: 's', + tags: [], + slRefs: [], + connectionId: 'db1', + }); + 
expect(fm.connections).toEqual(['db1']); + }); + + it('keeps input connections when no flag is given', () => { + const fm = buildVerbatimFrontmatter({ + inputFrontmatter: { connections: ['db2'] }, + summary: 's', + tags: [], + slRefs: [], + }); + expect(fm.connections).toEqual(['db2']); + }); + + it('errors when input connections differ from the flag', () => { + expect(() => + buildVerbatimFrontmatter({ + inputFrontmatter: { connections: ['db2'] }, + summary: 's', + tags: [], + slRefs: [], + connectionId: 'db1', + }), + ).toThrow(/connection/i); + }); +}); + +class FakeEmbeddingPort implements KtxEmbeddingPort { + readonly maxBatchSize = 16; + + async computeEmbedding(text: string): Promise<number[]> { + return /haversine|distance|geospatial|sphere|proximity|great-circle/i.test(text) ? [1, 0] : [0, 1]; + } + + async computeEmbeddingsBulk(texts: string[]): Promise<number[][]> { + return Promise.all(texts.map((text) => this.computeEmbedding(text))); + } +} + +function fakeLlmRuntime(metadata: { summary: string; tags: string[]; sl_refs: string[] }): KtxLlmRuntimePort { + return { + async generateText() { + throw new Error('generateText is not used by verbatim ingest'); + }, + async generateObject(input) { + return input.schema.parse(metadata); + }, + async runAgentLoop() { + throw new Error('runAgentLoop is not used by verbatim ingest'); + }, + subprocessForkSpec() { + return null; + }, + }; +} + +function throwingLlmRuntime(): KtxLlmRuntimePort { + return { + async generateText() { + throw new Error('generateText is not used by verbatim ingest'); + }, + async generateObject() { + throw new Error('rate limit exceeded'); + }, + async runAgentLoop() { + throw new Error('runAgentLoop is not used by verbatim ingest'); + }, + subprocessForkSpec() { + return null; + }, + }; +} + +describe('LocalVerbatimIngestor', () => { + let projectDir: string; + let project: KtxLocalProject; + + beforeEach(async () => { + projectDir = await mkdtemp(join(tmpdir(), 'ktx-verbatim-')); + await initKtxProject({ 
projectDir }); + project = await loadKtxProject({ projectDir }); + }); + + afterEach(async () => { + await rm(projectDir, { recursive: true, force: true }); + }); + + it('stores the document body byte-for-byte (after trim)', async () => { + const body = '# Haversine Formula\n\nUse R = 6371 km. The DRS threshold = 0.5 and bucket boundary is [30, 60).'; + const ingestor = createLocalProjectVerbatimIngestor(project, { llmRuntime: null }); + const result = await ingestor.ingest({ origin: { kind: 'file', path: '/docs/haversine-formula.md' }, content: body }); + + expect(result.pageKey).toBe('haversine-formula'); + expect(result.outcome).toBe('written'); + const page = await readLocalKnowledgePage(project, { key: 'haversine-formula' }); + expect(page?.content).toBe(body.trim()); + expect(createHash('sha256').update(page!.content).digest('hex')).toBe( + createHash('sha256').update(body.trim()).digest('hex'), + ); + }); + + it('stores a document larger than the LLM clip limit in full', async () => { + const body = `# Big Doc\n\n${'x'.repeat(60_000)}`; + const ingestor = createLocalProjectVerbatimIngestor(project, { llmRuntime: null }); + await ingestor.ingest({ origin: { kind: 'file', path: '/docs/big-doc.md' }, content: body }); + + const page = await readLocalKnowledgePage(project, { key: 'big-doc' }); + expect(page!.content.length).toBeGreaterThanOrEqual(body.trim().length); + }); + + it('is idempotent when re-ingesting the same document', async () => { + const body = '# Doc\n\nstable body content'; + const item = { origin: { kind: 'file' as const, path: '/docs/doc.md' }, content: body }; + const ingestor = createLocalProjectVerbatimIngestor(project, { llmRuntime: null }); + + const first = await ingestor.ingest(item); + expect(first.outcome).toBe('written'); + const second = await ingestor.ingest(item); + expect(second.outcome).toBe('unchanged'); + + const page = await readLocalKnowledgePage(project, { key: 'doc' }); + expect(page?.content).toBe(body.trim()); + }); + + 
it('hard-errors on a different body at the same key without modifying the existing page', async () => { + const ingestor = createLocalProjectVerbatimIngestor(project, { llmRuntime: null }); + await ingestor.ingest({ origin: { kind: 'file', path: '/docs/doc.md' }, content: '# Doc\n\nfirst body' }); + + await expect( + ingestor.ingest({ origin: { kind: 'file', path: '/docs/doc.md' }, content: '# Doc\n\nsecond body' }), + ).rejects.toThrow(/doc/); + + const page = await readLocalKnowledgePage(project, { key: 'doc' }); + expect(page?.content).toContain('first body'); + expect(page?.content).not.toContain('second body'); + }); + + it('passes through unknown frontmatter and never overwrites an explicit summary', async () => { + const content = + '---\nsummary: Authoritative summary\neffective_date: 2024-01-01\n---\n\n# Metric Spec\n\nbody text'; + const ingestor = createLocalProjectVerbatimIngestor(project, { llmRuntime: null }); + await ingestor.ingest({ origin: { kind: 'file', path: '/docs/metric-spec.md' }, content }); + + const page = await readLocalKnowledgePage(project, { key: 'metric-spec' }); + expect(page?.summary).toBe('Authoritative summary'); + const raw = await readFile(join(projectDir, 'wiki/global/metric-spec.md'), 'utf-8'); + expect(raw).toContain('effective_date: 2024-01-01'); + }); + + it('derives a degraded summary and empty tags with no LLM backend', async () => { + const body = '# RFM Buckets\n\nRecency 1-30 days is bucket A.'; + const ingestor = createLocalProjectVerbatimIngestor(project, { llmRuntime: null }); + await ingestor.ingest({ origin: { kind: 'file', path: '/docs/rfm-buckets.md' }, content: body }); + + const page = await readLocalKnowledgePage(project, { key: 'rfm-buckets' }); + expect(page?.summary).toBe('RFM Buckets'); + expect(page?.tags).toEqual([]); + expect(page?.slRefs).toEqual([]); + }); + + it('scopes the page to a configured connection via the flag', async () => { + project.config.connections = { db1: { driver: 'sqlite' } }; + 
const ingestor = createLocalProjectVerbatimIngestor(project, { llmRuntime: null }); + await ingestor.ingest({ + origin: { kind: 'file', path: '/docs/scoped.md' }, + content: '# Scoped\n\nbody', + connectionId: 'db1', + }); + + const page = await readLocalKnowledgePage(project, { key: 'scoped' }); + expect(page?.connections).toEqual(['db1']); + }); + + it('rejects an unknown connection id and lists the configured ids', async () => { + const ingestor = createLocalProjectVerbatimIngestor(project, { llmRuntime: null }); + await expect( + ingestor.ingest({ origin: { kind: 'file', path: '/docs/x.md' }, content: '# X\n\nbody', connectionId: 'nope' }), + ).rejects.toThrow(/Configured connections/); + }); + + it('errors when the flag connection disagrees with frontmatter connections', async () => { + project.config.connections = { db1: { driver: 'sqlite' } }; + const content = '---\nconnections:\n - db2\n---\n\n# Amb\n\nbody'; + const ingestor = createLocalProjectVerbatimIngestor(project, { llmRuntime: null }); + await expect( + ingestor.ingest({ origin: { kind: 'file', path: '/docs/amb.md' }, content, connectionId: 'db1' }), + ).rejects.toThrow(/connection/i); + }); + + it('errors on inline text without a leading heading', async () => { + const ingestor = createLocalProjectVerbatimIngestor(project, { llmRuntime: null }); + await expect(ingestor.ingest({ origin: { kind: 'text' }, content: 'no heading here' })).rejects.toThrow( + /heading|--file/, + ); + }); + + it('uses LLM-generated metadata to gap-fill absent fields', async () => { + const runtime = fakeLlmRuntime({ summary: 'LLM summary', tags: ['t1'], sl_refs: ['orders'] }); + const ingestor = createLocalProjectVerbatimIngestor(project, { llmRuntime: runtime }); + await ingestor.ingest({ origin: { kind: 'file', path: '/docs/llm-doc.md' }, content: '# LLM Doc\n\nabout orders' }); + + const page = await readLocalKnowledgePage(project, { key: 'llm-doc' }); + expect(page?.summary).toBe('LLM summary'); + 
expect(page?.tags).toEqual(['t1']); + expect(page?.slRefs).toEqual(['orders']); + }); + + it('fails the item on LLM error and writes no page when a backend is configured', async () => { + const ingestor = createLocalProjectVerbatimIngestor(project, { llmRuntime: throwingLlmRuntime() }); + await expect( + ingestor.ingest({ origin: { kind: 'file', path: '/docs/fail-doc.md' }, content: '# Fail Doc\n\nbody' }), + ).rejects.toThrow(); + + const page = await readLocalKnowledgePage(project, { key: 'fail-doc' }); + expect(page).toBeNull(); + }); + + it('is findable by a body phrase via the lexical lane', async () => { + const ingestor = createLocalProjectVerbatimIngestor(project, { llmRuntime: null }); + await ingestor.ingest({ + origin: { kind: 'file', path: '/docs/overtake.md' }, + content: '# Overtake Rule\n\nThe overtake rule grants DRS within one second.', + }); + + const results = await searchLocalKnowledgePages(project, { query: 'overtake rule grants DRS' }); + expect(results.some((result) => result.key === 'overtake')).toBe(true); + }); + + it('is findable by a topic paraphrase via the semantic lane when embeddings are enabled', async () => { + const ingestor = createLocalProjectVerbatimIngestor(project, { llmRuntime: null }); + await ingestor.ingest({ + origin: { kind: 'file', path: '/docs/haversine.md' }, + content: '# Haversine\n\nThe haversine formula computes great-circle distance.', + }); + + const results = await searchLocalKnowledgePages(project, { + query: 'geospatial proximity', + embeddingService: new FakeEmbeddingPort(), + }); + const match = results.find((result) => result.key === 'haversine'); + expect(match).toBeDefined(); + expect(match?.matchReasons).toContain('semantic'); + }); +}); diff --git a/pnpm-lock.yaml b/pnpm-lock.yaml index c0e2b062..5521fa08 100644 --- a/pnpm-lock.yaml +++ b/pnpm-lock.yaml @@ -212,6 +212,12 @@ importers: pg: specifier: ^8.21.0 version: 8.21.0 + pino: + specifier: ^10.3.1 + version: 10.3.1 + pino-pretty: + specifier: 
^13.1.3 + version: 13.1.3 posthog-node: specifier: ^5.34.9 version: 5.34.9 @@ -1656,6 +1662,9 @@ packages: cpu: [x64] os: [win32] + '@pinojs/redact@0.4.0': + resolution: {integrity: sha512-k2ENnmBugE/rzQfEcdWHcCY+/FM3VLzH9cYEsbdsoqrvzAKRhUZeRNhAZvB8OitQJ1TBed3yqWtdjzS6wJKBwg==} + '@pnpm/config.env-replace@1.1.0': resolution: {integrity: sha512-htyl8TWnKL7K/ESFa1oW2UB5lVDxuF5DpM7tBi6Hu2LNL3mWkIzNLG6N4zoCUP1lCKNxWy/3iu8mS8MvToGd6w==} engines: {node: '>=12.22.0'} @@ -2776,6 +2785,10 @@ packages: asynckit@0.4.0: resolution: {integrity: sha512-Oei9OH4tRh0YqU3GxhX79dM/mwVgvbZJaSNaRk+bshkj0S5cfHcgYakreBjrHwatXKbz+IoIdYLxrKim2MjW0Q==} + atomic-sleep@1.0.0: + resolution: {integrity: sha512-kNOjDqAh7px0XWNI+4QbzoiR/nTkHAWNud2uvnJquD1/x5a7EQZMJT0AczqK0Qn67oY/TTQ1LbUKajZpp3I9tQ==} + engines: {node: '>=8.0.0'} + auto-bind@5.0.1: resolution: {integrity: sha512-ooviqdwwgfIfNmDwo94wlshcdzfO64XV0Cg6oDsDYBJfITDz1EngD2z7DkbvCWn+XIMsIqW27sEVF6qcpJrRcg==} engines: {node: ^12.20.0 || ^14.13.1 || >=16.0.0} @@ -3021,6 +3034,9 @@ packages: resolution: {integrity: sha512-ezmVcLR3xAVp8kYOm4GS45ZLLgIE6SPAFoduLr6hTDajwb3KZ2F46gulK3XpcwRFb5KKGCSezCBAY4Dw4HsyXA==} engines: {node: '>=18'} + colorette@2.0.20: + resolution: {integrity: sha512-IfEDxwoWIjkeXL1eXcDiow4UbKjhLdq6/EuSVR9GMN7KVH3r9gQ83e73hsz1Nd1T3ijd5xv1wcWRYO+D6kCI2w==} + combined-stream@1.0.8: resolution: {integrity: sha512-FQN4MRfuJeHf7cBbBMJFXhKSDq+2kAArBlmRBvcvFE5BB1HZKXtSFASDhdlz9zOYwxh8lDdnvmMOe/+5cdoEdg==} engines: {node: '>= 0.8'} @@ -3170,6 +3186,9 @@ packages: resolution: {integrity: sha512-0R9ikRb668HB7QDxT1vkpuUBtqc53YyAwMwGeUFKRojY/NWKvdZ+9UYtRfGmhqNbRkTSVpMbmyhXipFFv2cb/A==} engines: {node: '>= 12'} + dateformat@4.6.3: + resolution: {integrity: sha512-2P0p0pFGzHS5EMnhdxQi7aJN+iMheud0UhG4dlE1DLAlvL8JHjJJTX/CSm4JXwV0Ka5nGk3zC5mcb5bUQUxxMA==} + debug@4.4.3: resolution: {integrity: sha512-RGwwWnwQvkVfavKVt22FGLw+xYSdzARwm0ru6DhTVA3umU5hZc28V3kO4stgYryrTlLpuvgI9GiijltAjNbcqA==} engines: {node: '>=6.0'} @@ -3430,9 +3449,15 @@ 
packages: fast-content-type-parse@3.0.0: resolution: {integrity: sha512-ZvLdcY8P+N8mGQJahJV5G4U88CSvT1rP8ApL6uETe88MBXrBHAkZlSEySdUlyztF7ccb+Znos3TFqaepHxdhBg==} + fast-copy@4.0.3: + resolution: {integrity: sha512-58apWr0GUiDFM8+3afrO6eYwJBn9ZAhDOzG3L+/9llab/haCARS2UIfffmOurYLwbgDRs8n0rfr6qAAPEAuAQw==} + fast-deep-equal@3.1.3: resolution: {integrity: sha512-f3qQ9oQy9j2AhBe/H9VC91wLmKBCCU/gDOnKNAYG5hswO7BLKj09Hc5HYNz9cGI++xlpDCIgDaitVs03ATR84Q==} + fast-safe-stringify@2.1.1: + resolution: {integrity: sha512-W+KJc2dmILlPplD/H4K9l9LcAHAfPtP6BY84uVLXQ6Evcz9Lcg33Y2z1IVblT6xdY54PXYVHEv+0Wpq8Io6zkA==} + fast-sha256@1.3.0: resolution: {integrity: sha512-n11RGP/lrWEFI/bWdygLxhI+pVeo1ZYIVwvvPkW7azl/rOy+F3HYRZ2K5zeE9mmkhQppyv9sQFx0JM9UabnpPQ==} @@ -3828,6 +3853,9 @@ packages: hastscript@9.0.1: resolution: {integrity: sha512-g7df9rMFX/SPi34tyGCyUBREQoKkapwdY/T04Qn9TDWfHhAYt4/I0gMVirzK5wEzeUqIjEB+LXC/ypb7Aqno5w==} + help-me@5.0.0: + resolution: {integrity: sha512-7xgomUX6ADmcYzFik0HzAxh/73YlKR9bmFzf51CZwR+b6YtzU2m0u49hQCqV6SvlqIqsaxovfwdvbnsw3b/zpg==} + highlight.js@10.7.3: resolution: {integrity: sha512-tzcUFauisWKNHaRkN4Wjl/ZA07gENAjFl3J/c480dprkGTg5EQstgaNFqBfUqCq54kZRIEcreTsAgF/m2quD7A==} @@ -4094,6 +4122,10 @@ packages: jose@6.2.3: resolution: {integrity: sha512-YYVDInQKFJfR/xa3ojUTl8c2KoTwiL1R5Wg9YCydwH0x0B9grbzlg5HC7mMjCtUJjbQ/YnGEZIhI5tCgfTb4Hw==} + joycon@3.1.1: + resolution: {integrity: sha512-34wB/Y7MW7bzjKRjUKTa46I2Z7eV62Rkhva+KkopW7Qvv/OSWBqvkSY7vusOPrNuZcUG3tApvdVgNB8POj3SPw==} + engines: {node: '>=10'} + js-md4@0.3.2: resolution: {integrity: sha512-/GDnfQYsltsjRswQhN9fhv3EMw2sCpUdrdxyWDOUK7eyD++r3gRhzgiQgc/x4MAv2i1iuQ4lxO5mvqM3vj4bwA==} @@ -4820,6 +4852,10 @@ packages: obug@2.1.1: resolution: {integrity: sha512-uTqF9MuPraAQ+IsnPf366RG4cP9RtUi7MLO1N3KEc+wb0a6yKpeL0lmk2IB1jY5KHPAlTc6T/JRdC/YqxHNwkQ==} + on-exit-leak-free@2.1.2: + resolution: {integrity: sha512-0eJJY6hXLGf1udHwfNftBqH+g73EU4B504nZeKpz1sYRKafAghwxEJunB2O7rDZkL4PGfsMVnTXZ2EjibbqcsA==} + engines: 
{node: '>=14.0.0'} + on-finished@2.4.1: resolution: {integrity: sha512-oVlzkg3ENAhCk2zdv7IJwd/QUD4z2RxRwpkcGY8psCVcCYZNq4wYnVWALHM+brtuJjePWiYF/ClmuDr8Ch5+kg==} engines: {node: '>= 0.8'} @@ -5046,6 +5082,20 @@ packages: resolution: {integrity: sha512-C3FsVNH1udSEX48gGX1xfvwTWfsYWj5U+8/uK15BGzIGrKoUpghX8hWZwa/OFnakBiiVNmBvemTJR5mcy7iPcg==} engines: {node: '>=4'} + pino-abstract-transport@3.0.0: + resolution: {integrity: sha512-wlfUczU+n7Hy/Ha5j9a/gZNy7We5+cXp8YL+X+PG8S0KXxw7n/JXA3c46Y0zQznIJ83URJiwy7Lh56WLokNuxg==} + + pino-pretty@13.1.3: + resolution: {integrity: sha512-ttXRkkOz6WWC95KeY9+xxWL6AtImwbyMHrL1mSwqwW9u+vLp/WIElvHvCSDg0xO/Dzrggz1zv3rN5ovTRVowKg==} + hasBin: true + + pino-std-serializers@7.1.0: + resolution: {integrity: sha512-BndPH67/JxGExRgiX1dX0w1FvZck5Wa4aal9198SrRhZjH3GxKQUKIBnYJTdj2HDN3UQAS06HlfcSbQj2OHmaw==} + + pino@10.3.1: + resolution: {integrity: sha512-r34yH/GlQpKZbU1BvFFqOjhISRo1MNx1tWYsYvmj6KIRHSPMT2+yHOEb1SG6NMvRoHRF0a07kCOox/9yakl1vg==} + hasBin: true + pkce-challenge@5.0.1: resolution: {integrity: sha512-wQ0b/W4Fr01qtpHlqSqspcj3EhBvimsdh0KlHhH8HRZnMsEa0ea2fTULOXOS9ccQr3om+GcGRk4e+isrZWV8qQ==} engines: {node: '>=16.20.0'} @@ -5096,6 +5146,9 @@ packages: process-nextick-args@2.0.1: resolution: {integrity: sha512-3ouUOpQhtgrbOa17J7+uxOTpITYWaGP7/AhoR3+A+/1e9skrzelGi/dXzEYyvbxubEF6Wn2ypscTKiKJFFn1ag==} + process-warning@5.0.0: + resolution: {integrity: sha512-a39t9ApHNx2L4+HBnQKqxxHNs1r7KF+Intd8Q/g1bUh6q0WIp9voPXJ/x0j+ZL45KF1pJd9+q2jLIRMfvEshkA==} + process@0.11.10: resolution: {integrity: sha512-cdGef/drWFoydD1JsMzuFf8100nZl+GT+yacc2bEced5f9Rjk4z+WtFUTBu9PhOi9j/jfmBPu0mMEY4wIdAF8A==} engines: {node: '>= 0.6.0'} @@ -5125,6 +5178,9 @@ packages: resolution: {integrity: sha512-Rzq0KEyX/w/tEybncDgdkZrJgVUsUMk3xjh3t5bv3S1HTAtg+uOYt72+ZfwiQwKdysThkTBdL/rTi6HDmX9Ddw==} engines: {node: '>=0.6'} + quick-format-unescaped@4.0.4: + resolution: {integrity: sha512-tYC1Q1hgyRuHgloV/YXs2w15unPVh8qfu/qCTfhTYamaw7fyhumKa2yGpdSo87vY32rIclj+4fWYQXUMs9EHvg==} + 
range-parser@1.2.1: resolution: {integrity: sha512-Hrgsx+orqoygnmhFbKaHE6c296J+HTAQXoxEF6gNupROmmGJRoyzfG3ccAveqCBrwr/2yxQ5BVd/GTl5agOwSg==} engines: {node: '>= 0.6'} @@ -5213,6 +5269,13 @@ packages: resolution: {integrity: sha512-9u/XQ1pvrQtYyMpZe7DXKv2p5CNvyVwzUB6uhLAnQwHMSgKMBR62lc7AHljaeteeHXn11XTAaLLUVZYVZyuRBQ==} engines: {node: '>= 20.19.0'} + real-require@0.2.0: + resolution: {integrity: sha512-57frrGM/OCTLqLOAh0mhVA9VBMHd+9U7Zb2THMGdBUoZVOtGbJzjxsYGDJ3A9AYYCP4hn6y1TVbaOfzWtm5GFg==} + engines: {node: '>= 12.13.0'} + + real-require@1.0.0: + resolution: {integrity: sha512-P4nbQYQfePJxRSmY+v/KINxVucm4NF3p3s7pJveMTtom52FR4YGltUQLB8idDXwDDWW+eYrWDFbuzUnjoWHF7g==} + recma-build-jsx@1.0.0: resolution: {integrity: sha512-8GtdyqaBcDfva+GUKDr3nev3VpKAhup1+RvkMvUxURHpW7QyIvk9F5wz7Vzo06CEMSilw6uArgRqhpiUcWp8ew==} @@ -5323,6 +5386,9 @@ packages: scroll-into-view-if-needed@3.1.0: resolution: {integrity: sha512-49oNpRjWRvnU8NyGVmUaYG4jtTkNonFZI86MmGRDqBphEK2EXT9gdEUoQPZhuBM8yWHxCWbobltqYO5M4XrUvQ==} + secure-json-parse@4.1.0: + resolution: {integrity: sha512-l4KnYfEyqYJxDwlNVyRfO2E4NTHfMKAWdUuA8J0yve2Dz/E/PdBepY03RvyJpssIpRFwJoCD55wA+mEDs6ByWA==} + semantic-release@25.0.3: resolution: {integrity: sha512-WRgl5GcypwramYX4HV+eQGzUbD7UUbljVmS+5G1uMwX/wLgYuJAxGeerXJDMO2xshng4+FXqCgyB5QfClV6WjA==} engines: {node: ^22.14.0 || >= 24.10.0} @@ -5432,6 +5498,9 @@ packages: peerDependencies: asn1.js: ^5.4.1 + sonic-boom@4.2.1: + resolution: {integrity: sha512-w6AxtubXa2wTXAUsZMMWERrsIRAdrK0Sc+FUytWvYAhBJLyuI4llrMIC1DtlNSdI99EI86KZum2MMq3EAZlF9Q==} + source-map-js@1.2.1: resolution: {integrity: sha512-UXWMKhLOwVKb728IUtQPXxfYU+usdybtUrK/8uGE8CQMvrhOpwvzDBwj0QhSL7MQc7vIsISBG8VQ8+IDQxpfQA==} engines: {node: '>=0.10.0'} @@ -5657,6 +5726,10 @@ packages: thenify@3.3.1: resolution: {integrity: sha512-RVZSIV5IG10Hk3enotrhvz0T9em6cyHBLkH/YAZuKqd8hRkKhSfCGIcP2KUY0EPxndzANBmNllzWPwak+bheSw==} + thread-stream@4.2.0: + resolution: {integrity: 
sha512-e2zZ96wSChazBsbENf/Pcm/4swHt2cEKQ92rhUjkL9GCKiTDJIaTBenjE/m9DXi0QBmTMDkFDdOomUy20A1tDQ==} + engines: {node: '>=20'} + through2@2.0.5: resolution: {integrity: sha512-/mrRod8xqpA+IHSLyGCQ2s8SPHiCDEeQJSep1jqLYeEUClOFG2Qsh+4FU6G9VeqpZnGW/Su8LQGc4YKni5rYSQ==} @@ -7547,6 +7620,8 @@ snapshots: '@oxc-resolver/binding-win32-x64-msvc@11.19.1': optional: true + '@pinojs/redact@0.4.0': {} + '@pnpm/config.env-replace@1.1.0': {} '@pnpm/network.ca-file@1.0.2': @@ -8726,6 +8801,8 @@ snapshots: asynckit@0.4.0: {} + atomic-sleep@1.0.0: {} + auto-bind@5.0.1: {} aws-ssl-profiles@1.1.2: {} @@ -8969,6 +9046,8 @@ snapshots: color-convert: 3.1.3 color-string: 2.1.4 + colorette@2.0.20: {} + combined-stream@1.0.8: dependencies: delayed-stream: 1.0.0 @@ -9098,6 +9177,8 @@ snapshots: data-uri-to-buffer@4.0.1: {} + dateformat@4.6.3: {} + debug@4.4.3: dependencies: ms: 2.1.3 @@ -9412,8 +9493,12 @@ snapshots: fast-content-type-parse@3.0.0: {} + fast-copy@4.0.3: {} + fast-deep-equal@3.1.3: {} + fast-safe-stringify@2.1.1: {} + fast-sha256@1.3.0: {} fast-string-truncated-width@3.0.3: {} @@ -9882,6 +9967,8 @@ snapshots: property-information: 7.1.0 space-separated-tokens: 2.0.2 + help-me@5.0.0: {} + highlight.js@10.7.3: {} homedir-polyfill@1.0.3: @@ -10124,6 +10211,8 @@ snapshots: jose@6.2.3: {} + joycon@3.1.1: {} + js-md4@0.3.2: {} js-tokens@10.0.0: {} @@ -11005,6 +11094,8 @@ snapshots: obug@2.1.1: {} + on-exit-leak-free@2.1.2: {} + on-finished@2.4.1: dependencies: ee-first: 1.1.1 @@ -11246,6 +11337,42 @@ snapshots: pify@3.0.0: {} + pino-abstract-transport@3.0.0: + dependencies: + split2: 4.2.0 + + pino-pretty@13.1.3: + dependencies: + colorette: 2.0.20 + dateformat: 4.6.3 + fast-copy: 4.0.3 + fast-safe-stringify: 2.1.1 + help-me: 5.0.0 + joycon: 3.1.1 + minimist: 1.2.8 + on-exit-leak-free: 2.1.2 + pino-abstract-transport: 3.0.0 + pump: 3.0.4 + secure-json-parse: 4.1.0 + sonic-boom: 4.2.1 + strip-json-comments: 5.0.3 + + pino-std-serializers@7.1.0: {} + + pino@10.3.1: + dependencies: + 
'@pinojs/redact': 0.4.0 + atomic-sleep: 1.0.0 + on-exit-leak-free: 2.1.2 + pino-abstract-transport: 3.0.0 + pino-std-serializers: 7.1.0 + process-warning: 5.0.0 + quick-format-unescaped: 4.0.4 + real-require: 0.2.0 + safe-stable-stringify: 2.5.0 + sonic-boom: 4.2.1 + thread-stream: 4.2.0 + pkce-challenge@5.0.1: {} pkg-conf@2.1.0: @@ -11294,6 +11421,8 @@ snapshots: process-nextick-args@2.0.1: {} + process-warning@5.0.0: {} + process@0.11.10: {} property-information@7.1.0: {} @@ -11318,6 +11447,8 @@ snapshots: dependencies: side-channel: 1.1.0 + quick-format-unescaped@4.0.4: {} + range-parser@1.2.1: {} raw-body@3.0.2: @@ -11427,6 +11558,10 @@ snapshots: readdirp@5.0.0: {} + real-require@0.2.0: {} + + real-require@1.0.0: {} + recma-build-jsx@1.0.0: dependencies: '@types/estree': 1.0.9 @@ -11603,6 +11738,8 @@ snapshots: dependencies: compute-scroll-into-view: 3.1.1 + secure-json-parse@4.1.0: {} + semantic-release@25.0.3(typescript@6.0.3): dependencies: '@semantic-release/commit-analyzer': 13.0.1(semantic-release@25.0.3(typescript@6.0.3)) @@ -11830,6 +11967,10 @@ snapshots: - debug - supports-color + sonic-boom@4.2.1: + dependencies: + atomic-sleep: 1.0.0 + source-map-js@1.2.1: {} source-map@0.6.1: {} @@ -12053,6 +12194,10 @@ snapshots: dependencies: any-promise: 1.3.0 + thread-stream@4.2.0: + dependencies: + real-require: 1.0.0 + through2@2.0.5: dependencies: readable-stream: 2.3.8 diff --git a/python/ktx-daemon/src/ktx_daemon/database_introspection.py b/python/ktx-daemon/src/ktx_daemon/database_introspection.py index 69be4209..be5cd924 100644 --- a/python/ktx-daemon/src/ktx_daemon/database_introspection.py +++ b/python/ktx-daemon/src/ktx_daemon/database_introspection.py @@ -162,11 +162,26 @@ class DatabaseIntrospectionRequest(BaseModel): return value +# Mirrors the Node KtxScanWarning shape so the daemon cannot emit a code the +# Node adapter (mapDaemonSnapshot) cannot render. 
+OBJECT_INTROSPECTION_FAILED_CODE = "object_introspection_failed" + + +class DatabaseIntrospectionWarning(BaseModel): + code: str + message: str + table: str | None = None + column: str | None = None + recoverable: bool = True + metadata: dict[str, Any] | None = None + + class DatabaseIntrospectionResponse(BaseModel): connection_id: str extracted_at: str metadata: dict[str, Any] tables: list[LiveDatabaseTable] + warnings: list[DatabaseIntrospectionWarning] = Field(default_factory=list) @dataclass(frozen=True) @@ -264,13 +279,48 @@ def _load_postgres_rows( ) -def _map_rows_to_tables(rows: DatabaseIntrospectionRows) -> list[LiveDatabaseTable]: +def _object_introspection_warning( + row: Mapping[str, Any], error: ValueError +) -> DatabaseIntrospectionWarning: + name = _optional_string(row, "table_name") + label = ".".join( + part + for part in ( + _optional_string(row, "table_catalog"), + _optional_string(row, "table_schema"), + name, + ) + if part + ) + column = _optional_string(row, "column_name") or _optional_string( + row, "from_column" + ) + return DatabaseIntrospectionWarning( + code=OBJECT_INTROSPECTION_FAILED_CODE, + message=str(error), + table=name, + column=column, + recoverable=True, + metadata={"object": label or "object"}, + ) + + +def _map_rows_to_tables( + rows: DatabaseIntrospectionRows, +) -> tuple[list[LiveDatabaseTable], list[DatabaseIntrospectionWarning]]: tables: dict[str, LiveDatabaseTable] = {} + warnings: list[DatabaseIntrospectionWarning] = [] for row in rows.table_rows: - catalog = _optional_string(row, "table_catalog") - db = _required_string(row, "table_schema") - name = _required_string(row, "table_name") + # One malformed/inaccessible object is skipped with a warning rather than + # aborting introspection of every healthy object. 
+ try: + catalog = _optional_string(row, "table_catalog") + db = _required_string(row, "table_schema") + name = _required_string(row, "table_name") + except ValueError as error: + warnings.append(_object_introspection_warning(row, error)) + continue key = _table_key(catalog, db, name) tables[key] = LiveDatabaseTable( catalog=catalog, @@ -280,44 +330,51 @@ def _map_rows_to_tables(rows: DatabaseIntrospectionRows) -> list[LiveDatabaseTab ) for row in rows.column_rows: - catalog = _optional_string(row, "table_catalog") - db = _required_string(row, "table_schema") - table_name = _required_string(row, "table_name") - table = tables.get(_table_key(catalog, db, table_name)) - if table is None: - continue - - table.columns.append( - LiveDatabaseColumn( - name=_required_string(row, "column_name"), - type=_required_string(row, "formatted_type"), - nullable=bool(row.get("is_nullable")), - primary_key=bool(row.get("is_primary_key")), - comment=_optional_string(row, "column_comment"), + try: + catalog = _optional_string(row, "table_catalog") + db = _required_string(row, "table_schema") + table_name = _required_string(row, "table_name") + table = tables.get(_table_key(catalog, db, table_name)) + if table is None: + continue + table.columns.append( + LiveDatabaseColumn( + name=_required_string(row, "column_name"), + type=_required_string(row, "formatted_type"), + nullable=bool(row.get("is_nullable")), + primary_key=bool(row.get("is_primary_key")), + comment=_optional_string(row, "column_comment"), + ) ) - ) + except ValueError as error: + warnings.append(_object_introspection_warning(row, error)) + continue for row in rows.foreign_key_rows: - catalog = _optional_string(row, "table_catalog") - db = _required_string(row, "table_schema") - table_name = _required_string(row, "table_name") - table = tables.get(_table_key(catalog, db, table_name)) - if table is None: + try: + catalog = _optional_string(row, "table_catalog") + db = _required_string(row, "table_schema") + table_name = 
_required_string(row, "table_name") + table = tables.get(_table_key(catalog, db, table_name)) + if table is None: + continue + table.foreign_keys.append( + LiveDatabaseForeignKey( + from_column=_required_string(row, "from_column"), + to_table=_required_string(row, "to_table"), + to_column=_required_string(row, "to_column"), + constraint_name=_optional_string(row, "constraint_name"), + ) + ) + except ValueError as error: + warnings.append(_object_introspection_warning(row, error)) continue - table.foreign_keys.append( - LiveDatabaseForeignKey( - from_column=_required_string(row, "from_column"), - to_table=_required_string(row, "to_table"), - to_column=_required_string(row, "to_column"), - constraint_name=_optional_string(row, "constraint_name"), - ) - ) - - return sorted( + sorted_tables = sorted( tables.values(), key=lambda table: _table_key(table.catalog, table.db, table.name), ) + return sorted_tables, warnings def introspect_database_response( @@ -332,9 +389,11 @@ def introspect_database_response( rows = (load_rows or _load_postgres_rows)(request) timestamp = now() if now else datetime.now(timezone.utc).isoformat() + tables, warnings = _map_rows_to_tables(rows) return DatabaseIntrospectionResponse( connection_id=request.connection_id, extracted_at=timestamp, metadata={"driver": driver, "schemas": list(request.schemas)}, - tables=_map_rows_to_tables(rows), + tables=tables, + warnings=warnings, ) diff --git a/python/ktx-daemon/tests/test_database_introspection.py b/python/ktx-daemon/tests/test_database_introspection.py index b0fb7a5b..f182292d 100644 --- a/python/ktx-daemon/tests/test_database_introspection.py +++ b/python/ktx-daemon/tests/test_database_introspection.py @@ -3,6 +3,7 @@ from __future__ import annotations import pytest from ktx_daemon.database_introspection import ( + OBJECT_INTROSPECTION_FAILED_CODE, DatabaseIntrospectionRequest, DatabaseIntrospectionRows, LiveDatabaseTableScopeRef, @@ -126,6 +127,123 @@ def 
test_introspect_database_response_maps_postgres_catalog_rows() -> None: } +def test_introspect_database_response_isolates_a_broken_object() -> None: + def fake_load_rows( + request: DatabaseIntrospectionRequest, + ) -> DatabaseIntrospectionRows: + return DatabaseIntrospectionRows( + table_rows=[ + { + "table_catalog": "warehouse", + "table_schema": "public", + "table_name": "customers", + "table_comment": None, + }, + # Malformed/inaccessible object: missing table_name. + { + "table_catalog": "warehouse", + "table_schema": "public", + "table_name": None, + "table_comment": None, + }, + ], + column_rows=[], + foreign_key_rows=[], + ) + + response = introspect_database_response( + DatabaseIntrospectionRequest( + connection_id="warehouse", + driver="postgres", + url="postgresql://readonly@example.test/warehouse", + schemas=["public"], + ), + load_rows=fake_load_rows, + now=lambda: "2026-04-28T10:00:00+00:00", + ) + + assert [table.name for table in response.tables] == ["customers"] + assert len(response.warnings) == 1 + # Parity with the Node KtxScanWarningCode the adapter renders. + assert ( + response.warnings[0].code + == OBJECT_INTROSPECTION_FAILED_CODE + == "object_introspection_failed" + ) + assert response.warnings[0].recoverable is True + + +def test_introspect_database_response_warns_on_broken_column_and_fk_rows() -> None: + # A malformed column or foreign-key row must surface a warning, not vanish + # silently — the table-row path already does, and a dropped column is data loss. 
+ def fake_load_rows( + request: DatabaseIntrospectionRequest, + ) -> DatabaseIntrospectionRows: + return DatabaseIntrospectionRows( + table_rows=[ + { + "table_catalog": "warehouse", + "table_schema": "public", + "table_name": "orders", + "table_comment": None, + } + ], + column_rows=[ + { + "table_catalog": "warehouse", + "table_schema": "public", + "table_name": "orders", + "column_name": "id", + "formatted_type": "integer", + "is_nullable": False, + "is_primary_key": True, + "column_comment": None, + }, + # Malformed column: missing formatted_type. + { + "table_catalog": "warehouse", + "table_schema": "public", + "table_name": "orders", + "column_name": "broken_col", + "formatted_type": None, + "is_nullable": True, + "is_primary_key": False, + "column_comment": None, + }, + ], + foreign_key_rows=[ + # Malformed FK: missing to_table. + { + "table_catalog": "warehouse", + "table_schema": "public", + "table_name": "orders", + "from_column": "customer_id", + "to_table": None, + "to_column": "id", + "constraint_name": "orders_customer_id_fkey", + } + ], + ) + + response = introspect_database_response( + DatabaseIntrospectionRequest( + connection_id="warehouse", + driver="postgres", + url="postgresql://readonly@example.test/warehouse", + schemas=["public"], + ), + load_rows=fake_load_rows, + now=lambda: "2026-04-28T10:00:00+00:00", + ) + + assert [column.name for column in response.tables[0].columns] == ["id"] + assert {(w.code, w.table, w.column) for w in response.warnings} == { + (OBJECT_INTROSPECTION_FAILED_CODE, "orders", "broken_col"), + (OBJECT_INTROSPECTION_FAILED_CODE, "orders", "customer_id"), + } + assert all(warning.recoverable for warning in response.warnings) + + def test_introspect_database_response_rejects_non_postgres_driver() -> None: with pytest.raises(ValueError, match='supports only driver "postgres"'): introspect_database_response( diff --git a/spider2-specs/README.md b/spider2-specs/README.md new file mode 100644 index 00000000..1cf2acdb --- 
/dev/null +++ b/spider2-specs/README.md @@ -0,0 +1,62 @@ +# spider2-specs — feature specs driven by the Spider 2.0-Lite benchmark + +This directory is the handoff point between two agents working on different +sides of the same goal: making Claude Code + ktx score well on the Spider +2.0-Lite benchmark **without benchmark-specific instructions** — the agent +should succeed using only what ktx provides (skills, semantic layer, wiki). + +## Mechanics + +Three directories form a pipeline. A feature flows `todo/` → `specs/` → +(implemented), and only its intake draft moves to `done/`: + +- **`todo/`** — intake drafts. A **playground agent** (works in + `/Users/andrey/projects/kaelio/spider-clean-submission/playground`, runs the + benchmark, identifies ktx capability gaps) writes a draft spec here when it + finds a gap. +- **`specs/`** — refined specs. A **refinement pass** (brainstorming) takes a + `todo/` draft and produces a proper, implementation-ready spec at + `specs/.md`: sharpened requirements, resolved ambiguities, + acceptance criteria, and orientation hints. The refined spec is the **durable + artifact** the implementer builds from — it stays in `specs/` permanently and + never moves. +- **`done/`** — intake drafts whose feature has shipped (see below). + +The **ktx worktree agent** (started from a ktx repo worktree, e.g. +`/Users/andrey/conductor/workspaces/ktx/tallinn-v2`) implements from the +refined spec in `specs/` (falling back to the `todo/` draft only if no refined +spec exists yet). When the feature is implemented it: + +1. appends a short **"Implementation notes"** section to the refined spec in + `specs/` (what was built, where, any deviations); and +2. **moves the original intake draft from `todo/` to `done/`.** + +Location is status: `todo/` = draft awaiting implementation, `done/` = draft +whose feature shipped, `specs/` = refined specs (permanent home, do not move). 
+A draft and its refined spec share the same filename so they correspond +(`todo/01-foo.md` ↔ `specs/01-foo.md` ↔ `done/01-foo.md`). No other tracking. + +## Rules for specs + +1. **Generic, not benchmark-overfit.** ktx is a general-purpose product; the + benchmark only surfaces the need. Every spec must state a real-world use + case independent of Spider 2.0-Lite. If a requirement only makes sense for + the benchmark, it doesn't belong in ktx. +2. Specs are **requirement-level**, not implementation plans. Code pointers in + specs are orientation hints from exploration (line numbers may have + drifted); the implementer owns the design. +3. One spec per file, kebab-case, numeric prefix = suggested priority order. + A refined spec in `specs/` keeps the same filename as its `todo/` draft. + +## For the implementer + +- After implementing, rebuild and re-link the dev binary so the playground + picks it up: `pnpm run build && pnpm run link:dev` (provides `ktx-dev`). +- Add/extend tests in the ktx test suites; specs list acceptance criteria to + cover. +- Build from the refined spec in `specs/`. On completion, append + "Implementation notes" to that spec (it stays in `specs/`) and move the + intake draft from `todo/` to `done/`. +- If a spec turns out to be wrong or already satisfied, don't silently drop + it — record why in the refined spec's notes and move the draft to `done/` + explaining why no change was needed. diff --git a/spider2-specs/done/.gitkeep b/spider2-specs/done/.gitkeep new file mode 100644 index 00000000..e69de29b diff --git a/spider2-specs/done/01-connection-scoped-wiki.md b/spider2-specs/done/01-connection-scoped-wiki.md new file mode 100644 index 00000000..cbd220dc --- /dev/null +++ b/spider2-specs/done/01-connection-scoped-wiki.md @@ -0,0 +1,74 @@ +# Connection-scoped wiki pages + +## Problem + +Wiki pages have only two scopes today: `GLOBAL` and `USER` +(`packages/cli/src/context/wiki/types.ts`, frontmatter schema ~lines 14-29). 
+There is no way to associate a page with a connection. In a project with many +connections, all pages share one search index, so `wiki_search` for a generic +term ("orders", "revenue", "average order value") surfaces pages about the +wrong database. Concept names collide across databases constantly in +real-world multi-connection projects (several databases each with `orders`, +`customers`, etc.). + +Today, when `memory_ingest` is called with a `connectionId`, that id is only +used to scope which semantic-layer sources the triage agent can see +(`memory-agent.service.ts` ~46-72, ~107-109); it is **not** persisted on the +resulting wiki page in any form. + +## Generic use case + +Any org with multiple databases/warehouses in one ktx project: org-wide +definitions ("fiscal year starts in February") should be visible everywhere, +while database-specific conventions ("in the events DB, `user_id` is the +anonymous device id, not the account id") should not pollute searches about +other databases. + +## Requirements + +1. **Frontmatter field.** Add an optional `connections:` field to wiki page + frontmatter — a list of connection ids (accept a single string too, + normalize to list). + - **Absent or empty ⇒ unscoped: the page applies to all connections.** + This is exactly today's behavior, so every existing page is unaffected + (backward compatible by construction). +2. **Search filtering.** `wiki_search` (MCP tool, `context-tools.ts` ~46-64) + and `ktx wiki search` / `ktx wiki list` (CLI, + `knowledge-commands.ts`) accept an optional `connectionId`: + - With `connectionId: X` ⇒ return pages scoped to X **∪** unscoped pages. + - Without ⇒ current behavior, all pages. + - The filter must apply to **all three search lanes** (lexical FTS5, + semantic/embedding, token fallback) in + `local-knowledge.ts` / `sqlite-knowledge-index.ts` — not as a post-filter + that eats into the result limit unevenly. +3. 
**Index.** Persist the scoping in the `.ktx/db.sqlite` knowledge index + (the index is already re-synced from files on every search, + `local-knowledge.ts` ~286-310, so a schema addition + sync is sufficient). +4. **Write path.** The memory agent's wiki-write tool accepts the connections + field; when `memory_ingest` is invoked with a `connectionId`, the agent + should default new database-specific pages to that connection, while still + being allowed to write unscoped pages for clearly org-wide content (prompt + guidance, not a hard rule). +5. **`wiki_read` and refs are unchanged** — pages remain addressable by key + regardless of scoping; `connections` is a search/relevance concern only. +6. **Validation.** Warn (don't fail) when a page references a connection id + not present in `ktx.yaml` — config and content can evolve independently. + +## Acceptance criteria + +- A page with `connections: [db_a]` is returned by + `wiki_search(query, connectionId: "db_a")` and by an unfiltered search, but + **not** by `wiki_search(query, connectionId: "db_b")`. +- A page with no `connections` field is returned in all three cases above. +- Existing projects with no scoped pages behave identically before/after. +- Filtering works in each lane independently (test with embeddings disabled + to exercise lexical/token lanes alone). +- `memory_ingest(content, connectionId)` produces a page scoped to that + connection for database-specific content. + +## Benchmark context (motivation only) + +Spider 2.0-Lite local subset = one project with 30 SQLite connections whose +schemas share table/concept names (Northwind, sakila, two e-commerce DBs…). +External-knowledge docs (RFM definition, F1 overtake rules) are each relevant +to exactly one database and must not surface for the other 29. 
diff --git a/spider2-specs/done/02-verbatim-ingest-mode.md b/spider2-specs/done/02-verbatim-ingest-mode.md new file mode 100644 index 00000000..03e86a02 --- /dev/null +++ b/spider2-specs/done/02-verbatim-ingest-mode.md @@ -0,0 +1,71 @@ +# Verbatim ingest mode for authoritative documents + +## Problem + +`ktx ingest --text/--file` routes content through the memory agent +(`text-ingest.ts` ~246-357 → `memory-agent.service.ts`), an LLM triage loop +(30-step budget for `external_ingest`, content clipped at ~48k chars, +`memory-agent.service.ts` ~165) that may rewrite, condense, or split the +content before writing wiki pages. + +For *authoritative* documents — formula definitions, specs, runbooks, +compliance text — paraphrasing is a bug, not a feature: + +- exact thresholds, constants, and rule wording must survive byte-for-byte; +- lexical (BM25) search works best when the stored text matches the phrasing + users/agents will query with; +- ingestion should be deterministic and reproducible — same input file, same + resulting page. + +## Generic use case + +Any team ingesting documents that are already the source of truth: metric +definition sheets, SLA documents, calculation methodology docs, regulatory +text. The user wants ktx to *index and surface* the document, not to +re-author it. + +## Requirements + +1. **Flag.** `ktx ingest --file --verbatim` (apply to `--text` too). + Composes with the existing optional `--connection ` so the resulting + page can be connection-scoped (see spec 01). +2. **Body preservation is enforced by code, not by prompt.** The stored page + body must be the input content byte-for-byte. The LLM is used **only** to + generate metadata: `summary`, `tags`, `sl_refs`, suggested page key/slug + (and `connections` default from the flag). Implementation freedom: a + single constrained LLM call is fine — the full memory-agent loop is not + required for this mode. +3. 
**No clipping of the stored body.** The ~48k clip may apply to what is + *sent to the LLM* for metadata generation, never to what is *written* to + the wiki page. +4. **Existing frontmatter.** If the input file already has YAML frontmatter, + preserve user-provided fields and only fill gaps (don't overwrite an + explicit `summary` with a generated one). +5. **Key collisions.** Deterministic, non-destructive behavior: error or + suffix — never silently overwrite an existing page. +6. **Degraded mode.** With `llm.provider.backend: none`, `--verbatim` should + still work, deriving `summary` from the first heading/sentence and leaving + optional metadata empty. (Regular agent ingest can't do this; verbatim + mode can and should.) + +## Acceptance criteria + +- Ingesting a file with `--verbatim` produces a wiki page whose body is + byte-identical to the input (assert with a hash in tests). +- Running the same ingest twice is idempotent or fails loudly on the second + run (per requirement 5) — no duplicated/divergent pages. +- A >48k-char file is stored in full. +- `--verbatim --connection X` yields a page scoped to X (depends on spec 01; + if 01 isn't implemented yet, the flag composition can land later). +- Generated metadata makes the page findable: `wiki_search` for a phrase + from the document body returns it (lexical lane), and for a paraphrase of + its topic returns it when embeddings are enabled (semantic lane). + +## Benchmark context (motivation only) + +Spider 2.0-Lite ships 8 external-knowledge markdown docs (RFM bucket +definitions, haversine formula, F1 overtake rules…). Gold SQL was authored +against their exact text; an LLM paraphrase that drops a bucket boundary +loses a question. We currently work around this by hand-writing frontmatter +and copying files into `wiki/global/` — verbatim mode makes that a supported +ktx workflow instead of a manual step. 
diff --git a/spider2-specs/done/06-scan-tolerate-broken-objects.md b/spider2-specs/done/06-scan-tolerate-broken-objects.md new file mode 100644 index 00000000..c56e3dd5 --- /dev/null +++ b/spider2-specs/done/06-scan-tolerate-broken-objects.md @@ -0,0 +1,63 @@ +# Schema scan must tolerate individual objects that fail introspection + +> Priority: MEDIUM. Found during the first full Spider2-lite sqlite ingest +> (2026-06-13): one database (`oracle_sql`) failed to ingest **entirely** +> because a single broken VIEW errored during introspection, leaving that +> connection with no semantic layer at all. + +## Problem + +`ktx ingest ` aborts the whole database's schema scan when one +table/view errors during introspection/profiling. In `oracle_sql` the view +`emp_hire_periods_with_name` is defined as +`SELECT ehp.start_date, ehp.end_date ... FROM emp_hire_periods ehp ...` but the +base table has no `start_date`/`end_date` columns — so any attempt to read it +raises `no such column: ehp.start_date`. That single broken object failed the +ingest of all ~48 healthy tables/views in the database. + +A second, related symptom: setting `enabled_tables: [main.customers]` to work +around it produced a different hard failure (`Adapter "database schema" did not +recognize fetched source output`), so the documented allowlist escape hatch did +not provide a clean fallback either. + +## Generic use case + +Real databases routinely contain broken or inaccessible objects: views over +dropped/renamed columns, views referencing tables the connection role can't +read, permission-denied tables, or vendor system views that error. ktx should +ingest everything it *can* and skip what it can't — never let one bad object +zero out an entire connection's context. This is basic robustness for +production warehouses, not benchmark-specific. + +## Requirements + +1. 
**Per-object isolation.** If introspecting/profiling one table or view + throws, skip that object, record a warning (object name + error), and + continue scanning the rest. The connection's semantic layer is built from + the objects that succeeded. +2. **Surface, don't hide.** Report skipped objects in the ingest summary and in + `ktx status` (e.g. "oracle_sql: 1 object skipped — emp_hire_periods_with_name: + no such column ehp.start_date"). Honor `failureMode` for whole-connection + aborts, but a single bad object should not count as a connection failure. +3. **Views vs tables.** A broken view should never block base-table ingest. + Consider profiling views defensively (they are read-only projections). +4. **Allowlist fallback should work.** `enabled_tables` should reliably restrict + the scan to the listed objects (and the qualification format for sqlite must + be documented and accepted). Fix the `did not recognize fetched source + output` failure when the allowlist yields a small/edge-case set. + +## Acceptance criteria + +- Ingesting a sqlite DB containing one broken view plus N healthy tables yields + a semantic layer for the N healthy tables and a warning naming the broken view + — exit is success (not "failed"), subject to `failureMode`. +- The skipped object is listed in the ingest summary and `ktx status`. +- `enabled_tables` restricted to a subset ingests exactly that subset without the + adapter-output error. + +## Benchmark context (motivation only) + +`oracle_sql` (8 of the 135 sqlite questions) currently has no semantic layer +because of its one broken view; those questions must be solved from raw +`sql`-tool introspection instead of ktx's enriched context. Tolerant scanning +would restore enriched context for that database. 
diff --git a/spider2-specs/done/07-analytics-skill-sql-craft.md b/spider2-specs/done/07-analytics-skill-sql-craft.md new file mode 100644 index 00000000..97d64904 --- /dev/null +++ b/spider2-specs/done/07-analytics-skill-sql-craft.md @@ -0,0 +1,112 @@ +# Add universal SQL-authoring craft to the ktx-analytics skill + +> Priority: HIGH. The `ktx-analytics` skill currently tells the agent *which +> ktx tools to call and in what order*, but gives almost no guidance on +> *writing correct SQL*. In benchmark runs the agent reliably produced +> runnable SQL (0 execution errors) yet failed on correctness — precision, +> determinism, type mismatches, and answer completeness. These are universal +> analytics-engineering truths that every ktx user benefits from, so they +> belong in the shipped skill, not in any caller's prompt. + +## Scope guard (read first) + +Only **universally-true** SQL/analytics craft goes here — guidance that helps a +real ktx user querying a **live** database. The test for inclusion: *"Would this +advice be correct and useful for an analyst on a current, production database?"* + +**Dialect-specific syntax is out of scope here.** The v9 harnesses' only +per-dialect content (Snowflake: `DB.SCHEMA.TABLE` FQTNs, double-quoted +lowercase cols, VARIANT colon-paths; BigQuery: backtick FQTNs, `_TABLE_SUFFIX` +for sharded tables; sqlite: `strftime`/`julianday`) is genuinely useful but +belongs in a **dialect-aware** location (per-driver notes), not this flat +skill. Track separately as a follow-up; the rules below must stay +dialect-agnostic. + +Explicitly **do NOT** add (these are application/consumer concerns, not skill +concerns, and some are actively wrong for live data): +- Output-format contracts ("return a bare result set with exactly these + columns, no prose"). The skill is for interactive analysis and already + favors readable tables + summaries; a caller that needs a strict result + shape specifies that itself. 
+- Anchoring relative time ("recent", "past N months") to `MAX(date)` of the + data. On a live database "recent" means relative to *now*; this is only true + for static snapshots and must not be baked into the product. +- Anything justified by a grader/scoring comparator. + +## File + +`packages/cli/src/skills/analytics/SKILL.md` (the shipped skill; +`setup-agents.ts` installs it into agent environments — the copy under a +project's `.claude/skills/` is regenerated from this source). Extend the +existing `` block and step 5 ("Query") / step 6 ("Validate and +explain"); keep the existing interactive guidance intact. + +## Requirements — add these as general rules (behavior only, no rationale that +references answers/graders) + +**Schema discovery before writing SQL** +1. Inspect representative sample rows of each table before composing SQL — + confirm date/time encoding (e.g. `YYYYMMDD` vs ISO vs epoch), null + prevalence in join/filter keys, and the actual set of categorical/enum + values. (`entity_details` + a small `sql_execution` sample.) +2. Cast a column to its real type before comparing it in `WHERE`/`JOIN`. A + string column compared against a numeric literal (or vice versa) can + silently match nothing. + +**Composition discipline** +3. Build complex queries incrementally — one CTE at a time, verifying each + layer's output on a small sample before stacking the next. +4. Avoid joins that fan out row counts. Add columns only from tables already + required by the grain, or pre-aggregate to the target grain before joining. + +**Window-function correctness** +5. Give every ranking/ordering window function a complete, deterministic + tie-breaker (append unique key columns), so `RANK`/`ROW_NUMBER`/`LAG` + results are stable rather than flickering across runs. +6. Apply row filters **after** window functions for sequence / "first" / + "most recent" / "since" questions — compute over the full partition, then + filter. + +**Numeric precision** +7. 
Compute at full precision; round only in the final projection, never inside + intermediate CTEs. +8. Be explicit about truncation (`CAST AS INT` truncates; use explicit + rounding when rounding is intended). +9. Distinguish "average of per-group averages" (macro: `AVG(group_metric)`) + from "overall/weighted average" (micro: `SUM(num)/SUM(den)`) based on the + question's wording. + +**Answer completeness / interpretation** +10. "top / highest / most / lowest" → return only the winning row(s) (e.g. + `RANK() = 1` / `QUALIFY`), not the full ranked list, unless a list is asked + for. +11. "for each X / per X / by X" → exactly one row per X; don't collapse to a + single value unless the question says "overall" or "total across X". +12. When a question asks for inputs and a derived value ("X, Y, and their + ratio"), include the inputs as columns alongside the derived value. +13. When grouping by a human-readable label (a name), also expose the entity's + identifier — identity, not just the label, is part of the result. +14. When a result is unexpectedly empty, relax filters one at a time to find + which predicate removed the rows. + +## Acceptance criteria + +- The shipped `analytics/SKILL.md` contains the rules above, phrased as general + truths with **no reference to any benchmark, gold answer, or scoring + comparator**. +- Existing interactive guidance (compact result tables, summaries, + clarification prompts, the tool-order workflow) is preserved — the skill must + still read well for an interactive human-facing analysis session. +- None of the excluded items (output-shape contract, `MAX(date)` anchoring, + grader-driven advice) appear. +- Skill stays within a reasonable size; group the new rules under clear + sub-headings so they're scannable. 
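
Rule 6 above (filter after window functions) can be illustrated with a small runnable sketch. It assumes an invented `orders` table and uses Python's standard `sqlite3` module only as a convenient engine; none of the names come from the shipped skill. The question modeled here is "for each 2024 order, what was the customer's previous order amount?".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (
  order_id INTEGER PRIMARY KEY, customer_id INTEGER, order_date TEXT, amount REAL
);
INSERT INTO orders VALUES
  (1, 10, '2023-11-05', 40.0),
  (2, 10, '2024-01-12', 55.0),
  (3, 10, '2024-02-20', 70.0);
""")

# WRONG: the WHERE clause runs before the window, so LAG never sees the 2023 order.
filter_first = conn.execute("""
SELECT order_id,
       LAG(amount) OVER (PARTITION BY customer_id ORDER BY order_date, order_id) AS prev_amount
FROM orders
WHERE order_date >= '2024-01-01'
ORDER BY order_id
""").fetchall()

# RIGHT: compute LAG over the full partition in a CTE, then filter the result.
# The ORDER BY includes order_id as a unique tie-breaker (rule 5).
window_first = conn.execute("""
WITH with_prev AS (
  SELECT order_id, order_date,
         LAG(amount) OVER (PARTITION BY customer_id ORDER BY order_date, order_id) AS prev_amount
  FROM orders
)
SELECT order_id, prev_amount
FROM with_prev
WHERE order_date >= '2024-01-01'
ORDER BY order_id
""").fetchall()

print(filter_first)   # first 2024 order loses its predecessor
print(window_first)   # first 2024 order keeps the 2023 amount
```

Both queries run without error, which is why the mistake is easy to miss: only the first 2024 row differs.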
+ +## Benchmark context (motivation only) + +On the Spider 2.0-Lite sqlite subset, the solver produced 0 execution errors +but ~50 result mismatches; a large share traced to exactly these gaps +(premature rounding, string-vs-number compares, non-deterministic window +ordering, returning full lists for "top" questions, dropping inputs to derived +values). These are generic SQL-authoring defects — fixing them in the skill +improves ktx for everyone and, as a side effect, the benchmark. diff --git a/spider2-specs/done/08-per-dialect-sql-syntax-notes.md b/spider2-specs/done/08-per-dialect-sql-syntax-notes.md new file mode 100644 index 00000000..3cb0a815 --- /dev/null +++ b/spider2-specs/done/08-per-dialect-sql-syntax-notes.md @@ -0,0 +1,83 @@ +# Per-dialect SQL syntax notes (dialect-aware, scoped to the connection) + +> Intake draft. Companion to `specs/07-analytics-skill-sql-craft.md`, which kept +> the analytics SQL craft dialect-agnostic and explicitly deferred per-dialect +> syntax here. + +## Problem + +Spec 07 deliberately keeps the analytics SQL-authoring craft +**dialect-agnostic** — every rule must read correctly on any engine. But a lot of +*real* correctness depends on dialect-specific syntax that spec 07 excludes and +defers to this follow-up: + +- **Snowflake:** `DB.SCHEMA.TABLE` FQTNs, double-quoted lowercase identifiers, + VARIANT colon-paths. +- **BigQuery:** backtick FQTNs, `_TABLE_SUFFIX` for sharded tables, `QUALIFY`. +- **sqlite:** `strftime`/`julianday` for dates, no `QUALIFY`. + +This guidance is genuinely useful to an agent writing SQL against a live +database, but it must **not** pollute the flat dialect-agnostic skill — an agent +querying sqlite should never see Snowflake VARIANT syntax. It belongs in a +**dialect-aware** location, surfaced only for the dialect the active connection +actually uses. + +## Generic use case + +Any ktx project whose connections span more than one warehouse engine (e.g. 
a +Snowflake warehouse + a BigQuery export + a local sqlite extract). When the agent +writes SQL for a given connection, it should get that engine's syntax +conventions — and nothing for the engines it isn't querying. + +## Requirements + +1. **Per-driver dialect notes.** Author concise, correct syntax notes per + supported driver: FQTN form, identifier quoting/case, date/time functions, + top-N / window-filtering idiom, semi-structured access. These are genuine + per-engine invariants, so enumerating them per driver is acceptable (unlike a + denylist of bad specifics). +2. **Scope to the active dialect, derived from state.** Which notes the agent + sees must be selected from the connection's configured driver/dialect + (`ktx.yaml` connections / the connector registry), not guessed and not shown + all at once. The flat analytics skill stays dialect-agnostic (spec 07 + invariant preserved). +3. **Delivery mechanism (enabling sub-requirement).** The shipped skill is + installed as a **single `SKILL.md`** per target (`setup-agents.ts` / + `readAnalyticsSkillContent`). Surfacing per-dialect notes on demand needs one + of two approaches; the refinement pass should compare them before committing: + - **Multi-file skill delivery** — bundle `reference/.md` files and + have the skill point to the one matching the connection. Requires extending + `setup-agents.ts` to copy a skill *directory* (Claude Code, Codex, universal + `.agents`) and a multi-file zip (Claude Desktop), a **flatten/concatenate + transform** for the single-file targets (Cursor `.mdc`, OpenCode `.md`), and + **per-file manifest entries** for clean uninstall. This is the + install-mechanism improvement spec 07's Model section flags as future work. + - **Dynamic MCP delivery** — an MCP surface returns the dialect hints for a + given `connectionId` (the MCP layer already resolves the connection's + dialect), so no install change is needed and Cursor/OpenCode get identical + behavior. 
May be the lower-cost, more uniform path; weigh it first. +4. **No dialect syntax leaks into the dialect-agnostic skill.** Spec 07's + acceptance criterion (no `QUALIFY`/`strftime`/`julianday`/backtick-FQTN/etc. in + `analytics/SKILL.md`) stays green. This work adds a *separate* dialect-aware + channel; it does not amend the flat skill. + +## Acceptance criteria + +- An agent querying a sqlite connection gets sqlite date idioms and never sees + Snowflake/BigQuery-only syntax; an agent querying Snowflake gets + FQTN/identifier/VARIANT guidance. +- The dialect shown is **derived from the connection's configured driver**, not + hardcoded per project and not guessed. +- `analytics/SKILL.md` remains dialect-agnostic — spec 07's criteria are + unaffected. +- Whichever delivery mechanism is chosen installs/serves correctly across **all** + supported agent targets, including the single-file Cursor/OpenCode shape. + +## Benchmark context (motivation only) + +The Spider 2.0-Lite v9 harnesses' only per-dialect content was Snowflake +(`DB.SCHEMA.TABLE` FQTNs, double-quoted lowercase cols, VARIANT colon-paths), +BigQuery (backtick FQTNs, `_TABLE_SUFFIX` for sharded tables), and sqlite +(`strftime`/`julianday`). That content is real and useful but engine-specific; +spec 07 kept it out of the flat skill and deferred it here so the +dialect-agnostic rules stay clean. 
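
Requirement 2 (select notes from the connection's configured driver, never guess) can be sketched as a plain lookup. Everything here is hypothetical: `DIALECT_NOTES`, `dialect_notes_for`, and the `connections` mapping are illustrative stand-ins for whatever the chosen delivery mechanism uses, and the note text only restates the bullets from the Problem section.

```python
from typing import Dict, List

# Hypothetical per-driver notes, restating the spec's own examples.
DIALECT_NOTES: Dict[str, List[str]] = {
    "snowflake": [
        "FQTN form: DB.SCHEMA.TABLE",
        "Double-quote lowercase identifiers",
        "Semi-structured access: VARIANT colon-paths",
    ],
    "bigquery": [
        "FQTN form: backtick-quoted",
        "Sharded tables: filter on _TABLE_SUFFIX",
        "QUALIFY is available for window filtering",
    ],
    "sqlite": [
        "Dates: strftime / julianday",
        "No QUALIFY: filter window results in an outer query or CTE",
    ],
}


def dialect_notes_for(connection_id: str, connections: Dict[str, dict]) -> List[str]:
    """Return notes for the driver configured on this connection only."""
    if connection_id not in connections:
        raise KeyError(f"unknown connection id: {connection_id}")
    driver = connections[connection_id]["driver"]
    if driver not in DIALECT_NOTES:
        raise KeyError(f"no dialect notes for driver: {driver}")
    return DIALECT_NOTES[driver]


# Assumed shape of the configured connections (illustrative, not ktx.yaml's schema).
connections = {
    "local_extract": {"driver": "sqlite"},
    "warehouse": {"driver": "snowflake"},
}

sqlite_notes = dialect_notes_for("local_extract", connections)
print("\n".join(sqlite_notes))
```

Because the lookup is keyed on the configured driver, an agent querying the sqlite connection never receives the Snowflake or BigQuery entries, which is the behavior the acceptance criteria ask for.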
diff --git a/spider2-specs/done/09-fan-out-safe-multi-hop-aggregation.md b/spider2-specs/done/09-fan-out-safe-multi-hop-aggregation.md new file mode 100644 index 00000000..12334325 --- /dev/null +++ b/spider2-specs/done/09-fan-out-safe-multi-hop-aggregation.md @@ -0,0 +1,150 @@ +# Strengthen fan-out-join safety for multi-hop aggregation in the analytics skill + +## Problem + +The `ktx-analytics` skill already carries a fan-out rule (spec 07, rule 4: +*"Avoid fan-out joins — add columns only from tables already at the target +grain, or pre-aggregate to that grain before joining; a join that multiplies +rows quietly inflates every downstream `SUM`/`COUNT`"*). In practice the agent +honors it on a single join but still **silently fans out on multi-hop join +chains**, where the inflation is one or two joins removed from the aggregate and +therefore much harder to notice. + +The failure shape: a metric that lives at a *coarse* grain (e.g. one row per +parent record) is counted/summed *after* the parent has been joined down to a +*finer* grain (e.g. one row per child line). Every parent-level value is then +duplicated by its child fan-out, so `COUNT(*)` / `SUM(amount)` over-counts by an +amount that depends on the data — runnable SQL, plausible-looking number, +quietly wrong. + +The rule today is stated as a *prohibition* ("avoid"). It needs to be a +*detect-and-fix habit*: a concrete multi-hop example of the trap, and an active +verification step the agent runs while composing, not just an instruction to be +careful. + +## Generic use case (independent of any benchmark) + +An analyst on any production warehouse asks: *"How many orders are there per +region?"* where the path from region to the order's detail runs through several +hops (region → store → order → order line). The honest answer counts each order +once. If the query descends to the line-level table along the way (e.g.
for a +filter), each order is counted once **per line on the order**, inflating the +per-region total. Attribution here is unambiguous — each order belongs to exactly +one store and thus one region — so the *only* thing that can go wrong is the row +multiplication, which is exactly what makes it a clean teaching case. This is one +of the most common silently-wrong analytics mistakes on normalized schemas — it +is not +specific to any dataset, dialect, or benchmark. + +## Requirements + +This extends the existing `` "Composition" guidance in the +`ktx-analytics` skill (spec 07). Additive only; keep it inline, dialect-agnostic, +and stated as a heuristic-plus-why (consistent with spec 07's style). + +1. **Generalize the fan-out rule to multi-hop chains.** Make explicit that the + danger is *cumulative*: any one-to-many hop on the path between the table that + owns a measure and the aggregate inflates that measure, even when the + offending join is several hops away from the `SUM`/`COUNT`. The fix is the + same as the single-hop case — **pre-aggregate the measure to its own grain in + a CTE, then join the already-aggregated result** — but the agent must apply it + per measure-owning table along the whole chain, not just at the final join. + +2. **Add a verification habit, not just a prohibition.** While composing, the + agent should confirm a join did not change the grain it intends to aggregate + at — e.g. check that the row count (or the count of the aggregate's key) is + unchanged across a join that is supposed to be one-to-one / many-to-one, and + pre-aggregate the finer table to grain when it is one-to-many. This is the same + "build incrementally and check each layer" discipline spec 07 already endorses, + pointed specifically at grain preservation. 
+ + **Pre-aggregate is the general fix; `COUNT(DISTINCT)` is a count-only + shortcut.** Pre-aggregating the finer table to the measure's grain in a CTE and + then joining one-to-one is the remedy that works for every aggregate + (`COUNT`/`SUM`/`AVG`). `COUNT(DISTINCT )` is a valid one-liner *for counts + only* — it must NOT be generalized to a fanned-out `SUM`/`AVG`, because two + rows can legitimately hold equal amounts and `DISTINCT` would wrongly collapse + them. State this trap explicitly; a naïve "just use `COUNT(DISTINCT)`" rule is + silently wrong for sums. + +3. **One concrete, generic multi-hop example.** Include a short worked example + that shows the inflation and the fix. It must use an **invented, generic + schema** — **no benchmark table names, no benchmark SQL, and no benchmark + result values** (see "Leak-safety" below — hard constraint). The example must: + (a) use a **plain `COUNT`** (not an average) so it isolates the fan-out lesson + and does not entangle the skill's separate *macro-vs-micro average* rule; and + (b) use a chain with **unambiguous single-owner attribution** so the only thing + that can go wrong is row multiplication. The intended example is the chain + `regions → stores → orders → order_lines` answering *"how many orders per region + include at least one backordered line"* — each order belongs to exactly one + store and thus exactly one region, so attribution is clean; the line-level + filter gives `order_lines` a genuine reason to be joined (so the fix is the + pre-aggregate remedy, not "drop the join"), and that join sits **several hops + below** the region-level COUNT (the multi-hop point): + + ```sql + -- "How many orders per region include at least one backordered line?" + -- (order_lines is genuinely needed here — for the backordered filter — so the + -- fix is NOT "just drop the join".) + -- WRONG: the order_lines join is one row per matching line, joined several hops + -- BELOW the COUNT. 
An order with 3 backordered lines is counted 3 times, so the + -- per-region total is inflated by backordered-lines-per-order — silently wrong. + SELECT r.region_id, COUNT(*) AS n_orders + FROM regions r + JOIN stores s ON s.region_id = r.region_id + JOIN orders o ON o.store_id = s.store_id + JOIN order_lines l ON l.order_id = o.order_id AND l.is_backordered -- one-to-many: fan-out + GROUP BY r.region_id; + + -- RIGHT (general remedy): collapse the finer table to the measure's grain in a + -- CTE FIRST, then join one-to-one so nothing multiplies. This same shape works + -- for SUM/AVG, not just COUNT. + WITH qualifying_orders AS ( -- back to ONE row per order + SELECT DISTINCT order_id FROM order_lines WHERE is_backordered + ) + SELECT r.region_id, COUNT(*) AS n_orders + FROM regions r + JOIN stores s ON s.region_id = r.region_id + JOIN orders o ON o.store_id = s.store_id + JOIN qualifying_orders q ON q.order_id = o.order_id + GROUP BY r.region_id; + + -- Count-only shortcut: COUNT(DISTINCT o.order_id) over the WRONG query also works + -- HERE. But it is counts-only — a fanned-out SUM/AVG of a per-order measure (e.g. + -- summing each order's shipping_fee after joining lines) must pre-aggregate; + -- DISTINCT would wrongly merge two orders that happen to share the same fee. + ``` + +## Leak-safety (hard constraint on this spec and its example) + +The benchmark's gold answers must never appear in ktx. The worked example must +be a **synthetic, generic schema invented for teaching** — not the tables, +column names, query, or numeric results of any Spider 2.0-Lite question. The +example demonstrates the *pattern* (coarse-grain measure counted after a +one-to-many join), which is universal; it must be reconstructable from first +principles by anyone, with zero reference to benchmark data. A reviewer should +be able to read the example and find nothing that ties it to a specific +benchmark instance. 
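
The closing comment of the example above (the `COUNT(DISTINCT)` shortcut does not carry over to sums) can be checked with a runnable sketch. It reuses the same invented `regions → stores → orders → order_lines` chain with made-up rows, and assumes SQLite via Python's standard `sqlite3` module purely as a test bed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE regions (region_id INTEGER PRIMARY KEY);
CREATE TABLE stores (store_id INTEGER PRIMARY KEY, region_id INTEGER);
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, store_id INTEGER, shipping_fee REAL);
CREATE TABLE order_lines (line_id INTEGER PRIMARY KEY, order_id INTEGER, is_backordered INTEGER);
INSERT INTO regions VALUES (1);
INSERT INTO stores VALUES (100, 1);
-- Orders 1 and 2 happen to share the same shipping fee.
INSERT INTO orders VALUES (1, 100, 5.0), (2, 100, 5.0), (3, 100, 7.0);
-- Order 1 has three backordered lines, order 2 has one, order 3 has none.
INSERT INTO order_lines VALUES (1, 1, 1), (2, 1, 1), (3, 1, 1), (4, 2, 1), (5, 3, 0);
""")

CHAIN = """
FROM regions r
JOIN stores s ON s.region_id = r.region_id
JOIN orders o ON o.store_id = s.store_id
"""
FANNED_OUT = CHAIN + "JOIN order_lines l ON l.order_id = o.order_id AND l.is_backordered = 1\n"


def scalar(select_list: str, from_clause: str) -> float:
    """Run SELECT <select_list> <from_clause> GROUP BY region and return the one value."""
    sql = f"SELECT {select_list} {from_clause} GROUP BY r.region_id"
    return conn.execute(sql).fetchone()[0]


inflated_count = scalar("COUNT(*)", FANNED_OUT)                   # one row per line
distinct_count = scalar("COUNT(DISTINCT o.order_id)", FANNED_OUT)  # count-only shortcut
inflated_sum = scalar("SUM(o.shipping_fee)", FANNED_OUT)          # fee repeated per line
distinct_sum = scalar("SUM(DISTINCT o.shipping_fee)", FANNED_OUT)  # merges equal fees

# General remedy: collapse order_lines to one row per order, then join one-to-one.
correct_count, correct_sum = conn.execute(f"""
WITH qualifying_orders AS (
  SELECT DISTINCT order_id FROM order_lines WHERE is_backordered = 1
)
SELECT COUNT(*), SUM(o.shipping_fee)
{CHAIN}
JOIN qualifying_orders q ON q.order_id = o.order_id
GROUP BY r.region_id
""").fetchone()

print(inflated_count, distinct_count, correct_count)
print(inflated_sum, distinct_sum, correct_sum)
```

On this data the shortcut repairs the count but not the sum: the fanned-out `SUM` is too high, `SUM(DISTINCT)` is too low because two orders share a fee, and only the pre-aggregated query gets both right.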
+ +## Acceptance criteria + +- The skill's `` Composition section states the multi-hop + generalization of the fan-out rule and a grain-verification habit, inline and + dialect-agnostic. +- It includes exactly one short, **generic** worked example (wrong vs. + pre-aggregated-right) using an invented schema, with no benchmark-derived + identifiers or values. +- No new tool, flag, or config; this is skill-content only (additive to spec 07). +- Existing analytics-skill content tests are updated to cover the added rule's + presence (mirroring spec 07's `analytics-skill-content.test.ts`). + +## Benchmark context (motivation only) + +Multi-hop aggregation questions (counting/averaging a coarse-grained measure +reached through several one-to-many joins) are a recurring source of +result-mismatch failures in the SQLite subset: the agent produces runnable SQL +with the right tables but a fan-out-inflated number. These are correctness +failures, not knowledge or schema-discovery failures (zero execution errors in +the latest run), so the fix belongs in the product's authoring craft — where it +also helps any real analyst — not in a benchmark-specific prompt. +``` diff --git a/spider2-specs/done/10-panel-completeness-spine.md b/spider2-specs/done/10-panel-completeness-spine.md new file mode 100644 index 00000000..91b9294b --- /dev/null +++ b/spider2-specs/done/10-panel-completeness-spine.md @@ -0,0 +1,65 @@ +# Panel/period completeness — emit the full set of groups, not only the populated ones + +## Problem + +When a question asks for a result *per period* or *per category* ("orders for each +month of 2023", "revenue by region", "count per status"), the natural `GROUP BY` +only returns groups that actually have rows. Periods/categories with **zero** +activity silently vanish, so a "12 months" answer comes back with 9 rows and the +ones that should read `0` are simply absent. The agent writes runnable SQL with +the right aggregate but an **incomplete panel**. 
+ +This is a universal reporting correctness issue: a monthly report with missing +months, or a category breakdown missing the empty categories, is wrong for any +analyst — and it is also a frequent result-mismatch shape on the benchmark. + +## Generic use case (independent of any benchmark) + +"How many orders were placed in each month of 2023?" must return **12 rows** even +if March had no orders (March = 0), not 11 rows. "Sales per region" should include +regions with no sales (as 0/NULL) when the question asks for *each* region. + +## Requirements + +Additive to the `ktx-analytics` skill's `` "Answer completeness / +interpretation" group (consistent with spec 07's inline, dialect-agnostic, heuristic ++ why style). + +1. **Recognize "full-panel" phrasing.** Cues like *each / every / per / + for all / by month* signal that the answer's row set should be the + **complete** set of periods or categories in scope, not just those present in + the filtered fact rows. + +2. **Build a spine, then LEFT JOIN.** Generate the full set of expected + groups — a date/number series via a recursive CTE for periods, or the distinct + dimension values from the authoritative dimension table for categories — and + LEFT JOIN the aggregated facts onto it, defaulting missing measures with + `COALESCE(metric, 0)` (or NULL when 0 would be wrong). *Why:* a plain inner + `GROUP BY` can only emit groups that have at least one fact row. + +3. **Don't over-apply.** When the question asks only about groups that exist + ("which months had orders"), the spine is unnecessary; the cue is *each/all* + vs *which*. + +## Leak-safety (hard constraint) + +Any worked example must use a **synthetic generic schema** (e.g. an `orders` +table with an `order_date`) and demonstrate only the *pattern* (spine + LEFT JOIN ++ COALESCE). No benchmark table names, SQL, or result values. The behavior is +reconstructable from first principles and tied to no specific instance. 
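
The spine + LEFT JOIN + COALESCE recipe from the Requirements can be sketched on a synthetic `orders(order_date)` table. This is an illustration under assumptions: it runs on SQLite through Python's standard `sqlite3` module, and `substr`/`printf` are SQLite-flavored helpers that another engine would spell differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, order_date TEXT);
INSERT INTO orders (order_date) VALUES
  ('2023-01-15'), ('2023-01-20'), ('2023-02-03'), ('2023-04-09');
""")

# Plain GROUP BY: only months that have at least one order come back.
plain = conn.execute("""
SELECT substr(order_date, 1, 7) AS month, COUNT(*) AS n_orders
FROM orders
WHERE order_date >= '2023-01-01' AND order_date < '2024-01-01'
GROUP BY substr(order_date, 1, 7)
ORDER BY month
""").fetchall()

# Spine: generate all 12 months, then LEFT JOIN the aggregated facts onto it.
with_spine = conn.execute("""
WITH RECURSIVE month_numbers(n) AS (
  SELECT 1
  UNION ALL
  SELECT n + 1 FROM month_numbers WHERE n < 12
),
spine AS (
  SELECT printf('2023-%02d', n) AS month FROM month_numbers
),
monthly AS (
  SELECT substr(order_date, 1, 7) AS month, COUNT(*) AS n_orders
  FROM orders
  WHERE order_date >= '2023-01-01' AND order_date < '2024-01-01'
  GROUP BY substr(order_date, 1, 7)
)
SELECT s.month, COALESCE(m.n_orders, 0) AS n_orders
FROM spine s
LEFT JOIN monthly m ON m.month = s.month
ORDER BY s.month
""").fetchall()

print(len(plain), len(with_spine))
```

The facts are aggregated before the join, so the spine only decides which rows exist; it never changes the counts for months that do have orders.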
+ +## Acceptance criteria + +- `` states the full-panel cue, the spine + LEFT JOIN + COALESCE recipe, + and the over-application guard — inline and dialect-agnostic. +- At most one short generic example (recursive-CTE date spine or distinct-dimension + spine), no benchmark-derived content. +- Skill-content only; analytics-skill content tests updated to cover the rule. + +## Benchmark context (motivation only) + +Per-period / per-category questions where some periods are empty produce +short-row result mismatches in the SQLite subset. The fix is a universal +reporting habit (complete panels), so it belongs in the product's craft, where it +also helps real analysts — not in a benchmark-specific prompt. Related to spec 11 +(rolling/cumulative windows need a complete date spine to be correct). diff --git a/spider2-specs/done/11-time-series-window-recipes.md b/spider2-specs/done/11-time-series-window-recipes.md new file mode 100644 index 00000000..7c9bb355 --- /dev/null +++ b/spider2-specs/done/11-time-series-window-recipes.md @@ -0,0 +1,73 @@ +# Time-series window craft — running totals, rolling-N (min-periods), period-over-period + +## Problem + +A large share of analytics questions are time-series shaped: a **running/cumulative +balance**, a **rolling N-day average**, or **period-over-period growth**. The agent +knows window functions exist (spec 07 covers determinism and window-then-filter) but +gets the *time-series specifics* wrong: + +- cumulative balance computed without an unbounded preceding frame (or with the + frame defaulting incorrectly when there are ties on the order key); +- "rolling 30-day" implemented as `ROWS BETWEEN 29 PRECEDING` over **gappy** daily + data, so the window spans the wrong calendar span when days are missing; +- no **minimum-periods** handling — a rolling average is reported before the window + is actually full; +- "growth vs previous period" without `LAG`, or comparing to the wrong neighbor. 
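
Two of the failure shapes listed above (a default frame with ties on the order key, and a row-count frame over gappy days) can be reproduced in a short sketch. The `account_txns` and `daily_revenue` tables and their rows are invented, SQLite via Python's standard `sqlite3` module is only an assumed test bed, and the calendar-window variant treats missing days as absent rather than as zero, which is itself a choice the question's wording should settle.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE account_txns (txn_id INTEGER PRIMARY KEY, account_id INTEGER, txn_date TEXT, net REAL);
INSERT INTO account_txns VALUES
  (1, 7, '2023-01-10', 100.0),
  (2, 7, '2023-01-10', 50.0),   -- same date as txn 1: a tie on the order key
  (3, 7, '2023-01-11', -30.0);
CREATE TABLE daily_revenue (day TEXT PRIMARY KEY, amount REAL);
INSERT INTO daily_revenue VALUES
  ('2023-01-01', 10.0), ('2023-01-02', 20.0), ('2023-01-05', 30.0);  -- 03 and 04 missing
""")

# Default frame: tied rows are peers, so both get the sum through the whole tie.
default_frame = conn.execute("""
SELECT txn_id, SUM(net) OVER (PARTITION BY account_id ORDER BY txn_date) AS running
FROM account_txns ORDER BY txn_id
""").fetchall()

# Explicit ROWS frame plus a unique tie-breaker gives a true row-by-row running total.
explicit_frame = conn.execute("""
SELECT txn_id, SUM(net) OVER (
         PARTITION BY account_id ORDER BY txn_date, txn_id
         ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running
FROM account_txns ORDER BY txn_id
""").fetchall()

# "Rolling 3-day average" as a row-count frame: on gappy data it reaches back 5 days.
rolling_rows = conn.execute("""
SELECT day, AVG(amount) OVER (ORDER BY day ROWS BETWEEN 2 PRECEDING AND CURRENT ROW)
FROM daily_revenue ORDER BY day
""").fetchall()

# Same metric keyed on the calendar: only rows within the last 3 dates are averaged.
rolling_calendar = conn.execute("""
SELECT d.day,
       (SELECT AVG(p.amount) FROM daily_revenue p
        WHERE p.day BETWEEN date(d.day, '-2 days') AND d.day)
FROM daily_revenue d ORDER BY d.day
""").fetchall()

print(default_frame, explicit_frame)
print(rolling_rows[-1], rolling_calendar[-1])
```

The two rolling variants agree on the first two days and diverge only after the gap, which matches the "structure is close, the edge case diverges" description.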
+ +These are runnable-but-wrong; the structure is close, the edge case diverges. + +## Generic use case (independent of any benchmark) + +- "Each account's month-end running balance over 2023" — cumulative sum of monthly + net over an ordered window. +- "30-day rolling average of daily revenue, only once 30 days of history exist." +- "Month-over-month revenue growth rate." + +All three are bread-and-butter for any analyst on any time-series table. + +## Requirements + +Additive to the `ktx-analytics` skill's `` "Window functions" group +(inline, dialect-agnostic, heuristic + why). + +1. **Cumulative / running total.** `SUM(x) OVER (PARTITION BY k ORDER BY t ROWS + BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)`, with a complete tie-breaker in + `ORDER BY` (spec 07 rule). *Why:* the default frame with a non-unique `ORDER BY` + can include/exclude peers unexpectedly. + +2. **Rolling window over time, not over rows.** When "rolling N days/months" is + asked, the window must span a calendar range. Over gappy data, either build a + complete date spine first (see spec 10) so `ROWS BETWEEN n-1 PRECEDING` equals + the intended span, or use a range/self-join keyed on the date. *Why:* row-count + frames over missing dates silently measure the wrong span. + +3. **Minimum periods.** When the question says "only after N periods of data" (or + it is implied by a rolling metric), emit NULL/skip until the window is full + (e.g. guard on `COUNT(*) OVER (...) = N`). *Why:* a partial early window is not + the requested metric. + +4. **Period-over-period.** Use `LAG(metric) OVER (PARTITION BY k ORDER BY period)` + for prior-period comparisons; growth rate = `(cur - prev) / prev` computed at + full precision (round only at the end). Guard divide-by-zero/NULL prev. + +## Leak-safety (hard constraint) + +Worked examples must use a **synthetic generic schema** (e.g. `daily_revenue(day, +amount)` or `account_txns(account_id, txn_date, net)`) and show only the *pattern*. 
+No benchmark table names, SQL, or result values. + +## Acceptance criteria + +- `` "Window functions" gains the cumulative, rolling-over-time + + min-periods, and period-over-period recipes — inline, dialect-agnostic. +- At most one or two compact generic examples; no benchmark-derived content. +- Skill-content only; analytics-skill content tests updated. + +## Benchmark context (motivation only) + +Running-balance / rolling / period-over-period questions are the single largest +result-mismatch cluster in the SQLite subset (financial-transactions style DBs). +The methodology is universal analyst craft, so it belongs in the product's skill +(transfers to real users), not in a benchmark-specific prompt. Depends on spec 10 +(date spine) for the gappy-rolling case. diff --git a/spider2-specs/done/12-parse-text-encoded-numbers.md b/spider2-specs/done/12-parse-text-encoded-numbers.md new file mode 100644 index 00000000..43100e6c --- /dev/null +++ b/spider2-specs/done/12-parse-text-encoded-numbers.md @@ -0,0 +1,61 @@ +# Parse text-encoded numeric columns before doing math on them + +## Problem + +Numeric measures are often stored as **text** with human formatting: unit suffixes +(`"1.2K"`, `"3M"`, `"4B"`), currency symbols and thousands separators (`"$1,200"`), +percent signs (`"12%"`), or non-numeric sentinels for missing/zero (`"-"`, `"N/A"`, +`""`). Aggregating or comparing such a column directly is silently wrong: string +comparison orders `"100" < "9"`, and a naive `CAST(x AS REAL)` yields `0`/NULL on +the formatted values rather than the intended number. + +The agent already samples schemas (spec 07 schema-discovery), but when it sees a +"numeric" column it tends to assume it is a real number type and skips the parse — +so the arithmetic runs on garbage. Runnable, plausible, wrong. 
+ +## Generic use case (independent of any benchmark) + +A `trade_volume` column stored as `"1.2K" / "3M" / "-"` must become `1200 / 3000000 +/ 0` before you can sum it or compute a daily change. A `price` stored as +`"$1,299.00"` must become `1299.00` before averaging. This is routine data hygiene +on real, messy production tables. + +## Requirements + +Extend the `ktx-analytics` skill's `` "Schema discovery before writing +SQL" group (inline, dialect-agnostic, heuristic + why). + +1. **Detect text-encoded numerics during sampling.** When a column that the + question treats as a number is stored as text, sample distinct values to learn + the encodings actually present (suffixes, symbols, separators, sentinels) before + composing — never assume the format from the column name. + +2. **Parse and scale before arithmetic.** Strip currency/separator/percent + characters; multiply by the suffix scale (K=10^3, M=10^6, B=10^9); map sentinels + (`-`, `N/A`, empty) to `0` or `NULL` per the question's intent; then cast to a + numeric type. Do this in an early CTE so all downstream math sees clean numbers. + *Why:* string columns compared/aggregated as-is sort lexically and cast to 0, + producing silently wrong results instead of errors. + +3. **Confirm coverage.** After parsing, sanity-check that no intended-numeric value + failed to parse (would surface as NULL), to catch an encoding the sample missed. + +## Leak-safety (hard constraint) + +Worked examples must use a **synthetic generic schema** and made-up values (e.g. a +`metrics(label, value_text)` table with `"1.2K"`, `"-"`). No benchmark table names, +SQL, or result values; the parsing pattern is universal and tied to no instance. + +## Acceptance criteria + +- `` schema-discovery gains the detect → parse/scale → verify guidance — + inline, dialect-agnostic, with at most one short generic example. +- No benchmark-derived content. Skill-content only; content tests updated. 
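
The detect → parse/scale → verify guidance can be sketched on the synthetic `metrics(label, value_text)` table mentioned under Leak-safety. The values are made up, SQLite via Python's standard `sqlite3` module is an assumed test bed, and the parser is deliberately narrow: it handles only the encodings present in this sample (negative numbers, for instance, would be flagged as unparsed rather than converted).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE metrics (label TEXT, value_text TEXT);
INSERT INTO metrics VALUES ('a', '1.5K'), ('b', '3M'), ('c', '-'), ('d', '$1,200');
""")

# Naive cast: SQLite keeps the longest numeric prefix and falls back to 0.0.
naive_sum = conn.execute("SELECT SUM(CAST(value_text AS REAL)) FROM metrics").fetchone()[0]

PARSED_CTE = """
WITH cleaned AS (          -- strip currency symbols and thousands separators
  SELECT label, REPLACE(REPLACE(TRIM(value_text), '$', ''), ',', '') AS t
  FROM metrics
),
split AS (                 -- separate the numeric body from a K/M/B suffix
  SELECT label, t,
         CASE WHEN UPPER(substr(t, -1)) IN ('K', 'M', 'B')
              THEN substr(t, 1, length(t) - 1) ELSE t END AS body,
         CASE UPPER(substr(t, -1))
              WHEN 'K' THEN 1e3 WHEN 'M' THEN 1e6 WHEN 'B' THEN 1e9 ELSE 1 END AS scale
  FROM cleaned
),
parsed AS (                -- sentinels become 0; anything non-numeric becomes NULL
  SELECT label,
         CASE WHEN t IN ('-', 'N/A', '') THEN 0.0
              WHEN body = '' OR body GLOB '*[^0-9.]*' THEN NULL
              ELSE CAST(body AS REAL) * scale END AS value_num
  FROM split
)
"""

parsed_rows = conn.execute(
    PARSED_CTE + "SELECT label, value_num FROM parsed ORDER BY label"
).fetchall()
parsed_sum = conn.execute(PARSED_CTE + "SELECT SUM(value_num) FROM parsed").fetchone()[0]

# Coverage check: any NULL here means the sample missed an encoding.
unparsed = conn.execute(
    PARSED_CTE + "SELECT COUNT(*) FROM parsed WHERE value_num IS NULL"
).fetchone()[0]

print(naive_sum, parsed_sum, unparsed)
```

Mapping the `-` sentinel to `0` rather than `NULL` is a per-question decision, as requirement 2 notes; the sketch picks `0` only to keep the sum easy to read.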
+ +## Benchmark context (motivation only) + +At least one SQLite-subset question stores trading volume as suffix-encoded text +("K"/"M", "-" for zero) and fails because the agent aggregates the raw strings. The +fix — parse messy encodings before math — is universal data hygiene that helps any +analyst, so it belongs in the product's craft rather than a benchmark-specific +prompt. diff --git a/spider2-specs/done/14-output-completeness-final-check.md b/spider2-specs/done/14-output-completeness-final-check.md new file mode 100644 index 00000000..49445e18 --- /dev/null +++ b/spider2-specs/done/14-output-completeness-final-check.md @@ -0,0 +1,105 @@ +# Enforce answer-output completeness with a final pre-emit check in the analytics skill + +## Problem + +The single largest correctness failure mode is **incomplete output**: the query runs and the +methodology is roughly right, but the result is missing columns the question asked for. Three +recurring sub-patterns: + +1. **Multi-part questions answered partially.** A question that asks for several things ("report + the highest *and* the lowest month, each with its count and average, *and* the difference") + comes back with only the first part — one column instead of the several requested. +2. **Identity dropped.** Grouping by a human-readable name but not projecting the entity's + identifier (e.g. a product name without its product id, a customer name without its + customer id). +3. **Inputs to a derived value dropped.** Returning a ratio / percentage / difference but not + the underlying counts the question also asked for. + +Sub-patterns 2 and 3 are **already covered by `` rules** in the analytics skill +(spec 07: *"expose identity, not just the label"* and *"keep the inputs to a derived value"*), +yet they are frequently **not applied**. So the gap is not missing knowledge — it is that these +rules are passive heuristics buried in a list, and the agent doesn't reliably check them before +finalizing. 
The fix is to (a) add the missing multi-part-completeness rule and (b) turn +output-completeness into an **explicit final verification step** the agent performs before +emitting SQL. + +This is reinforced by evidence that the failure is **model-independent**: a markedly stronger +model produced the same incomplete-output mistakes on these questions, which means it is a +craft/enforcement gap, not a capability gap. + +## Generic use case (independent of any benchmark) + +An analyst is asked: *"For each region, report the highest and the lowest monthly order count, +and the difference between them."* A complete, useful answer has a column for the region's id +and name, the highest count, the lowest count, and the difference — five columns. Returning just +the region and a single number answers only part of the request. This is a universal expectation +on any database: answer **every** part of a multi-part request, identify the entities, and show +the inputs behind any derived figure. + +## Requirements + +Additive to the analytics skill's `` "Answer completeness / interpretation" group and +its workflow's validate step (inline, dialect-agnostic, heuristic + why, consistent with spec 07). + +1. **Multi-part / multi-output completeness (new rule).** When a question requests several + outputs — a list ("A, B, and C"), paired extremes ("the highest *and* the lowest"), or a + value plus its components ("X, Y, and their ratio") — the final projection must contain a + column for **each** requested output. *Why:* answering only the first clause is the most common + way a runnable query is still wrong; the grain and methodology can be perfect yet the answer + is short by columns. + +2. 
**Fold the existing identity / inputs rules into the same completeness notion.** The + already-shipped rules — project the entity **identifier** alongside any human-readable label, + and **keep the inputs** to any derived value — are part of output completeness; reference them + from the check below so they are actually applied, not just listed. + +3. **Add an explicit final completeness check (the enforcement mechanism).** Before emitting the + final SQL, the skill should have the agent **re-read the question and confirm the projection + covers**: every named metric/attribute; the identifier of every grouped/named entity; every + input to a derived value; all at the grain the question specifies. This is a short, concrete + checkpoint at the validate step — the point is to convert the passive heuristics into an active + pre-finalize verification. (Do **not** add unrequested/extra columns to be "safe" — that is + grader-gaming; the check is about matching the request exactly, not padding it.) + + Generic teaching example (synthetic schema — see Leak-safety): + ```sql + -- "For each region, report the highest and lowest monthly order count and their difference." + -- WRONG: answers only the first clause; no region id, no lowest, no difference. + SELECT region_name, MAX(monthly_orders) AS highest + FROM region_monthly GROUP BY region_name; + + -- RIGHT: one column per requested output + the entity's identity, at the region grain. 
+ SELECT r.region_id, r.region_name, + MAX(m.monthly_orders) AS highest_monthly_orders, + MIN(m.monthly_orders) AS lowest_monthly_orders, + MAX(m.monthly_orders) - MIN(m.monthly_orders) AS difference + FROM regions r + JOIN region_monthly m ON m.region_id = r.region_id + GROUP BY r.region_id, r.region_name; + ``` + +## Leak-safety (hard constraint) + +The example must use an **invented, generic schema** (`regions`, `region_monthly`) and made-up +columns — **no benchmark table names, SQL, or result values.** It teaches the *pattern* (cover +every requested output + identity + inputs), which is universal and tied to no specific instance. + +## Acceptance criteria + +- The skill states the multi-part-completeness rule and a concrete **final completeness check** + (re-read question → verify metrics + identity + inputs + grain), inline and dialect-agnostic, + cross-referencing the existing identity/inputs rules so they're enforced. +- Includes the over-projection guard (don't pad with extra columns — that's grader-gaming). +- One short generic example (wrong vs complete); no benchmark-derived content. +- Skill-content only; analytics-skill content tests updated to cover the new rule + check. + +## Benchmark context (motivation only) + +In the latest SQLite-subset run, **incomplete output was the single largest failure bucket +(~13 of 51 voted failures)**: multi-part questions answered partially, and identity / derived-value +inputs dropped — the latter two being spec-07 rules that already exist but weren't applied. A +probe with a much stronger model reproduced the *same* incomplete-output failures, confirming this +is a craft-enforcement gap rather than a model-capability one. The fix — answer every requested +part, identify entities, keep inputs — is universal analyst craft, so it belongs in the product +skill (and transfers to real users), enforced as a final check rather than left as a passive hint. 
+``` diff --git a/spider2-specs/done/15-mcp-server-structured-logging.md b/spider2-specs/done/15-mcp-server-structured-logging.md new file mode 100644 index 00000000..294c986c --- /dev/null +++ b/spider2-specs/done/15-mcp-server-structured-logging.md @@ -0,0 +1,116 @@ +# Structured, leveled logging for the ktx MCP server + +> **Scope: observability only.** This spec is about *seeing* what the MCP server +> does (which tool, what params, when, how long, outcome). *Preventing* a runaway +> query from blocking the server (off-event-loop / interruptible query execution) +> is a separate concern — see "Non-goals" and the sibling spec note below. + +## Problem + +The ktx MCP server (`packages/cli/src/mcp-http-server.ts` + +`mcp-server-factory.ts`; raw `node:http` + `@modelcontextprotocol/sdk` +`StreamableHTTPServerTransport`) emits almost no operational logs. There is no +server-side record of **which MCP tool was called, with what parameters, when, +how long it took, or whether it succeeded** — nor of session open/close or +transport errors. When a tool call is slow, hangs, or a client connection drops +("Transport channel closed"), an operator has no trail to diagnose it and must +resort to process sampling / `lsof` / guesswork — and the offending input +(e.g. the exact SQL) is typically unrecoverable. + +## Generic use case + +Anyone running a long-lived ktx MCP server — a developer's local instance, a +shared team server, or a hosted deployment — needs observability into tool-call +activity to: +- diagnose slow or hung tool calls (which `sql_execution` ran, against which + connection, with what SQL, for how long); +- explain client-visible connection failures from the server side (session + lifecycle, transport-closed events); +- audit what agents asked the server to do; +- spot patterns (hot tools, slow connections, error rates). + +This is standard production-server hygiene; the server currently provides none. 
+ +## Requirements (sketch — refine when picked up) + +1. **One structured (JSON) logger, low overhead.** Suggested `pino` (orientation + only; implementer owns the choice). A single shared instance; write **JSON to + stdout** (12-factor — the launcher/aggregator routes it). No in-app file + rotation. Optional human-readable pretty output only when attached to a TTY + (dev). +2. **Configurable level via env** (e.g. `KTX_LOG_LEVEL`, default `info`; `debug` + for diagnosis) — verbose logging on demand without code changes. +3. **Per-session / per-call context** via child loggers: every line carries a + `sessionId` (from the transport session) and, for tool calls, a `callId` + + `tool` name, so one session's or call's activity can be traced/grepped. +4. **Tool-call logging — START logged BEFORE execution, COMPLETION after.** For + every MCP tool invocation: + - on entry: log `{ tool, params, sessionId, callId }` **before** running the + handler (so the record exists even if the handler never returns); + - on exit: log `durationMs` + outcome (ok with result size, or error with + stack). + This makes a **hung / never-returning call identifiable**: a start with no + matching completion is the culprit, with its exact parameters and timestamp. + This matters specifically because handlers like `sql_execution` run a + *synchronous* better-sqlite3 query — a runaway query blocks the process and no + completion is ever logged, so the start line (flushed before the blocking + call) is the only record. For `sql_execution`, `params` should include the SQL + text (the most useful field). Emit a **WARN** when a *completed* call exceeds a + configurable slow threshold (e.g. `KTX_SLOW_TOOL_MS`). +5. **Connection / session lifecycle:** log session open/close (with `sessionId`) + and transport errors (the SDK's closed-channel / "Transport channel closed" + events) so client-side connection failures have a server-side counterpart. +6. 
**Error logging** with structured stack traces (a standard error serializer), + not bare strings. +7. **Light redaction — credentials only** (bearer token, connection + passwords/secrets). SQL text and tool params are *not* secrets and must be + logged. Do not over-redact. +8. **Synchronous logging is fine.** The server uses a synchronous DB client, so + logging need not be async; prefer the simpler synchronous stdout path over + async/worker transports (which can lose buffered lines on a hard crash). Do + not introduce async-logging machinery. + +## Acceptance criteria (sketch) + +- With `KTX_LOG_LEVEL=debug`, invoking any MCP tool produces a `tool.start` + (tool, params, sessionId, callId) and a `tool.end` (durationMs, outcome) line + on the server's stdout, as JSON. +- A tool call that never returns (e.g. a runaway `sql_execution`) leaves a + `tool.start` line carrying its **exact SQL and timestamp** and **no** + `tool.end` — so the offending query is recoverable from the log alone, with no + process sampling. +- A completed tool call slower than the configured threshold emits a WARN with + its duration. +- Session open/close and transport-closed events are logged with the `sessionId`. +- At default level (`info`), routine per-tool lines are suppressed but lifecycle, + slow-call warnings, and errors are present. +- Credentials (bearer token, connection secrets) never appear in logs; SQL and + tool params do. +- No new heavy dependencies beyond the logger; no OpenTelemetry/metrics stack; no + async-transport machinery. + +## Non-goals + +- **Preventing/interrupting runaway queries** (off-event-loop execution, query + timeouts, worker-thread isolation). That is a *separate* spec; a single + synchronous query that fans out into a massive nested-loop join can peg the + single-threaded server for hours and break new connections — observability + surfaces *which* query, but the fix is execution-model work. 
(This logging is + also a prerequisite for a future watchdog that detects a `tool.start` with no + `tool.end` past a threshold and recycles the server.) +- Metrics/tracing/OpenTelemetry exporters. +- Forwarding logs to the MCP *client* via the protocol's logging capability + (`notifications/message`, `logging/setLevel`) — a possible later enhancement, + distinct from operational stdout logging. + +## Benchmark context (motivation, not a requirement) + +Running Spider 2.0-Lite against the MCP server at concurrency, an +adversarial-reviewer-generated query degenerated into a massive nested-loop join; +synchronous better-sqlite3 executed it on the event loop, pegging a server at +~100% CPU for hours and breaking new MCP connections to it ("Transport channel +closed"). We could not determine *which* query, because the server logs nothing +about tool calls — diagnosis required `sample`/`lsof` on the live process and the +exact SQL was never recovered. Structured tool-call logging (especially +start-before-execute) would have turned this into a one-line `grep` of the server +log. diff --git a/spider2-specs/done/16-bounded-query-execution-timeout.md b/spider2-specs/done/16-bounded-query-execution-timeout.md new file mode 100644 index 00000000..5ecd43d3 --- /dev/null +++ b/spider2-specs/done/16-bounded-query-execution-timeout.md @@ -0,0 +1,131 @@ +# Bounded query execution (deadline + non-blocking) for read SQL + +> Priority: HIGH. Found empirically during a Spider2-lite sqlite run +> (2026-06-18): a single `sql_execution` MCP call wedged a worker at 100% CPU +> for 13+ minutes and never returned. The query +> `SELECT MIN(time_id), MAX(time_id), COUNT(*) FROM profits` on the +> `complex_oracle` sqlite database hit a VIEW (`costs ⋈ sales`, 918,843 × 82,112 +> rows, joined on a 4-column key with no composite index) whose plan degraded to +> an O(N×M) nested-loop scan. 
Because the sqlite connector runs +> `better_sqlite3 .all()` **synchronously with no timeout**, it blocked the MCP +> worker's entire event loop: no `tool.end` was ever logged, the port went +> unresponsive, and the query could not be cancelled. One of four eval shards +> stalled until the worker was killed by hand. + +## Problem + +Two compounding gaps on the read-query path: + +1. **No execution deadline.** A single expensive query runs unbounded. This is + handled divergently per connector, with no shared contract: BigQuery has a + real server-side job timeout (`job_timeout_ms`); ClickHouse has an HTTP + `request_timeout`; Snowflake, Postgres, MySQL, and SQL Server bound only + connection/pool *acquisition*, not statement *execution*; SQLite has nothing. + So whether a runaway query is bounded depends entirely on which driver the + caller happened to hit. + +2. **In-process engines block the event loop and can't be cancelled.** The + sqlite connector executes on the main thread via synchronous + `better_sqlite3 .all()`. A slow query freezes the whole MCP server (it can't + serve other requests, send progress, or write `tool.end`), and there is no + way to interrupt it: better-sqlite3 exposes no interrupt/cancel API — its + documented mechanism for slow queries is to run them in a **worker thread**, + and the only way to stop a runaway synchronous query is to terminate the + thread executing it. + +The net effect is a query that produces a `tool.start` with no matching +`tool.end`, an unresponsive server, and no self-recovery. A row cap (`maxRows`) +does not help — it bounds returned rows, not scan work, and the failing query +returned a single aggregate row. + +## Generic use case + +Any data agent that lets an LLM author SQL will eventually issue an +accidentally-expensive query — an unindexed or cartesian join, an expensive +VIEW, a wide aggregate over a large fact table. 
A general-purpose context layer +must bound that and return a clean, fast "query exceeded Ns" error so the agent +can revise (add filters, query base tables, narrow the range) instead of hanging +the tool and the server. This matters for embedded/local warehouses (sqlite, +duckdb) and remote ones alike, and is wholly independent of any benchmark. + +## Requirements + +1. Every read-query execution path (`executeReadOnly`) enforces a single + canonical execution deadline. One opinionated default; **not** a per-call + user flag. Where a driver already supports a per-connection timeout + (BigQuery `job_timeout_ms`), reuse that as the per-connection override rather + than inventing a parallel knob. +2. On exceeding the deadline the path resolves with a `KtxQueryError` + ("query exceeded {N}s") — a finite, decision-reaching outcome, never an + unbounded hang. +3. The deadline is a **shared contract at the connector boundary**, defined once + (on the `executeReadOnly` contract or a shared wrapper at the call site) so + all drivers participate. Bring the existing divergent timeouts (BigQuery job + timeout, ClickHouse request timeout) under this one contract instead of + leaving parallel mechanisms. +4. For in-process engines (sqlite today, any future embedded driver), execution + MUST NOT block the MCP server event loop. Run the query off the main thread + and enforce the deadline by terminating that thread on timeout (the + better-sqlite3-documented approach, since synchronous queries are + uncancellable in-thread). The event loop must stay responsive so `tool.end` + is always written and concurrent requests on the same port are served. +5. Prefer real cancellation over client-side give-up. 
Where the engine supports + a server-side statement timeout (Postgres `statement_timeout`, MySQL + `max_execution_time`, Snowflake `STATEMENT_TIMEOUT_IN_SECONDS`, ClickHouse + `max_execution_time`, BigQuery job timeout, SQL Server request timeout), set + it so the deadline actually stops work, not merely abandons the promise while + the query keeps running. For in-process engines, thread termination is the + cancellation. +6. The MCP `sql_execution` tool surfaces the timeout as an expected error + (classified as `KtxQueryError`, not a `$exception` fault, consistent with + existing expected-error classification) and logs a `tool.end` with the error + outcome. +7. Read-only enforcement (`assertReadOnlySql`) and the `maxRows` row cap remain + unchanged. The deadline is additive; `maxRows` is not a substitute for it. + +## Acceptance criteria + +- A read query that exceeds the deadline returns a `KtxQueryError` within + roughly the deadline; the MCP worker stays responsive (a concurrent tool call + on the same server completes while the slow query is still pending) and writes + a matching `tool.end` with a non-ok outcome. +- sqlite specifically: executing a deliberately pathological query (e.g. an + expensive VIEW or an unindexed cross join) on a fixture does not block the + event loop, is terminated at the deadline, and CPU returns to idle afterward + (the off-main-thread executor is killed, not left spinning). +- No regression: normal fast queries return identical results; read-only + rejection still works; `maxRows` still bounds returned rows. +- Tests cover the deadline path for at least the in-process driver (sqlite, + terminate-on-deadline) and one server-side-timeout driver. + +## Benchmark context (motivation only) + +The Spider2-lite local set loads several warehouses into sqlite, some with +expensive VIEWs over large fact tables — e.g. 
`complex_oracle.profits` = +`costs ⋈ sales` on `(prod_id, time_id, channel_id, promo_id)`, 918,843 × 82,112 +rows, no composite index, with `promo_id` (the index the optimizer picks) being +95.5% a single value. LLM-authored profiling queries (MIN/MAX/COUNT over such a +view) trigger O(N×M) nested-loop scans. Without a deadline these hang an eval +shard for 10+ minutes; with one, the agent gets a fast error and can scope the +query instead. + +## Orientation hints (code pointers; may have drifted) + +- Shared contract: `packages/cli/src/context/scan/types.ts` — + `KtxScanConnector.executeReadOnly` (~343), `KtxReadOnlyQueryInput` (~285). +- MCP call site: `packages/cli/src/context/mcp/local-project-ports.ts:70` + (`connector.executeReadOnly`); tool registration in + `packages/cli/src/context/mcp/context-tools.ts`. +- In-process sync execution (the acute hang): + `packages/cli/src/connectors/sqlite/connector.ts:311-313` + (`better_sqlite3 .prepare().all()`). +- Existing divergent timeouts to unify: `connectors/bigquery/connector.ts` + (`job_timeout_ms` / `jobTimeoutMs`), `connectors/clickhouse/connector.ts:602` + (`request_timeout`), `connectors/snowflake/connector.ts:342` (test/pool only), + `connectors/postgres/connector.ts`, `connectors/mysql/connector.ts`, + `connectors/sqlserver/connector.ts` (pool/connection only). +- Error class: `packages/cli/src/errors.ts:25` (`KtxQueryError`). +- better-sqlite3 (context7 `/wiselibs/better-sqlite3`, v12.x): no + interrupt/cancel API; `docs/threads.md` documents the worker-thread pattern + for slow queries (master owns worker lifecycle and respawns on exit) — extend + it with terminate-on-deadline to enforce the timeout. 
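
The worker-thread pattern with terminate-on-deadline from requirement 4 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the connector's implementation: `runWithDeadline` and `QueryDeadlineError` are hypothetical names (the latter only stands in for `KtxQueryError`), and the worker body is plain JavaScript passed as a string rather than a real better-sqlite3 query. Whether `terminate()` promptly interrupts a call blocked inside native code is not demonstrated here and would need checking against the actual driver.

```typescript
import { Worker } from "node:worker_threads";

// Hypothetical stand-in for the spec's KtxQueryError ("query exceeded {N}s").
class QueryDeadlineError extends Error {
  deadlineMs: number;
  constructor(deadlineMs: number) {
    super(`query exceeded ${deadlineMs / 1000}s`);
    this.name = "QueryDeadlineError";
    this.deadlineMs = deadlineMs;
  }
}

// Evaluates `workerSource` (plain JavaScript) in a worker thread and resolves
// with the first message it posts. If no message arrives within `deadlineMs`,
// the worker is terminated and the promise rejects, so the main event loop is
// never blocked by the synchronous work.
function runWithDeadline<T>(
  workerSource: string,
  workerData: unknown,
  deadlineMs: number,
): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const worker = new Worker(workerSource, { eval: true, workerData });
    let settled = false;

    // Settle exactly once: stop the timer, stop the worker, then resolve/reject.
    const settle = (finish: () => void): void => {
      if (settled) return;
      settled = true;
      clearTimeout(timer);
      void worker.terminate();
      finish();
    };

    const timer = setTimeout(
      () => settle(() => reject(new QueryDeadlineError(deadlineMs))),
      deadlineMs,
    );

    worker.once("message", (value: T) => settle(() => resolve(value)));
    worker.once("error", (err: unknown) => settle(() => reject(err)));
    worker.once("exit", (code: number) =>
      settle(() => reject(new Error(`worker exited with code ${code} before replying`))),
    );
  });
}
```

In this sketch the caller keeps ownership of the worker's lifecycle, so a rejected call leaves no thread spinning. A real executor would presumably also write the `tool.end` line in both outcomes and spawn a fresh worker for the next query.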
diff --git a/spider2-specs/done/18-bigquery-cross-project-datasets.md b/spider2-specs/done/18-bigquery-cross-project-datasets.md new file mode 100644 index 00000000..e83c74d8 --- /dev/null +++ b/spider2-specs/done/18-bigquery-cross-project-datasets.md @@ -0,0 +1,68 @@ +# 18 — BigQuery cross-project dataset support (introspect foreign-hosted datasets, bill in own project) + +**Status:** intake draft (todo). Requirement-level; the implementer refines into `specs/18-…`. + +## Problem (generic, real-world) + +Analysts routinely query datasets that live in a **different** BigQuery project than the one +they bill jobs to — Google's `bigquery-public-data`, a partner's shared project, an +organization's central data project, etc. To make those connectable in ktx (so `discover_data`, +the semantic layer, dictionary sampling, and `sql_dialect_notes` work), ktx must be able to +**introspect a dataset hosted in a foreign project while running/billing jobs in the +credentials' own project**. + +Today it can't. ktx's BigQuery connector derives a single `projectId` from +`credentials.project_id` and uses it for **both** job billing **and** schema introspection: + +- `connectors/bigquery/connector.ts:294` — `projectId` is read only from `credentials.project_id`; + there is no separate billing-vs-dataset project knob. +- `:544` (`introspectDataset`) — calls `this.getClient().dataset(datasetId)`, which resolves the + dataset **in the client's (billing) project**, and labels every table `catalog: this.resolved.projectId`. +- `:453` (`listTables`) — queries `\`${projectId}\`.\`region-…\`.INFORMATION_SCHEMA.TABLES`, i.e. the + **billing** project's INFORMATION_SCHEMA. +- `:163` (`datasetIds()`) — returns `dataset_ids` verbatim; it never parses a `project.` prefix. + +So a `dataset_id` naming a dataset in another project can't be introspected, even though querying +it works fine (cross-project reads bill to the caller's project — that path already works). 
+ +### Empirical confirmation +With a service account in project `ktx-spider2-lite`: +- ktx's call pattern `client.dataset("austin_311")` → **`404 NotFound`** (looks in + `projects/ktx-spider2-lite/datasets/austin_311`). +- The cross-project form `DatasetReference("bigquery-public-data","austin_311")` → **succeeds** + (lists the public tables; public metadata is readable by any authenticated principal). +- There is **no config knob** to separate the introspection project from the billing project. + +## Requirement + +The BigQuery connector must accept **fully-qualified `project.dataset` entries** in `dataset_ids` +(a single connection may span more than one source project), and for each: +- **introspect** via the *dataset's* project — `client.dataset(id, { projectId })` / + `DatasetReference(project, dataset)`, query the **dataset project's** `INFORMATION_SCHEMA`, and + label the table `catalog` with the dataset's project; +- **run jobs / bill** in `credentials.project_id` (unchanged). + +A bare `dataset` (no `project.`) keeps today's behavior (resolve in `credentials.project_id`), so +existing single-project connections are unaffected. + +## Acceptance + +- `dataset_ids: ['bigquery-public-data.austin_311']` (credentials in a *different* project) → + `ktx ingest ` introspects the tables, enriches, and samples values; `discover_data` / + `dictionary_search` return them. +- A connection mixing `['bigquery-public-data.x', 'other-project.y']` introspects both. +- `sql_execution` of a fully-qualified `project.dataset.table` query still runs and bills in + `credentials.project_id`. +- Single-project `dataset_ids: ['my_dataset']` behaves exactly as before (no regression). 
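
The resolution rule in the requirement above (a bare `dataset` stays in the billing project, a `project.dataset` entry is introspected in its own project) can be sketched as a small pure function. This is an illustrative sketch only: `DatasetRef`, `resolveDatasetRef`, and `groupByProject` are hypothetical names, not the connector's actual API, and splitting on the last dot assumes that dataset IDs themselves never contain a dot.

```typescript
// Hypothetical shape for illustration; not the connector's real types.
interface DatasetRef {
  projectId: string; // project whose metadata is introspected
  datasetId: string;
}

// Resolves one `dataset_ids` entry. A bare "dataset" falls back to the billing
// project (today's behavior); "project.dataset" names the dataset's own project.
function resolveDatasetRef(entry: string, billingProjectId: string): DatasetRef {
  const trimmed = entry.trim();
  const invalid = (): Error =>
    new Error(`invalid dataset_ids entry "${entry}" (expected "dataset" or "project.dataset")`);

  // Split on the LAST dot (assumption: dataset IDs never contain one).
  const dot = trimmed.lastIndexOf(".");
  if (dot === -1) {
    if (trimmed === "") throw invalid();
    return { projectId: billingProjectId, datasetId: trimmed };
  }
  const projectId = trimmed.slice(0, dot);
  const datasetId = trimmed.slice(dot + 1);
  if (projectId === "" || datasetId === "") throw invalid();
  return { projectId, datasetId };
}

// Groups entries by source project, so that each project's INFORMATION_SCHEMA
// could be queried once for all of its datasets.
function groupByProject(entries: string[], billingProjectId: string): Map<string, string[]> {
  const byProject = new Map<string, string[]>();
  for (const entry of entries) {
    const ref = resolveDatasetRef(entry, billingProjectId);
    const datasets = byProject.get(ref.projectId) ?? [];
    datasets.push(ref.datasetId);
    byProject.set(ref.projectId, datasets);
  }
  return byProject;
}
```

Jobs would still run and bill in `credentials.project_id`; only introspection and the table `catalog` label would use the resolved `projectId`.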
+ +## Benchmark context (motivation only — do not encode benchmark specifics) + +Spider 2.0-Lite's **BigQuery slice (205 questions)** is otherwise **unservable faithfully**: every +one of its ~74 logical databases groups datasets hosted in foreign public projects +(`bigquery-public-data`, `isb-cgc-bq`, `data-to-insights`, …), never in a project we own. Query +execution already works cross-project (proven), but ktx-only *discovery* (the whole point of the +faithful surface) is blocked because the connector can't introspect them. Scope is small: of 74 +BQ dbs only **1** spans more than one source project, so "let `dataset_ids` carry `project.dataset` +and introspect each in its own project" covers the benchmark and the general case alike. This is +the sole blocker for the BigQuery leaderboard slice (the Snowflake slice needed no connector +change and is already baselined). diff --git a/spider2-specs/done/19-durable-bounded-relationship-detection.md b/spider2-specs/done/19-durable-bounded-relationship-detection.md new file mode 100644 index 00000000..3435d2a7 --- /dev/null +++ b/spider2-specs/done/19-durable-bounded-relationship-detection.md @@ -0,0 +1,89 @@ +# 19 — Durable, resumable, bounded relationship detection during ingest enrichment + +**Status:** intake draft (todo). Requirement-level; the implementer refines into `specs/19-…`. + +## Problem (generic, real-world) + +Ingest enrichment runs three stages in a fixed order inside `runLocalScanEnrichment` +(`packages/cli/src/context/scan/local-enrichment.ts`): + +1. `descriptions` (`:530`) — per-table LLM descriptions (the expensive step: one model call per + table; on a large schema this is minutes of paid LLM work). +2. `embeddings` (`:559`) — column embeddings. +3. `relationships` (`:593`) — FK/join discovery: profiles a row sample of **every** table, then + validates candidate joins. 
+ +The queryable semantic-layer artifacts are persisted **once, at the very end**, by +`writeLocalScanEnrichmentArtifacts` in `local-scan.ts:510` — which runs **after** +`runLocalScanEnrichment` returns, i.e. after all three stages. + +This creates three failure modes that compound on large schemas (hundreds of tables): + +1. **Enrichment is lost if relationship detection is interrupted.** The descriptions + embeddings + are computed and held in memory, but they only reach the durable, queryable artifacts when the + final write runs after the `relationships` stage. If the process is killed/crashes/times out + **during** relationship detection (the last, slowest, silent stage), the artifacts are never + written — the schema survives (it was written earlier at `local-scan.ts:473`) but **all the + paid LLM enrichment is discarded**. Empirically: ingesting a 95-table BigQuery dataset produced + full descriptions + embeddings (progress reached "Building embeddings 17/17"), then the + relationships stage ran silently past a supervising deadline and was killed — the persisted + `_schema` had **0** AI descriptions, only the native column comments. Every larger dataset hits + this, so the most expensive work is the most likely to be thrown away. + +2. **Re-running does not resume — it re-spends.** There is a stage state store + (`SqliteLocalScanEnrichmentStateStore`) and a `runEnrichmentStage` helper (`:413`) that saves + each completed stage's output. But the completed-stage lookup keys on **`runId`** + (`findCompletedStage({ runId, stage, inputHash })`, `:427`), and `runId` is fresh per ingest + invocation. So resume only works *within* a single run; re-running an interrupted ingest gets a + new `runId`, misses the cache, and **re-computes descriptions + embeddings from scratch** + (re-paying for the LLM work that already succeeded). + +3. 
**Relationship detection is unobservable and unbounded.** The stage emits no progress between + "Detecting relationships" and the final "Relationship detection found N accepted" — minutes of + silence on a large schema. A supervisor watching for liveness cannot distinguish a + slow-but-working profile from a true hang, and there is no internal time/work budget, so on a very large + schema it can run far longer than any reasonable deadline. + +## Requirements + +1. **Checkpoint queryable artifacts before relationship detection.** Persist the descriptions + + embeddings into the semantic-layer artifacts as soon as the `embeddings` stage completes, before + the `relationships` stage runs. Relationship detection then appends/merges its own artifact on + completion. Net: the expensive LLM + embedding enrichment is **always durable and queryable**, + even if relationship detection fails, is interrupted, or is skipped. (A failed/partial + relationship stage should degrade to "no/partial joins", never to "no descriptions".) + +2. **Make stage resume work across runs.** Resolve a completed stage by stable content identity + — `(connectionId, stage, inputHash)` — independent of `runId`, so re-running an interrupted + ingest resumes the finished `descriptions`/`embeddings` stages from cache and only re-runs what + actually failed (e.g. `relationships`). Re-running after an interruption must not re-spend LLM + credits on stages that already succeeded. + +3. **Make relationship detection observable and bounded** (mirrors spec 16's bounded query + execution). Emit progress through the existing progress port — e.g. "Profiling table K/N", + "Validating candidate K/M" — so liveness is visible. Enforce an overall time/work budget + (configurable, e.g. under `scan.relationships`) so on a very large schema the stage stops + gracefully and returns the relationships found so far (partial) rather than running unboundedly. 
+ Partial completion is persisted (per requirement 1) and marked as such. + +## Acceptance + +- Interrupting an ingest **during** relationship detection still leaves a queryable semantic layer + with the table/column descriptions + embeddings that were generated (verified: re-open the + connection, descriptions are present). +- Re-running an interrupted ingest **does not** regenerate descriptions/embeddings whose stage + already completed (verified: no LLM description calls for the cached tables; only the failed + stage re-runs). +- A connection with hundreds of tables emits relationship-stage progress and completes within the + configured budget, persisting partial relationships if the budget is hit — without discarding + enrichment. +- Small/single-run ingests behave exactly as before (no regression in artifacts or relationship + output when nothing is interrupted). + +## Benchmark context (motivation only — do not encode benchmark specifics) + +The Spider 2.0-Lite BigQuery slice has datasets with hundreds–thousands of tables (`ebi_chembl` +785, `fec` 486, `ga360` 366, …). Enriching them with claude-code costs real, rate-limited LLM +budget; losing that enrichment to a relationship-stage interruption — and re-spending it on every +retry — makes large-schema ingest impractical. This is a general durability/cost property of the +ingest pipeline, independent of the benchmark; the benchmark only made it acute at scale. diff --git a/spider2-specs/done/20-resilient-enrichment-under-slow-llm.md b/spider2-specs/done/20-resilient-enrichment-under-slow-llm.md new file mode 100644 index 00000000..ab1e176e --- /dev/null +++ b/spider2-specs/done/20-resilient-enrichment-under-slow-llm.md @@ -0,0 +1,101 @@ +# 20 — Resilient enrichment under a slow/hung LLM backend + +**Status:** draft (intake). Requirement-level; the implementer refines into `specs/20-*.md`. 
+ +This is the **enrichment-stage** analog of two already-shipped specs: +- spec 16 (bounded query execution) — bound *and actually cancel* a runaway read query (child-thread/process kill, not a cosmetic JS deadline); +- spec 19 (durable/bounded relationship detection) — checkpoint expensive ingest work so an interruption doesn't lose it. + +Spec 16 hardened the **read-query** path and spec 19 checkpointed at **stage boundaries**. The same two +weaknesses still exist *inside the descriptions enrichment stage*, and together they turned a single hung +table into an indefinite wedge plus total loss of an entire stage's LLM work. + +## Problem / requirement + +Two compounding gaps on the per-table description-enrichment path, observed end-to-end: + +### 1. The per-table LLM timeout does not actually terminate the work + +The per-table `generateObject` enrichment call is wrapped in `retryAsync` with a fresh +`AbortSignal.timeout(KTX_ENRICH_LLM_TIMEOUT_MS)` per attempt (ktx commit `01f63380`). When the LLM +backend is a **subprocess** (the `codex` backend spawns a child `codex` process; `claude-code` likewise +spawns a child) and that child **hangs with an open connection to the provider** (TCP ESTABLISHED, ~0% +CPU, no bytes flowing), the JS-level `AbortSignal` fires but **does not kill the child process or unblock +the await** — so the call sits *past* its own timeout indefinitely. + +Observed (BigQuery ingest, codex backend, 2026-06-23): with `KTX_ENRICH_LLM_TIMEOUT_MS=1800000` (30 min), +two of `covid19_usa`'s widest tables (252 columns) hung; the stage sat at **268/285 for 41+ minutes** — +well past the 30-min per-attempt timeout — with exactly two codex children, each holding 3 ESTABLISHED +connections at ~0% CPU, until killed by hand. The timeout was cosmetic: it never terminated the hung +child. 
(This is precisely the failure mode spec 16 fixed for SQL — a deadline that fires in JS but cannot +interrupt the underlying work — applied to the enrichment LLM call instead of the query.) + +**Requirement:** the per-table enrichment-call timeout must be **enforced**, not advisory — when it fires, +the in-flight work is actually cancelled (subprocess SIGKILL for process-backed providers; request abort +for HTTP-backed ones) and the call returns/throws *promptly* so the stage can proceed (skip the table per +the existing no-retry-on-timeout policy). A hung table must cost at most ~one timeout, never unbounded +wall-clock. Provider-agnostic: it must hold for `codex`, `claude-code`, and HTTP backends alike. + +### 2. Descriptions are checkpointed only at full-stage completion, so a few bad tables lose all the good ones + +Spec 19 persists the descriptions checkpoint **after the descriptions stage completes** (before +relationships). There is no *within-stage* persistence: while the stage runs, every enriched table's +description lives only in memory. So if the stage cannot complete — e.g. 2 tables out of 285 hang (gap #1), +or the process is killed, or it hits the stall watchdog — **all** the already-enriched tables are lost, +even though their (expensive) LLM descriptions were finished. + +Observed (same run): `covid19_usa` had **283/285** tables enriched in memory but **0** rows in +`local_scan_enrichment_stages` and **0** `ai:` descriptions on disk; killing the wedged ingest discarded +all 283, forcing a from-scratch re-ingest. The cost of 2 pathological tables was 283 tables' worth of +redone LLM calls. 
+ +**Sharper observation (re-ingest with a short, enforced timeout):** even when the stage *does* run to +the end — the 2 hung tables hit a 4-min timeout and were skipped, so 283/285 descriptions were generated +and the ingest reported success (`Scan completed` / `Ingest finished`, embeddings built, exit 0) — the +descriptions were **still persisted as 0** (0 `ai:` on disk, 0 stage rows). So the discard is **not** just +"lost on kill": a stage that completes with *any* skipped/aborted table currently persists **nothing**, +throwing away every successfully-generated description. The skip must be graceful — a skipped table costs +one missing description, not the entire stage's output. (This is the strongest argument for per-table +incremental persistence: the 283 good descriptions should have been durable the moment each was produced.) + +**Requirement:** persist enriched descriptions **incrementally** (per-table or per-batch) during the +descriptions stage, so that (a) tables that finished are durable even if the stage never completes, and +(b) a resumed ingest re-does only the *unfinished* tables, not the whole stage. The existing additive-write +design (spec 19 already preserves existing descriptions on re-ingest) is the foundation; this extends the +checkpoint granularity from once-per-stage to incremental. + +## Sketch (implementer to refine) + +- **Enforced timeout:** route enrichment-call cancellation through real termination — kill the codex/ + claude-code child process on timeout (reuse spec 16's child-kill mechanism), abort the HTTP request for + network backends. A fired `AbortSignal` must guarantee the await settles within a bounded grace period. +- **Sane default + the right tradeoff:** the default per-table timeout should be **moderate** (single-digit + minutes) with a small retry count, not very large — because the cost of a *hang* is the timeout value + itself, a long timeout is strictly worse for hangs. 
(The 30-min value used in the incident was an operator + override chosen to avoid cutting off slow-but-completing wide tables; with #1 enforced and incremental + checkpointing, a moderate default + skip is the better operating point.) +- **Incremental persistence:** flush descriptions per-batch (e.g. every N completed tables or on a timer) to + the same store/format used at stage completion; on resume, treat already-persisted tables as done and only + enrich the remainder. Keep it idempotent and additive (don't clobber prior descriptions). +- **Interaction with the stall watchdog:** with #1 enforced, no single table can starve progress for longer + than ~one timeout, so an external stall watchdog stops being the only backstop. + +## Generic use case (independent of the benchmark) + +Anyone ingesting a large or wide schema with an LLM enrichment backend (especially a *subprocess* backend, +which is the common local/desktop setup) will eventually hit a table whose description call hangs — a +provider stall, a rate-limit black-hole, a pathologically large prompt. Without an *enforced* timeout, one +such table wedges the whole ingest indefinitely; without *incremental* persistence, any interruption throws +away all the per-table LLM work already done (the dominant ingest cost). Both fixes make large-schema +enrichment **resilient and resumable** — a few bad tables degrade to a few skipped descriptions, not a +hung process and a from-scratch redo. This is core robustness for a general-purpose ingestion product, +wholly independent of any benchmark. + +## Benchmark context (motivation only — not a benchmark-specific rule) + +Surfaced during the Spider 2.0-Lite **BigQuery** ingest (2026-06-23, codex enrichment backend). 
Re-enriching +the giant public datasets, `covid19_usa` wedged at 268/285 for 41+ minutes on 2 hung 252-column tables; the +30-min per-table `AbortSignal` timeout never killed the hung codex children, and because descriptions +checkpoint only at stage completion, the 283 already-enriched tables were unrecoverable — the operator had +to kill, cache-bust, and re-ingest the db from scratch (with a short timeout as a stopgap). The benchmark +just exercised a large/wide multi-dataset ingest at scale; the gap and the fix are generic. diff --git a/spider2-specs/done/21-selective-enrichment-stages.md b/spider2-specs/done/21-selective-enrichment-stages.md new file mode 100644 index 00000000..6226fbf2 --- /dev/null +++ b/spider2-specs/done/21-selective-enrichment-stages.md @@ -0,0 +1,91 @@ +# 21 — Selective enrichment stages (`--stages`) + per-stage cache keys + +**Status:** draft (intake). Requirement-level; the implementer refines into `specs/21-*.md`. + +Follow-on to spec 19 (durable/resumable relationship detection) and spec 20 (resilient enrichment). +Those made enrichment *survivable and resumable*; this makes it *selectively re-runnable* — re-run one +enrichment stage without re-paying for the others. + +## Problem / requirement + +Enrichment has three stages — **`descriptions`** (per-table LLM text), **`embeddings`** +(sentence-transformers over the schema/descriptions), **`relationships`** (FK/join detection, optionally +LLM-proposed). Today you cannot re-run a *subset* of them, and three facts in the current code make a +targeted re-run impossible without a full, expensive re-enrich: + +1. **One coarse cache key gates all three stages.** `context/scan/local-enrichment.ts:611` computes a + single `inputHash` from `{snapshot, mode, detectRelationships, providerIdentity, relationshipSettings}`, + and all three stages reuse it (descriptions ~`:641`, embeddings ~`:672`, relationships ~`:728`). So + changing *any* one stage's inputs invalidates *every* stage's cache. 
Concretely: flipping + `scan.relationships.llmProposals`, switching the LLM backend, or upgrading the embeddings model forces + ktx to re-run the **expensive per-table descriptions** even though they didn't conceptually change. +2. **No CLI surface to select stages.** The enrichment internally already supports a relationships-only + path (`mode: 'relationships'`, which skips the description/embedding stages — they're gated on + `mode === 'enriched'`), but `ktx ingest` exposes no flag to invoke it (only `--no-query-history`). + The capability is built; it's just not reachable. +3. **The per-stage storage already exists** (`local_scan_enrichment_stages` PK `(connection_id, stage, + input_hash)`) and the **additive write already preserves existing descriptions** on re-ingest — so the + foundation for "touch one stage, keep the rest" is in place; only the key granularity and the CLI + surface are missing. + +**Requirement:** let an operator re-run a chosen subset of enrichment stages on already-ingested +connection(s), recomputing only those stages and **preserving the others' artifacts untouched** — cheaply, +without re-running unchanged (especially the costly `descriptions`) stages. + +## Design decisions (resolved during intake; implementer may refine) + +- **CLI flag: `--stages `** (plural). Accepts a comma-separated subset of + `descriptions,embeddings,relationships`; default = all three (current behaviour). Plural because it takes + a *set*; `--stages relationships` and `--stages descriptions,embeddings` both read naturally, and the + plural signals "list expected" (singular `--stage` implies exactly one). **Validate** the names — an + unknown stage is an error, never silently ignored. 
+- **Per-stage `inputHash`.** Split the single coarse hash so each stage keys on *only its own* inputs: + - `descriptions` → `{snapshot, mode, providerIdentity}` (NOT relationship settings, NOT embedding model) + - `embeddings` → `{snapshot, embeddings model/provider, + the description text it embeds}` + - `relationships`→ `{snapshot, relationshipSettings (incl. llmProposals), providerIdentity}` + Then flipping `llmProposals` invalidates only `relationships`; swapping the embeddings model invalidates + only `embeddings`; improving description prompts/LLM invalidates only `descriptions`. +- **Preserve-others semantics.** Stages not named in `--stages` are left exactly as on disk (additive write, + already the behaviour). A selective run never deletes another stage's artifacts. +- **Downstream-staleness handling.** Stages have a dependency order (`descriptions → embeddings`; + `relationships` depends only on the schema snapshot). Re-running `descriptions` alone can leave existing + `embeddings` semantically stale (they embedded the old text). The run must **warn** when a selected + re-run leaves an unselected downstream stage stale, and the operator can opt to cascade + (`--stages descriptions,embeddings`). Do not silently leave a stale-but-unflagged downstream. +- **`relationships` uses existing descriptions as context.** When re-running `relationships` only, the + stage should read the existing enriched schema (incl. on-disk `ai:` descriptions) so `llmProposals` has + full context — not just raw column names. +- **Scope:** the three enrichment stages for now. Design the stage-name namespace so it can later extend to + the broader scan phases (schema / query-history / source / memory) and subsume the inconsistent + `--no-query-history` negative flag, but that unification is out of scope here. 
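
The `--stages` validation, per-stage keying, and staleness warning described in the design decisions above can be sketched as follows. This is a minimal illustration under stated assumptions, not the ktx implementation: `parseStages`, `stageInputHash`, `stableStringify`, `staleDownstream`, and the `EnrichmentInputs` shape are hypothetical names, and the real code would go through the existing hash helper, whose signature is not shown in this spec.

```typescript
import { createHash } from "node:crypto";

const STAGES = ["descriptions", "embeddings", "relationships"] as const;
type Stage = (typeof STAGES)[number];

// Parse a comma-separated --stages value. Unknown names are an error, never ignored.
function parseStages(raw: string | undefined): Set<Stage> {
  if (raw === undefined) return new Set<Stage>(STAGES); // default: all three stages
  const names = raw.split(",").map((s) => s.trim()).filter((s) => s.length > 0);
  if (names.length === 0) throw new Error("--stages requires at least one stage name");
  const selected = new Set<Stage>();
  for (const name of names) {
    if (!(STAGES as readonly string[]).includes(name)) {
      throw new Error(`unknown stage "${name}"; expected one of: ${STAGES.join(", ")}`);
    }
    selected.add(name as Stage);
  }
  return selected;
}

// Key-order-independent serialization, so equal inputs always hash equally.
function stableStringify(value: unknown): string {
  if (Array.isArray(value)) return `[${value.map(stableStringify).join(",")}]`;
  if (value !== null && typeof value === "object") {
    const entries = Object.entries(value as Record<string, unknown>)
      .sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0))
      .map(([k, v]) => `${JSON.stringify(k)}:${stableStringify(v)}`);
    return `{${entries.join(",")}}`;
  }
  return JSON.stringify(value) ?? "null";
}

// Hypothetical input bundle; field names are assumptions for illustration.
interface EnrichmentInputs {
  snapshot: string; // assumed: a digest of the schema snapshot
  mode: string;
  providerIdentity: string;
  embeddingsModel: string;
  descriptionsText: string; // the description text the embeddings stage embeds
  relationshipSettings: { llmProposals: boolean };
}

// Each stage hashes only its own inputs, so unrelated changes do not invalidate it.
function stageInputHash(stage: Stage, i: EnrichmentInputs): string {
  const own =
    stage === "descriptions"
      ? { snapshot: i.snapshot, mode: i.mode, providerIdentity: i.providerIdentity }
      : stage === "embeddings"
        ? { snapshot: i.snapshot, embeddingsModel: i.embeddingsModel, descriptionsText: i.descriptionsText }
        : { snapshot: i.snapshot, relationshipSettings: i.relationshipSettings, providerIdentity: i.providerIdentity };
  return createHash("sha256").update(stableStringify({ stage, own })).digest("hex");
}

// Unselected downstream stages left stale by the selection (descriptions → embeddings).
function staleDownstream(selected: Set<Stage>): Stage[] {
  return selected.has("descriptions") && !selected.has("embeddings") ? ["embeddings"] : [];
}
```

With this split, flipping `llmProposals` changes only the `relationships` hash, and `staleDownstream(parseStages("descriptions"))` returns `["embeddings"]`, which is the case the run should warn about.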
+ +## Sketch (implementer to refine) + +- Add `--stages` to `ktx ingest`; parse+validate into a stage set; thread it to the enrichment entry so it + selects which stage blocks run (reuse the existing `mode`/stage gating — `mode: 'relationships'` is the + precedent). +- Replace the single `computeKtxScanEnrichmentInputHash` call with per-stage hash computation keyed on each + stage's own inputs; gate each stage's resume/skip on its own hash. +- Ensure selective runs read + preserve the on-disk enriched schema and write additively. +- Emit a clear staleness warning when an unselected downstream stage is invalidated by a selected one. + +## Generic use case (independent of the benchmark) + +Any team running ktx in production maintains its semantic layer over time: they improve description prompts +or switch the description LLM, upgrade the embeddings model, or turn on LLM-proposed joins. Today each of +those forces a **full re-enrich of every connection** — re-running the expensive per-table descriptions +even when only embeddings or relationships changed. Selective `--stages` re-runs make these routine +maintenance operations cheap and targeted: "re-embed everything on the new model" or "backfill joins now +that llmProposals is on" become a single fast pass that leaves the untouched stages — and their cost — +alone. This is core operability for a long-lived ingestion product and is wholly independent of any +benchmark. + +## Benchmark context (motivation only — not a benchmark-specific rule) + +Surfaced during the Spider 2.0-Lite multi-backend ingestion (2026-06-24). A level-aware audit found (a) a +tail of BigQuery dbs with poor *column*-description coverage (`google_dei` ~1%, `gnomAD`, `usfs_fia`, …) +that want a **`descriptions`-only** re-run with a longer timeout, and (b) a desire to **backfill joins** +across all already-ingested dbs after enabling `llmProposals` — without re-paying for descriptions. 
Both +were blocked by the coarse single `inputHash` (flipping `llmProposals` or re-describing would invalidate +the whole enrichment) and the absence of a stage-selective CLI flag. The benchmark just exercised +large-scale multi-backend ingestion; the gap and the fix are generic. diff --git a/spider2-specs/specs/01-connection-scoped-wiki.md b/spider2-specs/specs/01-connection-scoped-wiki.md new file mode 100644 index 00000000..1ffed215 --- /dev/null +++ b/spider2-specs/specs/01-connection-scoped-wiki.md @@ -0,0 +1,300 @@ +# Connection-scoped wiki pages + +> Refined spec. Intake draft: `todo/01-connection-scoped-wiki.md`. + +## Problem + +Wiki pages have only two scopes today: `GLOBAL` and `USER` +(`packages/cli/src/context/wiki/types.ts`, `WikiScope`). Scope is expressed by +directory (`wiki/global/.md`, `wiki/user//.md`) and the +search path filters by loading only the in-scope pages before any lane runs. +There is no way to associate a page with a **connection** (a warehouse/database +defined under `connections:` in `ktx.yaml`). + +In a project with many connections this causes two distinct failures: + +1. **Cross-database relevance pollution.** All pages share one search index, so + `wiki_search` for a generic term (`orders`, `revenue`, `average order + value`) surfaces pages written about the wrong database. Concept names + collide across databases constantly in real multi-connection projects + (several databases each with `orders`, `customers`, …). +2. **Silent overwrite on shared keys.** Page keys are a flat, global namespace. + The write path resolves a repeated key to the existing file and updates it + in place. So if the agent writes an `orders` page while ingesting database B + and an `orders` page already exists for database A, B's content **overwrites + A's** — same-concept pages for different databases cannot coexist today. 
+ +Today, when `memory_ingest` is called with a `connectionId`, that id only +scopes which semantic-layer sources the triage agent can see +(`memory-agent.service.ts`); it is **not** persisted on the resulting wiki page +and **not** validated against `ktx.yaml`. + +## Generic use case + +Any org with multiple databases/warehouses in one **ktx** project: org-wide +definitions ("fiscal year starts in February") should be visible everywhere, +while database-specific conventions ("in the events DB, `user_id` is the +anonymous device id, not the account id") should not pollute searches about +other databases — and two databases that both have an `orders` concept must be +able to keep separate, non-colliding pages. + +## Model + +`connections` is **additive frontmatter metadata**, orthogonal to the existing +`GLOBAL`/`USER` directory scope — not a third scope dimension: + +- A page is still `GLOBAL` or `USER` and lives where it lives today. It may + **additionally** carry a `connections` list. +- **Page keys remain a flat, globally-unique namespace.** `connections` does + **not** namespace keys; a page is addressable by key alone, unchanged. +- A page may list **multiple** connections. +- **Absent or empty `connections` ⇒ unscoped: the page applies to all + connections.** This is exactly today's behavior, so every existing page is + unaffected. + +This keeps `wiki_read` and refs untouched and adds no parallel scope axis; +filtering by connection is purely a search/relevance concern. + +## Requirements + +### 1. Frontmatter field + +Add an optional `connections` field to wiki page frontmatter — a list of +connection ids. + +- Accept a single string too; normalize to a list at parse time (reuse the + existing array-coercion helper used for `tags`/`refs`/`sl_refs`). +- Round-trips through parse/serialize without loss. +- Absent or empty ⇒ unscoped (see Model). Existing pages are unaffected by + construction. + +### 2. 
Page identity and key distinctness + +`connections` does not change how pages are identified or addressed: + +- Keys stay flat and globally unique; `wiki_read(key)` is unchanged. +- Because the write path updates a page in place when its key already exists, + same-concept pages for different connections **MUST** use distinct keys + (e.g. `orders_sales_db` vs `orders_events_db`). Connection-distinctive keys + for database-specific pages are the primary mechanism (driven by write-path + prompt guidance, requirement 5). +- **Data-loss guard (code, not prompt):** a connection-scoped write whose key + matches an existing page whose `connections` scope is **disjoint** from the + incoming scope MUST surface a collision instead of silently overwriting the + existing page. (Updating a page within the same connection scope, or + broadening/narrowing its own `connections`, is a normal update — not a + collision.) The implementer owns whether the collision is a hard error or a + suffixed new key; it must not be a silent clobber. + +### 3. Search filtering + +Add an optional connection filter to the search surfaces: + +- **MCP:** `wiki_search(query, connectionId?)` (`context-tools.ts`). +- **CLI:** `ktx wiki search` and `ktx wiki list` accept `--connection ` + (with `-c` alias), matching the `ktx sql` connection flag. + +Semantics: + +- With `connectionId: X` ⇒ return pages whose `connections` is empty + (unscoped) **∪** pages whose `connections` contains X. +- Without ⇒ current behavior, all pages. +- The filter **MUST** apply uniformly to **all three search lanes** (lexical + FTS5, semantic/embedding, token fallback) at the **candidate-source level**, + so each lane draws its full candidate pool from the already-scoped set. It + **MUST NOT** be a post-filter on the merged/ranked results — that would let + off-scope candidates consume both the per-lane pool and the final result + limit unevenly. 
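
The difference between filtering at the candidate-source level and post-filtering ranked results can be shown with a toy example. This is a sketch under assumptions: the implementation notes name a `pageMatchesConnection` predicate, but its signature here is guessed, and `Page`, `lexicalLane`, `searchScoped`, and `searchPostFiltered` are hypothetical stand-ins for the real lanes.

```typescript
interface Page {
  key: string;
  connections?: string[];
  body: string;
}

// unscoped ∪ pages listing the id; with no connectionId every page matches.
function pageMatchesConnection(page: Page, connectionId?: string): boolean {
  if (connectionId === undefined) return true;
  const scoped = page.connections ?? [];
  return scoped.length === 0 || scoped.includes(connectionId);
}

function countOccurrences(text: string, term: string): number {
  if (term.length === 0) return 0;
  return text.toLowerCase().split(term.toLowerCase()).length - 1;
}

// Toy lane: rank by occurrence count and keep the top `limit` hits.
function lexicalLane(pool: Page[], query: string, limit: number): Page[] {
  return pool
    .map((page) => ({ page, score: countOccurrences(page.body, query) }))
    .filter((hit) => hit.score > 0)
    .sort((a, b) => b.score - a.score)
    .slice(0, limit)
    .map((hit) => hit.page);
}

// Required behavior: scope the candidate pool first, then run the lane.
function searchScoped(pages: Page[], query: string, limit: number, connectionId?: string): Page[] {
  const pool = pages.filter((p) => pageMatchesConnection(p, connectionId));
  return lexicalLane(pool, query, limit);
}

// Forbidden behavior: run the lane over everything, then drop off-scope hits.
function searchPostFiltered(pages: Page[], query: string, limit: number, connectionId?: string): Page[] {
  return lexicalLane(pages, query, limit).filter((p) => pageMatchesConnection(p, connectionId));
}

const pages: Page[] = [
  { key: "orders_events_db", connections: ["events_db"], body: "orders orders orders" },
  { key: "glossary", body: "orders orders" },
  { key: "orders_sales_db", connections: ["sales_db"], body: "orders" },
];
```

With `limit = 1` and `connectionId = "sales_db"`, the post-filter variant spends its only slot on the higher-ranked `events_db` page and returns nothing, while the scoped variant returns the unscoped `glossary` page.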
+ +*Orientation:* the existing `GLOBAL`/`USER` scoping already filters at the +disk-load step that feeds both the in-memory token lane and the synced SQLite +index (`local-knowledge.ts`); the connection filter fits the same seam. + +### 4. Index persistence + +The `.ktx/db.sqlite` knowledge index is re-synced from files on every search. +The implementer owns whether to persist `connections` as index columns / a side +table, or to filter the loaded page-set before the per-search sync. The binding +requirement is the uniform-across-lanes behavior in requirement 3 — not a +specific schema. + +*Trade-off note (non-binding):* filtering the loaded page-set re-syncs only the +scoped subset and gives up a little embedding-cache reuse when searches +alternate between connections (recompute is one embedding per scoped page per +connection switch — negligible at the scale this targets). Persisting +`connections` in the index avoids that at the cost of a schema addition and a +per-lane predicate. Either is acceptable. + +### 5. Write path + +- The memory agent's page-write tool (`wiki-write.tool.ts`) accepts a + `connections` input field with the same REPLACE semantics as + `tags`/`refs`/`sl_refs`: omit ⇒ keep existing on update; `[]` ⇒ clear to + unscoped; `[ids]` ⇒ set. +- When `memory_ingest` / the memory agent runs with a `connectionId`, prompt + guidance directs the agent to: + - set `connections: [connectionId]` on new **database-specific** pages, using + connection-distinctive keys; and + - leave `connections` empty for clearly **org-wide** content. +- This is **prompt guidance, not a code auto-default.** A connection-scoped + ingest must remain able to produce unscoped org-wide pages, so the tool must + not force the session's `connectionId` onto every page. + +### 6. `wiki_read` and refs unchanged + +Pages remain addressable by key regardless of scoping. `wiki_read`, `refs`, and +`sl_refs` semantics are unchanged; `connections` is a search/relevance concern +only. 
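
The REPLACE semantics of requirement 5 and the disjoint-scope guard of requirement 2 can be sketched as two small pure functions. The names `normalizeConnections`, `resolveConnections`, and `isDisjointCollision` are hypothetical and not taken from `wiki-write.tool.ts`; the sketch only assumes that a scope is a list of connection ids and that an empty list means unscoped.

```typescript
// Accept a single string or a list; undefined means "field omitted".
function normalizeConnections(input: string | string[] | undefined): string[] | undefined {
  if (input === undefined) return undefined;
  const list = typeof input === "string" ? [input] : input;
  // Trim, drop empties, and de-duplicate while keeping first-seen order.
  return Array.from(new Set(list.map((id) => id.trim()).filter((id) => id.length > 0)));
}

// REPLACE semantics: omit ⇒ keep existing; [] ⇒ clear to unscoped; [ids] ⇒ set.
function resolveConnections(
  existing: string[] | undefined,
  incoming: string | string[] | undefined,
): string[] {
  const normalized = normalizeConnections(incoming);
  return normalized === undefined ? (existing ?? []) : normalized;
}

// Collision only when both scopes are non-empty and share no connection id.
// Same-scope, overlapping, broaden/narrow, and unscoped-existing updates pass.
function isDisjointCollision(
  existing: string[] | undefined,
  incoming: string[] | undefined,
): boolean {
  const current = existing ?? [];
  const next = incoming ?? [];
  if (current.length === 0 || next.length === 0) return false;
  return !next.some((id) => current.includes(id));
}
```

A write tool built this way would check `isDisjointCollision` before applying `resolveConnections`, and surface the collision to the caller instead of updating the page in place.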
+ +### 7. Validation + +Validation behavior splits by surface, because an explicit argument is a +typo-prone input while persisted content drifts independently of config: + +- **Explicit argument** — a connection id supplied as a command/tool argument + (`wiki_search`/`memory_ingest` `connectionId`, `ktx wiki … --connection`) + MUST be validated against `ktx.yaml` connections and **rejected with a clear + error listing the configured ids** when unknown. Reuse the canonical + `project.config.connections[id]` check. This also closes the current gap + where `memory_ingest`'s `connectionId` is accepted unvalidated. +- **Persisted frontmatter** — a connection id that appears only in a stored + page's `connections` and is not in `ktx.yaml` MUST **warn (not fail)** during + validation/doctor, and MUST NOT break loading, searching, or reading that + page. Config and content can evolve independently. + +### 8. Scope boundary + +This spec delivers the **mechanism** (frontmatter storage + uniform filter + +write surface + validation). Driving the agent to actually pass `connectionId` +during analytics work is the concern of +`03-multi-connection-routing-in-analytics-skill`. It composes with the +`--connection` flag on `ktx ingest` from `02-verbatim-ingest-mode`. + +## Acceptance criteria + +- A page with `connections: [db_a]` is returned by + `wiki_search(query, connectionId: "db_a")` and by an unfiltered search, but + **not** by `wiki_search(query, connectionId: "db_b")`. +- A page with no `connections` field is returned in all three cases above. +- Two pages — `orders_sales_db` (`connections: [sales_db]`) and + `orders_events_db` (`connections: [events_db]`) — coexist; a search scoped to + `sales_db` returns the first and not the second, and neither overwrote the + other on write. +- A connection-scoped write whose key matches an existing page scoped to a + **different** connection surfaces a collision instead of silently + overwriting (data-loss guard, requirement 2). 
+- Filtering works in each lane independently (test with embeddings disabled to + exercise the lexical and token lanes alone). +- `memory_ingest(content, connectionId)` produces a page scoped to that + connection for database-specific content. +- `wiki_search`/`ktx wiki search --connection ` fails with an error + that lists the configured connection ids. +- A page whose `connections` references an id absent from `ktx.yaml` produces a + warning but stays searchable and readable; search and read do not throw. +- `connections` accepts a single string and a list, both normalized to a list. +- Existing projects with no scoped pages and no `connectionId`/`--connection` + behave identically before/after. + +## Implementation orientation + +Line numbers drift; treat these as anchors, not addresses. The implementer owns +the design. + +- **Frontmatter type + parse/serialize:** `wiki/types.ts` (`WikiFrontmatter`), + `wiki/knowledge-wiki.service.ts` (`parsePage`/`serializePage`), array + coercion `wiki/local-knowledge.ts` (`stringArray`). +- **Search lanes + per-search re-sync:** `wiki/local-knowledge.ts` + (`searchLocalKnowledgePagesWithSqlite`; the disk-load step that already + scopes `GLOBAL`/`USER`; token lane), `wiki/sqlite-knowledge-index.ts` + (FTS5 `knowledge_pages_fts` lexical lane, semantic scan, `sync`). +- **MCP surface:** `mcp/context-tools.ts` (`wiki_search`, `wiki_read`, + `memory_ingest`; `connectionId` already present on `memory_ingest` but + unvalidated). +- **CLI surface:** `commands/knowledge-commands.ts` + (`ktx wiki search`/`list`/`read`); canonical `--connection` flag in + `commands/sql-commands.ts`; validation pattern + `project.config.connections[id]` in `mcp/local-project-ports.ts`. +- **Write path:** `wiki/tools/wiki-write.tool.ts` (input schema, REPLACE + semantics, scope decision), `memory/memory-agent.service.ts` (`connectionId` + threaded through the capture session and tool session; + `external_ingest` forces `GLOBAL` scope). 
+- **Connection config:** `context/project/config.ts` (`connections` record in + `ktx.yaml`). + +## Benchmark context (motivation only) + +Spider 2.0-Lite local subset = one project with ~30 SQLite connections whose +schemas share table/concept names (Northwind, sakila, two e-commerce DBs…). +External-knowledge docs (RFM definition, F1 overtake rules) are each relevant +to exactly one database and must not surface for the other 29. + +## Implementation notes + +Shipped on branch `write-feature-spec-wiki` (ktx worktree `tallinn-v2`). All +acceptance criteria covered; full package suite green (2924 passing), +type-check, knip/biome dead-code, and pre-commit clean. + +**What was built / where** + +1. **Frontmatter field (req 1).** `connections?: string[]` added to + `WikiFrontmatter` (`context/wiki/types.ts`) and to the file-layer page model + `LocalKnowledgePage` (`context/wiki/local-knowledge.ts`). Parsed via a new + `stringList()` coercion (single string → list); round-trips through both + serializers. Absent/empty ⇒ unscoped. +2. **Search/list filter (req 3, req 4).** `connectionId?` threaded through + `searchLocalKnowledgePages` → both the sqlite-FTS and scan impls → + `loadAllKnowledgePages`, and through `listLocalKnowledgePages`. The filter is + applied at the **disk-load seam** (`pageMatchesConnection`: unscoped ∪ pages + listing the id), so the token lane and the per-search SQLite sync (lexical + + semantic) both draw their candidate pool from the already-scoped set — + candidate-source level, not a post-filter. + - Chose req 4 **option B (filter the loaded page-set)** over persisting a + column. Verified-safe here: standalone ktx's memory agent reads pages from + files via a no-op `LocalKnowledgeIndex`, so `.ktx/db.sqlite`'s + `knowledge_pages` is a per-search cache that `searchLocalKnowledgePages` + rebuilds every call — scoping the sync corrupts no shared state. 
Only cost + is one embedding recompute per scoped page on a connection switch (the + spec's acknowledged, negligible trade-off). No index-schema change. +3. **Page identity + data-loss guard (req 2).** Keys stay flat/global; + `wiki_read`/refs unchanged. The write tool (`wiki/tools/wiki-write.tool.ts`) + rejects (hard error, no silent clobber) a connection-scoped write whose + incoming `connections` is **disjoint** from a same-key existing page's + non-empty `connections`, suggesting a connection-distinctive key. Same-scope, + overlapping, broaden/narrow, and unscoped-existing updates are allowed. + Chose a hard error over auto-suffixing so the conflict reaches the agent + (the decision-maker) instead of silently forking the key namespace. +4. **Write path (req 5).** `wiki_write` accepts `connections` (string or list) + with REPLACE semantics (omit ⇒ keep, `[]` ⇒ unscoped, `[ids]` ⇒ set); no + code auto-default of the session connection. Prompt guidance added to the + shared `wiki_capture` skill (new "Connection scoping" section) and the + `memory_agent_external_ingest` prompt. The session `connectionId` is now + surfaced to the agent so the guidance is actionable: in the memory-agent + prompt header and in the ingest work-unit `` block + (`build-wu-context.ts`, fed from `ingest-bundle.runner.ts`). +5. **Validation (req 7).** New shared helper + `context/connections/configured-connections.ts → assertConfiguredConnectionId` + validates explicit connection-id arguments against `ktx.yaml` and throws an + error listing the configured ids. Routed from all three explicit-arg + surfaces: MCP `wiki_search` (`local-project-ports.ts`), MCP `memory_ingest` + (validated at the boundary in `mcp-server-factory.ts` — this also closes the + prior gap where `memory_ingest`'s `connectionId` was accepted unvalidated), + and CLI `ktx wiki --connection`/`-c` (`commands/knowledge-commands.ts` + + `knowledge.ts`). 
Persisted-frontmatter ids absent from config are **warn-only**: + `listReferencedConnectionIds` + a non-fatal `ktx status` warning + (`status-project.ts`); loading/searching/reading never throw on them. + +**Deviations / notes** + +- Req 1 says "reuse the existing array-coercion helper used for `tags`/`refs`". + That helper (`stringArray`) is array-only and does **not** coerce a single + string; added a dedicated `stringList` for `connections` to meet the + single-string acceptance criterion rather than change `stringArray`'s + behavior for the other fields. +- **Scope boundary kept:** `discover_data` (MCP) also searches wiki and already + takes `connectionId`, but req 3/8 scope the filter to `wiki_search` + CLI, so + its wiki lane is intentionally left unscoped. Worth a follow-up if + `discover_data`'s wiki results should also be connection-scoped for + consistency. +- MCP tools-list snapshot and the `mcp-server-factory` test were updated for the + new `wiki_search.connectionId` param and the `memory_ingest` validation + wrapper (the port is no longer the raw service object; it delegates). diff --git a/spider2-specs/specs/02-verbatim-ingest-mode.md b/spider2-specs/specs/02-verbatim-ingest-mode.md new file mode 100644 index 00000000..a16645d8 --- /dev/null +++ b/spider2-specs/specs/02-verbatim-ingest-mode.md @@ -0,0 +1,327 @@ +# Verbatim ingest mode for authoritative documents + +> Refined spec. Intake draft: `todo/02-verbatim-ingest-mode.md`. + +## Problem + +`ktx ingest --text/--file` routes captured content through the memory agent. +`runKtxTextIngest` (`packages/cli/src/text-ingest.ts`) builds a +`MemoryAgentInput` with `sourceType: 'external_ingest'` and hands it to +`MemoryAgentService.ingest` (`context/memory/memory-agent.service.ts`), which +runs a multi-step LLM triage loop (≈30-step budget, content clipped to ~48k +chars) inside a session worktree. 
The agent decides — via the `wiki_write` +tool — what to persist, so it may **rewrite, condense, split, or re-title** the +content before it lands as a wiki page. The body is produced by an LLM, not +copied by code. + +For *authoritative* documents — formula definitions, metric specs, runbooks, +compliance text — paraphrasing is a defect, not a feature: + +- exact thresholds, constants, and rule wording must survive unchanged; +- lexical (BM25/FTS5) search works best when the stored text matches the + phrasing users and agents query with; +- ingestion should be deterministic and reproducible — the same input file + yields the same page, and re-running is safe. + +Two further gaps block authoritative ingest today: + +- The memory agent hard-requires an LLM backend + (`context/memory/local-memory.ts` throws when `llm.provider.backend: none` + and no runner is injected), so there is **no** offline ingest path at all. +- The agent's write tool *merges* a repeated same-scope key in place (REPLACE + frontmatter semantics in `wiki/tools/wiki-write.tool.ts`), i.e. exactly the + silent in-place rewrite an authoritative-document workflow must avoid. + +## Generic use case + +Any team ingesting documents that are already the source of truth: metric +definition sheets, SLA documents, calculation-methodology docs, regulatory +text. The user wants **ktx** to *index and surface* the document, not to +re-author it. Today they work around the memory agent by hand-writing +frontmatter and copying files into `wiki/global/`; verbatim mode makes that a +first-class, supported `ktx ingest` workflow. + +## Model + +`ktx ingest --verbatim` is a **distinct, code-driven ingest path**, not a +constrained prompt over the existing agent loop. Its defining invariants: + +- **The stored page body is the input document body, written by code.** The LLM + never produces, edits, or relays the body. It is confined to generating + *metadata* about the body. 
+- **Behavior follows from inputs, not from a mode prompt.** Whether metadata is + LLM-generated or derived offline follows from the configured backend + (`llm.provider.backend`), not from a second user-facing switch. +- **Pages are `GLOBAL`-scoped.** Verbatim ingest targets org/project + authoritative docs (the content teams copy into `wiki/global/` today). + Connection association is expressed by the **additive `connections` + frontmatter** from spec 01, never by directory. +- **Deterministic and idempotent.** The page key, the merged frontmatter, and + the stored body are all functions of the input alone (given a fixed backend), + so the same input produces the same page and a re-run is a safe no-op. + +### "Byte-for-byte" scope + +The guarantee is on the document's **interior**: no paraphrase, no condense, no +split, no re-title, no reflow, **no clipping**. The shared wiki store +canonicalizes *surrounding* whitespace — `parsePage` trims the body and +`serializePage` emits a single trailing newline +(`wiki/knowledge-wiki.service.ts`) — so leading/trailing blank lines are +normalized by the storage layer. Verbatim mode **MUST** write through that +shared `writePage`/`serializePage` path rather than fork a parallel serializer; +the interior bytes (thresholds, constants, wording) are what must be preserved +exactly, and they are. Acceptance hashes compare the stored body against the +**trimmed** input body. + +## Requirements + +### 1. Flag + +`ktx ingest --file --verbatim` and `ktx ingest --text +--verbatim`. `--verbatim` is a boolean that applies to every `--file`/`--text` +item in the invocation; each item becomes its own page. + +- It composes with the existing `--connection-id ` flag + (`commands/ingest-commands.ts`) so the resulting page can be + connection-scoped (see spec 01). **Note:** the intake draft wrote + `--connection`; the shipped flag is `--connection-id`. Use `--connection-id`. +- No new `--key` flag (see requirement 4). 
No second behavioral switch beyond + `--verbatim` itself. + +### 2. Body preservation is enforced by code, not by prompt + +The stored page body is the input content (interior preserved exactly, per +**Model → "Byte-for-byte" scope**). + +- Verbatim mode **MUST NOT** route the body through the memory-agent LLM loop + or any `wiki_write` tool call where a model could alter it. +- The LLM, when used, generates **only** metadata: `summary`, `tags`, and + `sl_refs`. A single constrained structured-output call (AI SDK v6 + `generateObject` with a `zod` schema) is the intended mechanism — the full + memory-agent loop, worktree, and squash-merge are **not** required and should + not be used. +- The page key is **not** LLM-generated (requirement 4). + +### 3. No clipping of the stored body + +The ~48k clip may apply only to the text **sent to the LLM** for metadata +generation. It **MUST NOT** apply to the text **written** to the page. A +document larger than the clip limit is stored in full; only its metadata is +derived from the clipped prefix. + +### 4. Deterministic page key + +The key is derived from the input, never chosen by the LLM (an LLM-chosen slug +would break determinism and the requirement-7 idempotency guarantee): + +- **`--file `** → `suggestFlatWikiKey(basename without extension)` + (`wiki/keys.ts`). This is the primary document case and is always + deterministic. +- **`--text `** → if the content opens with a Markdown heading, the + key is `suggestFlatWikiKey(heading text)`. If there is no leading heading, + **hard error**: inline verbatim text needs a leading heading to derive a + stable key, or should be passed as `--file`. +- No hash-based keys (unfindable) and no `--key` override flag. A real need for + explicit key control can add `--key` later. + +### 5. Frontmatter: passthrough + gap-fill + +If the input has its own YAML frontmatter, split it from the body: the body is +everything after the closing `---`; the frontmatter is authoritative metadata. 
+ +- **Passthrough.** Every input frontmatter field is preserved in the stored + page, **including fields not in `WikiFrontmatter`** (`effective_date`, + `version`, `owner`, …). The serializer `YAML.stringify`s the object, so + unknown keys round-trip. Dropping them would be silent data loss on + authoritative docs. +- **Gap-fill only.** Generated/derived metadata fills **absent** fields only; + it **MUST NOT** overwrite an explicit value. An input `summary:` is never + replaced by a generated one; explicit `tags`/`sl_refs` are likewise kept. +- **Defaults.** `usage_mode` defaults to `auto` (findable via search, not + force-injected) when the input does not set it. +- **Connection scoping.** `--connection-id X` (validated via + `assertConfiguredConnectionId`, `context/connections/configured-connections.ts`) + sets `connections: [X]` when the input frontmatter does not already declare + `connections`. If the input frontmatter declares a **different** + `connections` than the flag, **hard error** (ambiguous intent) rather than + silently choosing one. If they match, or only one source is present, proceed. + +### 6. Degraded mode (`llm.provider.backend: none`) + +`--verbatim` **MUST** work with no LLM backend — this is its capability the +regular agent ingest lacks. + +- `summary` is derived from the leading Markdown heading text, or, if none, the + first non-empty sentence of the body (trimmed to a reasonable length). +- `tags` and `sl_refs` are left empty. +- The body is still stored in full (requirement 3 applies unchanged). + +### 7. Key collisions: idempotent-if-identical, else hard error + +Verbatim mode does **not** reuse the agent write tool's in-place merge. Before +writing, read any existing `GLOBAL` page at the derived key: + +- **No existing page** → write. +- **Existing page, stored body identical** to the new body (compared after the + storage-layer normalization in **Model**) → **idempotent no-op success** + (re-running the same file is safe). 
+- **Existing page, body differs** → **hard error** naming the conflicting key + and directing the user to a distinct key. Never a silent overwrite, never an + auto-suffixed second page (which would produce the duplicated/divergent pages + this mode must avoid). + +### 8. LLM-failure handling + +When a backend **is** configured but the metadata call fails (rate limit, +transport error, malformed output after retries), **fail the item** (honoring +`--fail-fast` and the per-item exit-code aggregation in `text-ingest.ts`). +**MUST NOT** silently fall back to degraded derivation: a degraded page written +on a transient error would, under requirement 7, refuse to be replaced by a +healthy re-run — breaking reproducibility. Degraded derivation is reserved for +`backend: none`. + +### 9. Findability + +After write, the page is reindexed so search returns it: + +- `wiki_search` for a phrase taken from the document body returns the page via + the lexical lane (the body is indexed in `buildKnowledgeSearchText`). +- `wiki_search` for a paraphrase of the document's topic returns it via the + semantic lane **when embeddings are enabled** (this is what the generated + `summary`/`tags` buy over a bare degraded page). + +## Acceptance criteria + +- Ingesting a file with `--verbatim` produces a page whose body is + byte-identical to the trimmed input body (assert with a hash in tests). +- A >48k-char file is stored in full (assert stored body length ≥ input length + minus trim). +- Running the same `--verbatim` ingest twice is idempotent: one page, identical + bytes both times, no error on the second run. +- A second ingest to the same derived key with **different** body content fails + loudly (requirement 7) and does not modify the existing page or create a + suffixed one. +- Input frontmatter with an unknown field (e.g. `effective_date`) is preserved + in the stored page; an explicit input `summary` is **not** overwritten by a + generated one. 
+- With `llm.provider.backend: none`, `--verbatim` still produces a page: full + body stored, `summary` derived from the heading/first sentence, `tags` and + `sl_refs` empty. +- `--verbatim --connection-id X` yields a page with `connections: [X]`; an + unknown id is rejected with an error listing the configured ids. (Depends on + spec 01, now shipped.) +- `--verbatim --connection-id X` where the input frontmatter already declares a + different `connections` fails with an ambiguity error. +- `ktx ingest --text "no heading here" --verbatim` errors asking for a leading + heading or `--file`. +- `wiki_search` for a body phrase returns the page (lexical lane); for a topic + paraphrase it returns the page when embeddings are enabled (semantic lane). + +## Implementation orientation + +Line numbers drift; treat these as anchors, not addresses. The implementer owns +module layout and design, subject to the invariants above. + +- **Command flag:** `commands/ingest-commands.ts` (`ktx ingest` option table; + `--text`/`--file`/`--connection-id`/`--fail-fast` already present — add + `--verbatim` and thread it into `KtxTextIngestArgs`). +- **Orchestration:** `text-ingest.ts` (`runKtxTextIngest`, `loadItems`, + `validateItems`, per-item loop and exit-code aggregation). The verbatim flow + reuses item loading and replaces the `memoryIngest.ingest(...)` call with a + code-driven write for `--verbatim` items. Keep the new logic in a focused + module (e.g. a `verbatim-ingest` sibling) rather than swelling `text-ingest`. +- **Frontmatter split / write / serialize:** `wiki/knowledge-wiki.service.ts` + (`parsePage` for the `---…---` split shape, `serializePage`, `writePage`, + `readPage` for the collision check). Write through this shared path — do not + re-implement YAML framing. +- **Key derivation:** `wiki/keys.ts` (`suggestFlatWikiKey`, `assertFlatWikiKey`). 
+- **Frontmatter type:** `wiki/types.ts` (`WikiFrontmatter`; `summary` and + `usage_mode` are the required fields; unknown passthrough fields live + alongside). +- **Connection validation:** `context/connections/configured-connections.ts` + (`assertConfiguredConnectionId`, shipped with spec 01). +- **Metadata LLM call:** the local LLM runtime/config resolution in + `context/llm/` (e.g. `local-config.ts`; `backend: none` ⇒ no runtime). Use a + single `generateObject` call with a `zod` metadata schema; the `ai-sdk` skill + covers v6 patterns. +- **Reindex / search lanes:** `wiki/local-knowledge.ts` + (`loadAllKnowledgePages`, `buildKnowledgeSearchText`, the lexical/token/ + semantic lanes) and `wiki/sqlite-knowledge-index.ts` (`sync`). +- **Tests:** extend `packages/cli/test/text-ingest.test.ts` and add a + verbatim-focused test file covering the acceptance criteria above. + +## Benchmark context (motivation only) + +Spider 2.0-Lite ships 8 external-knowledge markdown docs (RFM bucket +definitions, the haversine formula, F1 overtake rules, …). Gold SQL was +authored against their **exact** text; an LLM paraphrase that drops a bucket +boundary or rounds a constant loses the corresponding question. The current +workaround is hand-writing frontmatter and copying files into `wiki/global/`. +Verbatim mode turns that manual step into a supported **ktx** workflow, and +composes with the connection scoping from spec 01 so a doc relevant to exactly +one of the benchmark's ~30 SQLite databases does not surface for the other 29. + +## Implementation notes + +Shipped on branch `write-feature-spec-wiki`. All acceptance criteria are covered +by tests and verified end-to-end through the linked `ktx-dev` binary. 
+ +**What was built** + +- New module `packages/cli/src/verbatim-ingest.ts`: `createLocalProjectVerbatimIngestor` + + `LocalVerbatimIngestor`, plus the pure helpers `splitInputDocument`, + `deriveVerbatimPageKey`, `deriveDegradedSummary`, and `buildVerbatimFrontmatter` + (the last four are `@internal` exports for unit testing). +- `--verbatim` flag added to `ktx ingest` in `commands/ingest-commands.ts`, with a + guard that rejects `--verbatim` without `--text`/`--file`. The flag is threaded + into `KtxTextIngestArgs.verbatim`. +- `text-ingest.ts` now tags each loaded item with an `origin` + (`file` / `text` / `stdin`) and, when `verbatim` is set, constructs the verbatim + ingestor once and branches the per-item loop to a code-driven write instead of + `memoryIngest.ingest(...)`. The shared view, exit-code aggregation, and + `--fail-fast` handling are reused. + +**Deviations from the literal spec (design refinements, per "implementer owns the design")** + +- *Metadata call.* The spec suggested raw AI SDK v6 `generateObject`. The + implementation routes through the existing `KtxLlmRuntimePort.generateObject` + instead — it is implemented by all three backends (ai-sdk, claude-code, codex), + and the ai-sdk one already wraps `generateText` + `Output.object({schema})`. + This realizes the spec's "single constrained structured-output call" intent via + the canonical cross-backend path rather than forking a second LLM entry point. +- *Reindex (requirement 9).* In the standalone CLI, `searchLocalKnowledgePages` + rebuilds the SQLite index from disk on every call (recomputing embeddings for + changed pages), so a written page is findable without a dedicated reindex step. + The write still goes through the shared `KnowledgeWikiService.writePage` + + `syncSinglePage` path, so the page is also eagerly indexed. 
+- *Gap-fill optimization.* The LLM is skipped entirely when the input frontmatter + already supplies `summary`, `tags`, and `sl_refs` (generated metadata only fills + absent fields, so there is nothing to generate). A fully specified document thus + ingests with a configured backend without any LLM call. + +**Tests** + +- `packages/cli/test/verbatim-ingest.test.ts` — helper units + ingestor integration + against a real `initKtxProject` git repo (byte-identical body hash, >48k no-clip, + idempotency, conflict hard-error, frontmatter passthrough, explicit-summary + preservation, degraded mode, connection scoping + unknown-id rejection + + ambiguity error, no-heading inline error, LLM gap-fill, LLM-failure-fails-item, + lexical + semantic findability). +- `packages/cli/test/text-ingest.test.ts` — verbatim routing, origin tagging, + connection-id forwarding, fail-fast. +- `packages/cli/test/index.test.ts` — `--verbatim` flag threading and the + requires-`--text`/`--file` guard. + +**Docs** + +- `docs-site/content/docs/cli-reference/ktx-ingest.mdx` (flag, "Verbatim ingest" + section, examples, common errors) and + `docs-site/content/docs/guides/writing-context.mdx` (authoritative-document + workflow). + +**Verification** + +- Full CLI suite: 2959 passed, 1 skipped. `pnpm run build` and `pnpm run dead-code` + (Biome + Knip default + production) clean; pre-commit clean on changed files. + A pre-existing, unrelated type error in `test/mcp-server-factory.test.ts` is + untouched — it predates this work. diff --git a/spider2-specs/specs/06-scan-tolerate-broken-objects.md b/spider2-specs/specs/06-scan-tolerate-broken-objects.md new file mode 100644 index 00000000..e64d87ef --- /dev/null +++ b/spider2-specs/specs/06-scan-tolerate-broken-objects.md @@ -0,0 +1,361 @@ +# Schema scan tolerates individual objects that fail introspection + +> Refined spec. Intake draft: `todo/06-scan-tolerate-broken-objects.md`. 
+ +## Problem + +A single broken or inaccessible object zeroes out an entire connection's +context. Schema introspection iterates objects with no per-object error +handling, so one throw aborts the whole scan, the live-database adapter's +`fetch()` rejects, and the connection ends with **no semantic layer at all** — +even when every other object was healthy. + +The failure surfaces in two phases, and the contract must hold in both: + +- **Metadata read (sqlite).** `connectors/sqlite/connector.ts` does + `rawTables.map((t) => this.readTable(...))` (≈ line 171) with no try/catch. + `readTable` runs `PRAGMA table_info()`, which *executes* a view's + body to resolve its columns — so a view over a dropped/renamed column (the + `oracle_sql` case: `emp_hire_periods_with_name` selecting `ehp.start_date` + from a base table that has no such column) raises `no such column: + ehp.start_date` and aborts introspection of all ~48 healthy objects. +- **Profiling read (warehouse drivers).** postgres/mysql/clickhouse/sqlserver/ + bigquery/snowflake read metadata in bulk from catalog / `information_schema` + (a broken view rarely breaks that), then fail when a per-object profiling or + sampling `SELECT` runs against a broken object. Enrichment sampling is + *already* isolated (`description-generation.ts` wraps `sampleTable` in + try/catch → `sampling_failed`), but mandatory introspection-phase reads are + not uniformly isolated across drivers. + +A second, related defect blocks the documented escape hatch. Setting +`enabled_tables: ["main.customers"]` on a sqlite connection produces a +different hard failure — `Adapter "database schema" did not recognize fetched +source output`. 
Root cause: the sqlite connector emits every object as +`{ db: null }` and filters the scope with `scopedTableNames(scope, { db: null })` +(`context/scan/table-ref.ts` ≈ line 47, `if (ref.db !== wantDb) continue`), but +`"main.customers"` parses to `{ db: "main", name: "customers" }` +(`context/scan/enabled-tables.ts`, `parseDottedTableEntry`). `"main" !== null`, +so the entry matches **nothing**, zero table files are written, and +`detectLiveDatabaseStagedDir` (`stage.ts` ≈ line 138) returns false, tripping +the generic "did not recognize fetched source output" error at +`context/ingest/local-stage-ingest.ts` (≈ line 291). The bare form +`enabled_tables: ["customers"]` would have worked; the `main.`-qualified form +silently matches nothing. + +## Generic use case + +Real warehouses routinely contain broken or inaccessible objects: views over +dropped/renamed columns, views referencing tables the connection role can't +read, permission-denied tables, and vendor system views that error on read. +**ktx** should ingest everything it *can* and skip what it can't, so one bad +object never zeroes out an entire connection's context. This is baseline +production robustness, independent of any benchmark — the same tolerance a +33-warehouse fleet needs the first time one of its databases has a stale view. + +## Design + +The unit of failure is **one object** (table or view). Introspecting or +profiling an object is an operation that can fail independently; a failure skips +that object, records a recoverable warning, and the scan continues from the +objects that succeeded. 
+ +Because seven Node connectors and the Python daemon each introspect differently +(sqlite reads metadata per-object via `PRAGMA`; warehouse drivers read metadata +in bulk and fail per-object during profiling), the **semantics** of "skip / +warn / total-failure" are defined **once** and every connector routes through +them — rather than seven copies of the same try/catch that drift apart: + +- A shared per-object helper in the `scan/` layer — the sibling of the existing + `tryConstraintQuery` (`context/scan/constraint-discovery.ts`) — wraps a single + object read and returns `{ ok: true, table } | { ok: false, warning }`, with a + standard warning code (e.g. `object_introspection_failed`). +- A shared post-check enforces the total-failure rule (R3) uniformly. +- Each connector keeps its **natural** shape: sqlite routes each `readTable` + through the helper; bulk-read drivers route their per-object profiling reads + through it. The contract is uniform; the loop is not forced to be. +- The Python daemon implements the **same contract** in its own helper, adds a + `warnings` field to `DatabaseIntrospectionResponse`, and the Node adapter maps + those warnings into `KtxSchemaSnapshot` (`daemon-introspection.ts`). + +The warning channel already exists end to end on the Node side +(`KtxSchemaSnapshot.warnings`, the `KtxScanWarning` shape with `table`/`column`/ +`recoverable`, the `KtxScanWarningCode` enum, and the staged `warnings.json` +artifact written by `writeLiveDatabaseSnapshot`); sqlite simply never populates +it. This spec makes that channel carry object-skip warnings and surfaces them in +the ingest summary, the persisted report body, and `ktx status`. + +## Requirements + +### R1 — Per-object isolation (the contract) + +If introspecting or profiling one object throws, the scan **MUST** skip that +object, record a `KtxScanWarning` (object name, the error message, and any +schema/catalog qualifier; `recoverable: true`), and continue with the remaining +objects. 
No single object may abort the scan. + +- The contract holds in **both** phases: the mandatory metadata read *and* any + profiling/row-count/sample read performed during introspection. +- It holds for **all seven Node connectors** + (`packages/cli/src/connectors//`) and the **Python daemon** postgres + path (R6). +- The semantics are defined once (the shared helper + warning code from the + Design section) and every connector routes through them. Do not inline a + divergent per-driver copy. +- Warnings **MUST NOT** carry secrets or full SQL bodies; record the object + identifier and the database's error text, redacted through the existing + `redactKtxSensitiveMetadata` path that `warnings.json` already uses. + +### R2 — Surface, don't hide + +Skipped objects **MUST** be reported both at ingest time and in the durable +status view: + +- **Ingest summary.** The `ktx ingest` run summary (human-facing output) reports + a count plus the object name and a short reason for each skip — e.g. + `Skipped 1 object — emp_hire_periods_with_name: no such column ehp.start_date`. +- **Run report.** Object skips land in the run report's `warnings.json` artifact + (already written) and in the persisted report body (`IngestReportBody`), whose + natural home is the existing `fetch?: SourceFetchReport` field — the fetch + phase *is* introspection. +- **`ktx status`.** `ktx status` shows a per-connection skipped-objects line for + the connection's latest ingest — e.g. `oracle_sql: 1 object skipped — + emp_hire_periods_with_name: no such column ehp.start_date`. This is **derived + from the latest persisted report, not new persisted state**: the report body + is already stored whole as a JSON blob (`local_ingest_reports.body_json`), so + surfacing it requires **no `.ktx/db.sqlite` schema migration** — `status` + reads and renders the skip info already present in the latest report body. A + connection whose latest ingest skipped nothing shows no such line. 
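A rough sketch of the "derived from the latest persisted report" idea in R2: the skip line is computed from the stored JSON body at render time, so nothing new is persisted. The `ReportBodySketch` shape, the `fetch.skipped` field, and both function names are assumptions made for illustration; the real report body may be laid out differently.

```typescript
// Hypothetical shapes: the real report body and its field names may differ.
interface SkippedObject {
  name: string;
  reason: string;
}

interface ReportBodySketch {
  fetch?: { skipped?: SkippedObject[] };
}

// Read skips out of an already-persisted JSON body; nothing new is stored.
function skippedFromBodyJson(bodyJson: string): SkippedObject[] {
  const body = JSON.parse(bodyJson) as ReportBodySketch;
  return body.fetch?.skipped ?? [];
}

// Returns null when nothing was skipped, so the caller prints no line at all.
function renderSkippedLine(connectionId: string, skipped: SkippedObject[]): string | null {
  if (skipped.length === 0) return null;
  const noun = skipped.length === 1 ? "object" : "objects";
  const details = skipped.map((s) => `${s.name}: ${s.reason}`).join("; ");
  // \u2014 is the dash used in the example output quoted in R2.
  return `${connectionId}: ${skipped.length} ${noun} skipped \u2014 ${details}`;
}
```

Returning `null` for the zero-skip case mirrors the rule that a connection whose latest ingest skipped nothing shows no such line.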
+ +### R3 — Failure semantics (partial vs total) + +Per-object skipping is **unconditional** — there is **no new config knob**, and +the existing `ingest.workUnits.failureMode` (which governs the later LLM +work-unit stage, not introspection) is untouched and orthogonal. Outcomes are +derived from object counts, not from a mode: + +| Scope | Objects discovered / matched | Introspection outcome | Result | +| --- | --- | --- | --- | +| none | 0 | n/a (legitimately empty DB) | **success**, empty layer | +| none | N > 0 | ≥ 1 succeeds | **success** + warnings for the rest | +| none | N > 0 | all N fail | **connection failure** (clear error) | +| `enabled_tables` | matches 0 objects | n/a | **clear scope error** (R5) | +| `enabled_tables` | matches M > 0 | ≥ 1 succeeds | **success** + warnings | +| `enabled_tables` | matches M > 0 | all M fail | **connection failure** | + +- "Connection failure" means the connector / `fetch()` raises a **clear, + actionable error** for that connection. It **MUST NOT** surface as the generic + `did not recognize fetched source output` (that message is reserved for a + genuinely unrecognized staged dir, not an empty/total-failure result). +- A total failure of one connection follows existing per-connection ingest + orchestration for whether sibling connections continue; this spec does not + change cross-connection behavior. + +### R4 — A broken view never blocks base tables + +A broken view **MUST NEVER** prevent base-table ingest. + +- View introspection failures are isolated exactly like any other object (R1). +- Mandatory introspection **MUST** prefer reading an object's structure from the + catalog where possible over executing the object's body, and **MUST NOT** run + a data-reading query (row count, sample) against a view as a required step. + (sqlite already skips `COUNT(*)` for views; the remaining gap is isolating the + metadata read that executes the view definition.) 
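The R3 outcome table reduces to a small pure function over object counts. The sketch below is one possible encoding, assuming a hypothetical `ScanCounts` input and `deriveScanOutcome` name; the error messages are placeholders, not the wording the connector would actually raise.

```typescript
// Hypothetical types for illustration; names are not taken from the codebase.
type ScanOutcome =
  | { kind: "success"; warnings: number }
  | { kind: "connection_failure"; message: string }
  | { kind: "scope_error"; message: string };

interface ScanCounts {
  /** True when a non-empty `enabled_tables` is configured. */
  scoped: boolean;
  /** Objects discovered (no scope) or matched by the scope. */
  matched: number;
  /** Objects whose introspection succeeded. */
  succeeded: number;
}

function deriveScanOutcome(c: ScanCounts): ScanOutcome {
  if (c.matched === 0) {
    // Zero objects is an error only when the user asked for specific ones.
    return c.scoped
      ? { kind: "scope_error", message: "enabled_tables matched no objects" }
      : { kind: "success", warnings: 0 };
  }
  if (c.succeeded === 0) {
    return {
      kind: "connection_failure",
      message: `all ${c.matched} objects failed introspection`,
    };
  }
  // Partial success: every failed object rides along as a warning.
  return { kind: "success", warnings: c.matched - c.succeeded };
}
```

Because the result depends only on counts and on whether a scope is set, no mode flag or config knob is needed, which matches the "unconditional" wording of R3.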
+ +### R5 — `enabled_tables` allowlist works + +The documented allowlist escape hatch **MUST** reliably restrict the scan to the +listed objects, with no spurious adapter error: + +- **sqlite qualification.** The schema-qualified form `"main."` **MUST** + resolve to the same object as the bare form `""` (sqlite's sole schema + is `main`; the connector emits `db: null`). Both forms select the object; + neither silently matches nothing. +- **Documented format.** The accepted qualification forms for each driver + (`catalog.db.name` / `db.name` / `name`) and the sqlite-specific `main` + equivalence **MUST** be documented where `enabled_tables` is described + (`context/project/driver-schemas.ts` and the user-facing config docs). +- **Zero-match is a clear error.** A non-empty `enabled_tables` that resolves to + **zero** matched objects **MUST** fail with an actionable error naming the + connection, the unmatched entries, and the available object names — **not** the + generic `did not recognize fetched source output`. This is distinct from a + legitimately empty database (R3 row 1) and from a matched-but-all-broken scope + (R3 last row). +- **Any subset works.** An `enabled_tables` matching M > 0 objects ingests + **exactly** those M objects (minus any that fail per R1), with no adapter + recognition error regardless of how small or edge-case the set is. + +### R6 — Python daemon parity + +The daemon's postgres introspection path **MUST** honor the same contract: + +- Add a `warnings` field to `DatabaseIntrospectionResponse` + (`python/ktx-daemon/src/ktx_daemon/database_introspection.py`) carrying the + same shape Node expects (code, message, object identifier, recoverable). +- Isolate per-object failures in the daemon's introspection so one broken object + does not abort the response; apply the R3 total-failure rule there too. 
+- Map daemon warnings into `KtxSchemaSnapshot.warnings` in + `mapDaemonSnapshot` (`context/ingest/adapters/live-database/daemon-introspection.ts`), + which currently drops them. +- The Node and Python warning shapes **MUST** stay in parity (the codebase + already mirrors Node↔Python schemas for telemetry; follow the same discipline + so the daemon cannot emit a code Node can't render). + +## Acceptance criteria + +- Ingesting a sqlite DB with one broken view + N healthy tables yields a + semantic layer for the N healthy tables and **exactly one** warning naming the + broken view and its error; exit is **success**. +- The skipped object appears in the `ktx ingest` summary output, in the run's + `warnings.json`, and in `ktx status` as a per-connection skipped-objects line + on the connection's latest ingest. +- A sqlite DB in which **every** discovered object fails introspection (and the + file opens) exits as a **connection failure** with a clear error — not an + empty "success" and not `did not recognize fetched source output`. +- A genuinely empty sqlite DB (zero objects) exits **success** with an empty + layer (not a failure). +- `enabled_tables: ["main.customers"]` and `enabled_tables: ["customers"]` both + ingest exactly the `customers` object on a sqlite connection. +- `enabled_tables` restricted to a valid subset of M objects ingests exactly + that subset, with **no** adapter-output error. +- `enabled_tables` that matches zero objects fails with an error naming the + connection, the unmatched entries, and available objects — distinguishable + from the empty-DB and all-broken cases. +- A broken view does not prevent ingest of base tables in the same connection + (regression test with a view that errors on read alongside a healthy table). +- The daemon's `DatabaseIntrospectionResponse` carries a `warnings` array, and a + per-object failure in the daemon path produces a warning mapped into + `KtxSchemaSnapshot.warnings` (Node↔Python parity test). 
+- A warehouse-driver object whose profiling/sample read fails is skipped with a + warning and does not abort introspection of its siblings. +- Existing healthy-only ingests (no broken objects, no `enabled_tables`) behave + identically before/after — no warnings, same semantic layer. + +## Implementation orientation + +Line numbers drift; treat these as anchors, not addresses. The implementer owns +the design. + +- **Shared semantics:** `context/scan/constraint-discovery.ts` + (`tryConstraintQuery` / `constraintDiscoveryWarning` — the precedent to mirror + for the per-object helper), `context/scan/types.ts` + (`KtxSchemaSnapshot.warnings`, `KtxScanWarning`, `KtxScanWarningCode` — add the + new object-skip code here). +- **Node connectors:** `packages/cli/src/connectors//connector.ts` and + each `live-database-introspection.ts`. sqlite's loop is + `connectors/sqlite/connector.ts` `introspect` (≈ line 158) → `readTable` + (≈ line 306); the missing try/catch is the `rawTables.map(...)` at ≈ line 171. + Existing per-table sample isolation precedent: `description-generation.ts` + (≈ line 867, `sampling_failed`). +- **Driver dispatch:** `packages/cli/src/local-adapters.ts` (≈ lines 122-156) + routes every driver to its Node connector; the daemon is the `else` fallback. +- **`enabled_tables` matching:** `context/scan/enabled-tables.ts` + (`resolveEnabledTables`, `parseDottedTableEntry`), `context/scan/table-ref.ts` + (`scopedTableNames`, the `ref.db !== wantDb` filter ≈ line 47), + `context/project/driver-schemas.ts` (`enabled_tables` schema + description). +- **Staging / detect / error surface:** + `context/ingest/adapters/live-database/stage.ts` + (`writeLiveDatabaseSnapshot`, `warningArtifact` ≈ line 94, + `detectLiveDatabaseStagedDir` ≈ line 138), + `context/ingest/local-stage-ingest.ts` (the + `did not recognize fetched source output` throw ≈ line 291 — must stop being + the surface for empty-scope and total-failure). 
+- **Ingest summary:** `packages/cli/src/ingest.ts` (`writeReportStatus` + ≈ line 202), `context/ingest/memory-flow/summary.ts` + (`formatMemoryFlowFinalSummary`) — thread object skips into the human-facing + summary. +- **Report body + `ktx status`:** `context/ingest/reports.ts` (`IngestReportBody`; + `SourceFetchReport` as the home for scan warnings), + `context/ingest/sqlite-local-ingest-store.ts` (the report body is persisted + whole as `body_json` ≈ line 90 — no migration needed), `status-project.ts` + (`buildLocalStatsStatus` reads `local_ingest_reports`; parse the latest body + per connection and render the skipped line via `renderLocalStatsAsLines`). +- **Daemon path:** `python/ktx-daemon/src/ktx_daemon/database_introspection.py` + (`DatabaseIntrospectionResponse` ≈ line 165, `introspect_database_response` + ≈ line 323, `_load_postgres_rows` ≈ line 227, `_map_rows_to_tables` + ≈ line 267), and the Node mapping in + `context/ingest/adapters/live-database/daemon-introspection.ts` + (`mapDaemonSnapshot` ≈ line 209). + +## Benchmark context (motivation only) + +`oracle_sql` (8 of the 135 local sqlite questions) currently has **no** semantic +layer because of its one broken view, so those questions fall back to raw +`sql`-tool introspection instead of ktx's enriched context. Tolerant scanning +restores enriched context for that database. The same robustness is required for +the full Spider 2.0-Lite run across BigQuery and Snowflake, where broken or +permission-restricted objects are common and a single one must not zero out a +warehouse's context. + +## Implementation notes + +Shipped on branch `write-feature-spec-wiki`. All requirements implemented; +verified with `pnpm --filter @kaelio/ktx run test` (2981 passing), +`pnpm run dead-code`, `uv run pytest python/ktx-daemon/tests` (97 passing), +`uv run pre-commit`, and `pnpm run build && pnpm run link:dev`. 
+ +**Shared semantics (R1).** New `context/scan/object-introspection.ts` exposes +`tryIntrospectObject(ctx, fn)` (sibling of `tryConstraintQuery`), returning +`{ ok, table } | { ok: false, warning }` and building an +`object_introspection_failed` warning (object name + redactable DB error). It +rethrows native programming faults (`isNativeProgrammingFault`) so a ktx bug is +never masked as an object skip. The new warning code was added to +`KtxScanWarningCode` (`scan/types.ts`), the `scanWarningCodes` allowlist +(`local-structural-artifacts.ts`, plus a new exported `isKtxScanWarningCode` +validator), and `describeWarningGroup` (`scan.ts`). + +**Per-object isolation, where it actually exists (R1/R4).** Only sqlite +(`readTable` via `PRAGMA`) and bigquery (`tableRef.get()` per dataset) do +per-object reads during *mandatory* introspection; both now route each object +through `tryIntrospectObject`. The other five Node connectors (postgres, mysql, +clickhouse, sqlserver, snowflake) read metadata in bulk from the catalog/ +`information_schema` (already object-safe at this phase) and isolate per-object +profiling/sampling in the enrichment phase (`description-generation.ts`, +`sampling_failed`), so no divergent per-driver try/catch was added there. sqlite +also tolerates a `COUNT(*)` (profiling) failure without dropping a +structurally-readable table, and a broken view's metadata read is isolated so it +never blocks base tables (R4). 
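For orientation, the per-object contract can be sketched as follows. This is a simplified, synchronous illustration and not the shipped `tryIntrospectObject`: the `Sketch` names, the `ScanWarning` fields, and the rule inside `looksLikeProgrammingFault` (a stand-in for `isNativeProgrammingFault`) are all assumptions, and redaction of the error text is omitted.

```typescript
// Hypothetical warning shape; the real KtxScanWarning may carry more fields.
interface ScanWarning {
  code: "object_introspection_failed";
  table: string;
  message: string;
  recoverable: true;
}

type ObjectResult<T> =
  | { ok: true; table: T }
  | { ok: false; warning: ScanWarning };

// Assumed rule: built-in error types that usually signal a bug in our own code.
function looksLikeProgrammingFault(err: unknown): boolean {
  return (
    err instanceof TypeError ||
    err instanceof ReferenceError ||
    err instanceof RangeError ||
    err instanceof SyntaxError
  );
}

function tryIntrospectObjectSketch<T>(objectName: string, read: () => T): ObjectResult<T> {
  try {
    return { ok: true, table: read() };
  } catch (err) {
    // Never mask a bug as an object skip: rethrow programming faults.
    if (looksLikeProgrammingFault(err)) throw err;
    const message = err instanceof Error ? err.message : String(err);
    return {
      ok: false,
      warning: { code: "object_introspection_failed", table: objectName, message, recoverable: true },
    };
  }
}

// Collect healthy objects and warnings in one pass; no object aborts the loop.
function introspectAllSketch<T>(names: string[], read: (name: string) => T) {
  const tables: T[] = [];
  const warnings: ScanWarning[] = [];
  for (const name of names) {
    const result = tryIntrospectObjectSketch(name, () => read(name));
    if (result.ok) tables.push(result.table);
    else warnings.push(result.warning);
  }
  return { tables, warnings };
}
```

A connector that reads objects one at time would call the helper inside its own loop, while a bulk-read driver would only wrap the reads that are genuinely per object.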
+ +**Single-source outcome decision (R3/R5).** New +`adapters/live-database/scan-outcome.ts#assertLiveDatabaseScanOutcome` runs once +in `LiveDatabaseSourceAdapter.fetch()` — the one path every driver (and the +daemon) routes through — and derives the outcome from the snapshot + scope: +≥1 object → success (skips ride along as warnings); all matched objects failed → +clear `KtxExpectedError`; non-empty `enabled_tables` matched nothing → clear +zero-match error naming the connection, the requested entries, and the available +objects (sqlite/bigquery attach the discovered inventory via +`metadata.discovered_object_names`); empty database (no scope) → success with an +empty layer. `detectLiveDatabaseStagedDir` no longer requires table files, so a +valid empty staging is recognized; total-failure/zero-match now throw a clear +connection error before staging instead of surfacing the generic +`did not recognize fetched source output`. + +**`enabled_tables` matching (R5).** Normalized at the scope boundary in +`resolveEnabledTables` using `connection.driver`: for sqlite, `main.` → +`{ db: null }`, so `"main.customers"` and `"customers"` select the same object. +`table-ref.ts` stayed generic. Documented in `driver-schemas.ts` and +`docs-site/.../configuration/ktx-yaml.mdx`. + +**Surfacing (R2).** Deviation from the spec's orientation: live-database schema +ingest runs through the **stage-only** path (`runLocalStageOnlyIngest` → +`local_ingest_reports`), not the bundle runner, so the home for scan warnings is +`LocalIngestRunRecord.fetch` (a new `SourceFetchReport` field; `body_json` is +persisted whole, so **no migration**), not the bundle-only +`IngestReportBody.fetch`. Both ingest paths read `adapter.readFetchReport` +(`live-database/fetch-report.ts` derives skips from the existing `warnings.json`). 
+The ingest summary is already rendered by `runKtxScan` from `report.warnings` +(the new `describeWarningGroup` case), and `ktx status` +(`status-project.ts#buildLocalStatsStatus`/`renderLocalStats`) now parses the +latest report body per connection and prints a per-connection +`N object(s) skipped — name: reason` line. + +**Daemon parity (R6).** `database_introspection.py` adds a `warnings` field to +`DatabaseIntrospectionResponse` and a `DatabaseIntrospectionWarning` model, +isolates per-object failures in `_map_rows_to_tables`, and shares the +`OBJECT_INTROSPECTION_FAILED_CODE = "object_introspection_failed"` string with +Node. `mapDaemonSnapshot` maps `raw.warnings` into `KtxSchemaSnapshot.warnings`, +dropping any code Node cannot render (validated via `isKtxScanWarningCode`). +Deviation: the daemon does **not** re-enforce the R3 total-failure rule — the +shared Node post-check (`assertLiveDatabaseScanOutcome`) owns it for every driver +including the daemon, avoiding a divergent second implementation. Parity is +covered by a Node test (daemon-shaped warning round-trips) and a pytest +(per-object failure → warning with the shared code). diff --git a/spider2-specs/specs/07-analytics-skill-sql-craft.md b/spider2-specs/specs/07-analytics-skill-sql-craft.md new file mode 100644 index 00000000..023780d5 --- /dev/null +++ b/spider2-specs/specs/07-analytics-skill-sql-craft.md @@ -0,0 +1,363 @@ +# Add universal SQL-authoring craft to the ktx-analytics skill + +> Refined spec. Intake draft: `todo/07-analytics-skill-sql-craft.md`. + +## Problem + +The shipped `ktx-analytics` skill +(`packages/cli/src/skills/analytics/SKILL.md`) is an *orchestration* guide: its +`` and `` tell the agent **which ktx tools to call and in what +order** (`discover_data` → `entity_details`/`sl_read_source` → +`sl_query`/`sql_execution` → validate → `memory_ingest`). It says almost nothing +about **writing correct SQL**. 
+ +That gap shows up as a specific failure shape: the agent reliably produces +*runnable* SQL but *wrong* results. The recurring defects are universal +analytics-engineering mistakes, not ktx-specific ones: + +- comparing a string column to a numeric literal (or vice versa), which can + silently match zero rows; +- rounding inside intermediate CTEs, so the final number is off; +- ranking/“first”/“most recent” windows with no deterministic tie-breaker, so + results flicker run to run; +- filtering *before* a window function for sequence/“since”/“first” questions, + truncating the partition the window should see; +- returning a full ranked list for a “top/highest” question, or collapsing a + “per X” question to a single value; +- dropping the inputs (or the entity identifier) a derived value was built from. + +These are correctness defects every ktx user hits on a live database. They +belong in the shipped skill — fixing them once improves ktx for everyone, rather +than living in any individual caller’s prompt. + +## Generic use case + +An analyst (human or agent) points ktx at a **live, production** database and +asks a real analytical question — “what’s the most recent order per customer”, +“top region by margin”, “average order value by month”. The schema is unfamiliar +(unknown date encodings, nullable join keys, string-typed numeric columns), the +question carries grain and ranking intent in its wording, and the answer must be +*correct and deterministic*, not merely executable. The skill should encode the +analytics-engineering craft that makes the difference between a query that runs +and a query that’s right — independent of any benchmark. + +## Model + +The change is **additive content in one Markdown file**, governed by these +invariants. They constrain the implementer; the exact prose is theirs. + +### Inline-only delivery (this is a hard constraint, not a style preference) + +All new guidance lives **inside `skills/analytics/SKILL.md`**. 
A bundled +`reference/*.md` file (the progressive-disclosure pattern Anthropic’s +skill-authoring guide recommends for large skills) **MUST NOT** be used here, +because the delivery mechanism ships only `SKILL.md`: + +- `setup-agents.ts` installs the analytics skill via `readAnalyticsSkillContent()`, + which reads **only** `./skills/analytics/SKILL.md` and writes a **single** file + per target: `.claude/skills/ktx-analytics/SKILL.md` (Claude Code), the Codex / + universal `.agents` equivalent, a **flattened** single rules file for Cursor + (`.cursor/rules/ktx-analytics.mdc`) and OpenCode + (`.opencode/commands/ktx-analytics.md`), and a Claude Desktop **zip that + contains only `ktx-analytics/SKILL.md`** (`writeClaudeDesktopSkillBundle`). +- Nothing copies sibling files or subdirectories. A reference file would dangle + on every target, and the Cursor/OpenCode flatten-to-one-file shape cannot + represent a multi-file skill at all. + +The skill is small enough that inline costs nothing meaningful: ~67 lines today +plus ~60 of craft is well under the 500-line budget. And this craft is **core +content** — consulted on every SQL-authoring turn — so even if multi-file delivery +existed it would still belong inline: progressive disclosure only pays off for +large, *conditionally-relevant* reference material loaded on demand, not for +always-needed craft. + +Multi-file skill *delivery* is a legitimate future enhancement, but it must be +**pulled by a concrete need, not built ahead of one** — no shipped skill today +exceeds the budget (largest is ~346 lines) or uses a bundled reference. The first +real trigger is the **per-dialect SQL syntax follow-up** +(`todo/08-per-dialect-sql-syntax-notes.md`), whose load-on-demand +`reference/.md` content is a genuine progressive-disclosure fit. 
When +that work is scoped, note that multi-file delivery is **not** a simple directory +copy: `setup-agents.ts` flattens the skill to a *single* file for Cursor +(`.mdc`) and OpenCode (`.md`), so those targets need a concatenation transform, +and uninstall needs per-file manifest entries. Recording the constraint here so a +future implementer does not “improve” this inline content into a bundled +reference that dangles on every target. + +### Heuristics with a generic *why*, not a wall of MUSTs + +The new rules are phrased as **heuristics with a one-line, universal rationale**, +because SQL authoring is a high-freedom task (many valid approaches, choice +depends on the question and the data). A bare imperative overfits; a rule plus +its *why* lets the model apply judgment and generalize. This follows Anthropic’s +own skill-authoring guidance (“if you find yourself writing ALWAYS/NEVER in all +caps or rigid structures, reframe and explain the reasoning”). + +This **reconciles the draft’s “behavior only, no rationale” instruction**: the +prohibition is specifically on rationale that references a **grader, gold answer, +or the benchmark**. *Generic analytics-engineering rationale is required* — e.g. +“…so `RANK`/`ROW_NUMBER` results don’t flicker across runs”, “…a string-vs-number +compare can silently match nothing”. That is a universal truth, not a +grader reference. + +### Dialect-agnostic + +Every rule must read correctly on any SQL dialect a ktx connection might use. +**No dialect-specific syntax** — not `QUALIFY` (Snowflake/BigQuery/DuckDB only), +not `strftime`/`julianday` (sqlite), not backtick/`DB.SCHEMA.TABLE` FQTNs. +Per-dialect syntax notes are a **separate follow-up** living in a dialect-aware +(per-driver) location, explicitly out of scope here. 
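The dialect-agnostic rule above lends itself to a mechanical check. The following is a minimal sketch, not the shipped test: `findDialectLeaks`, `BANNED_DIALECT_TOKENS`, and `BACKTICK_FQTN` are hypothetical names introduced here for illustration, and the token list covers only the constructs this section names.

```typescript
// Hypothetical helper: none of these names exist in the ktx codebase.
// The list mirrors the single-dialect constructs named in the rule above.
const BANNED_DIALECT_TOKENS: string[] = ["QUALIFY", "strftime", "julianday"];

// Matches a backtick-quoted three-part name such as `project.dataset.table`.
const BACKTICK_FQTN = /`[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+`/;

// Returns a label for every single-dialect construct found in the text.
function findDialectLeaks(skillText: string): string[] {
  const leaks: string[] = [];
  BANNED_DIALECT_TOKENS.forEach(function (token) {
    // Whole-word, case-insensitive match, so lower-case `qualify` is caught too.
    const pattern = new RegExp("\\b" + token + "\\b", "i");
    if (pattern.test(skillText)) {
      leaks.push(token);
    }
  });
  if (BACKTICK_FQTN.test(skillText)) {
    leaks.push("backtick FQTN");
  }
  return leaks;
}
```

One caveat of this sketch: the whole-word match would also flag the English verb "qualify" in ordinary prose, so a real content test may prefer a case-sensitive match on the literal tokens.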
+ +### Discovery craft attaches to discovery; authoring craft to query/validate + +Two of the draft’s rules (inspect sample rows; cast before comparing) are +*schema-discovery* concerns that happen **before** SQL is composed. They belong +with the discovery steps of the existing workflow, not only at the query step. +The rest (composition, window correctness, precision, completeness) belong with +the query/validate steps. The draft’s “extend step 5/6” is the right home for +most rules but is slightly off for the discovery pair; this spec corrects that. + +### Additive only + +The existing ``, ``, and `` — compact result tables, +summaries, clarification prompts, the tool-order workflow, the `connectionId` +scoping rules — are preserved unchanged. The skill must still read well for an +interactive, human-facing analysis session. + +## Requirements + +### 1. Placement and structure + +Add a dedicated, scannable craft section to `SKILL.md`: + +- A new top-level block — `` (sibling to ``/``) — with + **five sub-headings**: *Schema discovery*, *Composition*, *Window functions*, + *Numeric precision*, *Answer completeness*. Sub-headings keep the block + scannable (the draft’s “group under clear sub-headings” goal). +- **Pointers, not duplication.** Step 5 (“Query”) and step 6 (“Validate and + explain”) each gain a **one-line pointer** into `` rather than + inlining the rules (state each rule once; Anthropic’s “consistent terminology / + don’t repeat” guidance). The schema-discovery pair is additionally reflected as + a brief cue in the discovery steps (step 2 “Inspect” / step 4 “Plan”), pointing + to the same block. +- No new tool, flag, or config. This is content only. + +### 2. The craft rules (all fourteen behaviors, grouped) + +Every behavior from the intake draft must be represented. Tightly-related ones +**may** be merged into a single bullet where that reads better; none may be +dropped. Each carries a generic *why* (per Model). Dialect-agnostic throughout. 
+ +**Schema discovery** (cue in steps 2/4; lives in ``) +1. Inspect representative **sample rows** of each table before composing SQL — + confirm date/time encoding (`YYYYMMDD` vs ISO vs epoch), null prevalence in + join/filter keys, and the real set of categorical/enum values + (`entity_details` + a small `sql_execution` sample). *Why:* assumptions about + encoding and nullability are the most common source of silently-wrong filters. +2. **Cast a column to its real type before comparing** it in `WHERE`/`JOIN`. A + string column compared to a numeric literal (or vice versa) can silently match + nothing. + +**Composition** +3. Build complex queries **incrementally** — one CTE at a time, verifying each + layer’s output on a small sample before stacking the next. *Why:* a wrong + intermediate layer is far cheaper to catch early than to debug in the final + result. +4. **Avoid fan-out joins.** Add columns only from tables already at the target + grain, or **pre-aggregate** to that grain before joining. *Why:* a join that + multiplies rows quietly inflates every downstream `SUM`/`COUNT`. + +**Window functions** +5. Give every ranking/ordering window function a **complete, deterministic + tie-breaker** (append unique key columns to `ORDER BY`), so + `RANK`/`ROW_NUMBER`/`LAG` are stable rather than flickering across runs. +6. For sequence / “first” / “most recent” / “since” questions, **filter after the + window**, not before: compute over the full partition, then keep the rows you + want. *Why:* a pre-filter shrinks the partition the window ranks over, so + “first”/“most recent” is computed against the wrong set. (See the worked + example, requirement 3.) + +**Numeric precision** +7. Compute at **full precision; round only in the final projection**, never inside + intermediate CTEs. +8. Be **explicit about truncation** — `CAST AS INT` truncates; use explicit + rounding when rounding is intended. (May merge with rule 7.) +9. 
Distinguish **macro vs micro averages** based on the question’s wording: + “average of per-group averages” = `AVG(group_metric)`; “overall/weighted + average” = `SUM(numerator)/SUM(denominator)`. + +**Answer completeness / interpretation** +10. “top / highest / most / lowest” → return only the **winning row(s)** (keep the + top-ranked row via the window result), not the full ranked list, unless a list + is asked for. *(Phrase the mechanism dialect-agnostically — do not name + `QUALIFY`.)* +11. “for each X / per X / by X” → **exactly one row per X**; don’t collapse to a + single value unless the question says “overall” or “total across X”. +12. When a question asks for inputs and a derived value (“X, Y, and their ratio”), + **include the inputs as columns** alongside the derived value. +13. When grouping by a human-readable label (a name), also **expose the entity’s + identifier** — identity, not just the label, is part of the result (and + disambiguates duplicate names). +14. When a result is **unexpectedly empty, relax filters one at a time** to find + which predicate removed the rows. *Why:* this is the validation feedback loop + that turns a silent empty result into a diagnosable one. + +### 3. One worked example (dialect-agnostic) + +Add **exactly one** compact before/after example to the skill, demonstrating the +**window-then-filter** rule (rule 6) — the subtlest and highest-value of the set. +It shows the wrong shape (filter inside, then rank) and the right shape (rank over +the full partition in a CTE, then filter to the top rank in the outer query), +using generic table/column names and standard SQL only (no `QUALIFY`, no +dialect functions). Keep it ~6–10 lines. Do not add a second example; the +existing three tool-orchestration examples stay as the primary example set. 
+*(Superseded by spec 09: the skill now carries a second `sql` worked example — +the multi-hop fan-out case — so the one-example constraint applies to spec 07's +window-then-filter example only.)* + +### 4. Explicit exclusions + +None of the following may appear in the skill (they are application/consumer +concerns, or actively wrong for live data): + +- **Output-shape contracts** (“return a bare result set with exactly these + columns, no prose”). The skill is for interactive analysis and already favors + readable tables + summaries; a caller needing a strict shape specifies that + itself. +- **Anchoring relative time to `MAX(date)` of the data.** On a live database + “recent” / “past N months” means relative to *now*; `MAX(date)` anchoring is + only valid for static snapshots and must not be baked into the product. +- **Any advice justified by a grader, gold answer, or scoring comparator.** +- **Dialect-specific syntax** (deferred to the per-driver follow-up). + +### 5. Coordination with spec 03 + +`03-multi-connection-routing-in-analytics-skill` also edits this same file (it +adds a connection-routing “step 0” to `` and threads `connectionId` +through the tool calls). Spec 07’s additions are **orthogonal**: they live in a +new `` block and in step 5/6 pointers, and must not rewrite the +`` routing or the `` `connectionId` scoping that spec 03 owns. +If both land, the result is one coherent skill: routing in ``/``, +SQL craft in ``. + +## Acceptance criteria + +- The shipped `analytics/SKILL.md` contains all fourteen behaviors above, grouped + under the five sub-headings, each phrased as a heuristic with a generic + rationale. +- **Zero references** to any benchmark, gold answer, grader, or scoring + comparator anywhere in the skill. +- **Dialect-agnostic:** the skill contains no `QUALIFY`, no `strftime`/`julianday`, + no backtick/`DB.SCHEMA.TABLE` FQTN syntax, and no other single-dialect + construct — including in the worked example. 
+- The existing interactive guidance is intact: the `` steps, the + `` (compact tables, summaries, clarification prompt, `connectionId` + scoping), and the three existing examples all still read correctly and were not + removed or contradicted. +- **None of the excluded items** (output-shape contract, `MAX(date)` anchoring of + “recent”, grader-driven advice, dialect syntax) appear. +- Exactly **one** new worked example is present, demonstrating window-then-filter, + in standard dialect-agnostic SQL. *(Superseded by spec 09, which adds a second + `sql` worked example for the multi-hop fan-out case; the shipped skill then + contains two worked examples and the content test asserts two `sql` fences.)* +- The craft is **inline in `SKILL.md`** — no bundled reference file is introduced, + and the skill still installs as a single file through `setup-agents.ts` for all + targets (Claude Code, Codex, Cursor, OpenCode, universal, Claude Desktop zip). +- The skill stays **scannable and within a reasonable size** (comfortably under + the 500-line budget). +- The frontmatter (`name`, `description`) is unchanged and still parses through + `SkillsRegistryService.parseFrontmatter`. + +## Implementation orientation + +Line numbers drift; treat these as anchors, not addresses. The implementer owns +the prose. + +- **The skill file:** `packages/cli/src/skills/analytics/SKILL.md`. Add the + `` block; add one-line pointers in steps 5/6 and a discovery cue in + steps 2/4; add the single worked example. Keep ``/``/`` + otherwise intact. +- **Delivery (why inline is mandatory):** `packages/cli/src/setup-agents.ts` + (`readAnalyticsSkillContent`, `installTarget`, `writeClaudeDesktopSkillBundle`, + `plannedKtxAgentFiles`). Each target gets a single file derived from + `SKILL.md`; Cursor/OpenCode flatten to one rules file; Claude Desktop zips only + `ktx-analytics/SKILL.md`. No change to `setup-agents.ts` is required by this + spec — confirm the skill still installs unchanged. 
+- **Coordination:** `03-multi-connection-routing-in-analytics-skill` edits the + same file; keep the changes non-overlapping (see requirement 5). +- **Tests:** a content assertion over the shipped `analytics/SKILL.md` is the + right level (this is prompt content, not executable logic). Assert the skill + text contains the craft sub-headings / representative rule phrases, contains the + worked example, and contains none of the banned constructs: the literal tokens + `QUALIFY`/`strftime`/`julianday`, grader/benchmark words (`spider`, `benchmark`, + `gold`, `grader`), and — checked as a phrase, not a raw `MAX(` grep, since + `MAX()` is a legitimate aggregate — any instruction anchoring relative time + (“recent”, “past N months”) to the data’s maximum date. The existing + `SkillsRegistryService` frontmatter-parse test must still pass. The standalone + `ktx-dev` binary should be rebuilt/re-linked (`pnpm run build && pnpm run + link:dev`) so the playground picks up the updated skill. + +## Benchmark context (motivation only) + +On the Spider 2.0-Lite sqlite subset the solver produced **0 execution errors but +~50 result mismatches**, and a large share traced to exactly these gaps: +premature rounding, string-vs-number compares, non-deterministic window ordering, +returning full lists for “top” questions, and dropping the inputs to derived +values. These are generic SQL-authoring defects — fixing them in the skill +improves ktx for every user querying a live database, and improving the benchmark +score is a side effect, not the goal. The skill itself must contain no trace of +the benchmark. + +## Implementation notes + +Implemented on branch `write-feature-spec-wiki`. 
+ +**What was built** +- Added a new `` block to `packages/cli/src/skills/analytics/SKILL.md` + (sibling to ``/``, placed just before ``), with the + five sub-headings — *Schema discovery before writing SQL*, *Composition*, + *Window functions*, *Numeric precision*, *Answer completeness / interpretation* — + and a one-line opener framing the bullets as heuristics-with-a-why. +- All fourteen behaviors are represented. Rules 7 and 8 (round-at-the-end / + truncation) are merged into one "Round only at the end" bullet, as the spec + permitted. Each bullet carries a generic analytics-engineering rationale; none + references a benchmark, grader, or gold answer. +- Exactly one worked example (a fenced `sql` block inside ``) + demonstrates the window-then-filter rule, and incidentally the deterministic + tie-breaker: the *wrong* shape filters before the window; the *right* shape + ranks the full partition in a CTE, then filters in the outer query. Standard + SQL only — no `QUALIFY`, no dialect functions. +- Step pointers added without duplicating the rules: a schema-discovery cue in + steps 2 and 4, an authoring pointer in step 5, and a validation pointer in + step 6, each pointing into ``. +- The existing `` / `` / `` (compact tables, + summaries, clarification prompt, `connectionId` scoping, the three + orchestration examples) are unchanged. Delivery is unchanged: still a single + `SKILL.md` per target via `readAnalyticsSkillContent`; no bundled `reference/` + file was introduced. + +**Tests** — added `packages/cli/test/skills/analytics-skill-content.test.ts`, a +content assertion over the source `SKILL.md`: the five sub-headings, a +representative phrase for each behavior, exactly one `sql` worked example, the +preserved interactive guidance, and the absence of banned constructs +(`QUALIFY` / `strftime` / `julianday`, `spider` / `benchmark` / `gold` / +`grader`, a backtick three-part FQTN, and a phrase-level guard against anchoring +relative time to a `MAX(...)` date). 
The existing `setup-agents.test.ts` content +assertions and the `SkillsRegistryService` frontmatter test still pass (77/77 +across the three relevant files). Rebuilt and re-linked `ktx-dev` +(`pnpm run build && pnpm run link:dev`); the craft block is present in the +shipped `dist` asset. + +**Deviations / notes** +- The worked example runs ~18 lines including comments rather than the spec's + "~6–10"; a faithful before/after with a CTE needs the extra lines, and the + skill stays well within budget (~117 lines total). +- `pnpm run type-check` currently reports one **pre-existing, unrelated** error + in `test/mcp-server-factory.test.ts` (MCP server deps typing), committed on + this branch ahead of `origin/main`. The src type-check and `pnpm run build` + are green; this change does not touch any MCP file. +- Per-dialect SQL syntax stays out of scope here (deferred to + `todo/08-per-dialect-sql-syntax-notes.md`), so the skill remains + dialect-agnostic. No dialect-tool pointer was added to `SKILL.md` yet — that + belongs with spec 08's channel so the skill never references a tool that does + not exist. diff --git a/spider2-specs/specs/08-per-dialect-sql-syntax-notes.md b/spider2-specs/specs/08-per-dialect-sql-syntax-notes.md new file mode 100644 index 00000000..d2674c9c --- /dev/null +++ b/spider2-specs/specs/08-per-dialect-sql-syntax-notes.md @@ -0,0 +1,395 @@ +# Per-dialect SQL syntax notes, served on demand and scoped to the connection + +> Refined spec. Intake draft: `todo/08-per-dialect-sql-syntax-notes.md`. Companion +> to `specs/07-analytics-skill-sql-craft.md`, which kept the analytics SQL craft +> dialect-agnostic and explicitly deferred per-dialect syntax to this spec. + +## Problem + +Spec 07 added universal, **dialect-agnostic** SQL-authoring craft to the +`ktx-analytics` skill (`packages/cli/src/skills/analytics/SKILL.md`). 
That craft +deliberately excludes anything that reads correctly on only one engine — no +`QUALIFY`, no `strftime`/`julianday`, no backtick or `DB.SCHEMA.TABLE` FQTNs — +because the flat skill is installed verbatim and an agent querying sqlite must +never see Snowflake syntax. + +But a large share of *real* correctness depends on exactly that excluded, +engine-specific syntax: + +- **Snowflake:** `DATABASE.SCHEMA.TABLE` FQTNs, double-quoted case-sensitive + identifiers (unquoted folds to upper-case), VARIANT colon-paths + (`col:field.sub::type`), `QUALIFY`. +- **BigQuery:** backtick FQTNs (`` `project.dataset.table` ``), `_TABLE_SUFFIX` + for sharded/wildcard tables, `QUALIFY`, `JSON_VALUE`/`JSON_EXTRACT`. +- **sqlite:** `strftime`/`julianday`/`date()` for dates, no `QUALIFY`, + `json_extract`. +- and the remaining supported engines (`postgres`, `mysql`, `clickhouse`, + `sqlserver`/`tsql`), each with its own FQTN, quoting, date, top-N, and + JSON conventions. + +This guidance is genuinely useful to an agent writing SQL against a live +database, but it must **not** pollute the flat dialect-agnostic skill. It belongs +in a **dialect-aware** channel, surfaced only for the dialect the active +connection actually uses, and selected from the project's own configured state — +not guessed, not shown all at once. + +## Generic use case + +Any **ktx** project whose connections span more than one warehouse engine — a +Snowflake warehouse plus a BigQuery export plus a local sqlite extract, say. When +the agent (or a human analyst the agent assists) writes SQL for a given +connection, it should receive *that engine's* syntax conventions — FQTN form, +identifier quoting, date functions, top-N idiom, semi-structured access — and +nothing for the engines it is not querying. The need is independent of any +benchmark: it is what "write correct SQL against this specific warehouse" requires +on every multi-engine stack. 
+ +## Model + +The change adds a **dialect-aware channel** alongside spec 07's flat skill. The +following decisions are committed by this refinement; the implementer owns the +exact prose and code. + +### Delivery: a dynamic MCP tool (decision committed) + +The draft posed two delivery mechanisms and asked the refinement to "weigh them +before committing." This spec commits to **dynamic MCP delivery**: a new +read-only MCP tool returns the syntax notes for a given `connectionId`, with the +dialect resolved server-side from the connection's configured `driver`. The flat +skill gains a one-line pointer to that tool. **No install-mechanism change is +required.** + +The alternative — **multi-file skill delivery** (bundle `reference/.md` +files and point the skill at the matching one) — is **rejected** for **ktx**, for +reasons that hold regardless of how the skill is otherwise authored: + +1. **It cannot scope on two of the six install targets.** Cursor + (`.cursor/rules/ktx-analytics.mdc`) and OpenCode + (`.opencode/commands/ktx-analytics.md`) are physically **single-file**; + `setup-agents.ts` flattens the skill to one file there. A bundled `reference/` + directory degenerates to "concatenate every dialect into one file," so a + sqlite agent would see Snowflake VARIANT syntax — **failing this spec's core + no-leak criterion on those targets**, and defeating progressive disclosure + (everything is in context at once). The MCP tool behaves **identically on all + six targets** because it is a tool call, not an installed file. +2. **Selecting the dialect is a deterministic operation, so it belongs in code, + not model judgment.** Anthropic's skill-authoring guidance explicitly says to + *"prefer scripts [tools] for deterministic operations."* With bundled files the + **model** must infer that connection X is Snowflake and open the right file — + and on a multi-connection project it can open the wrong one. 
With the tool, the + **server** resolves `driver → dialect` from `ktx.yaml` state and returns + exactly the right notes. +3. **It needs a delivery subsystem that the tool does not.** Multi-file delivery + requires reworking `readAnalyticsSkillContent`, `installTarget`, + `plannedKtxAgentFiles`, the install manifest (a directory variant), + `removeKtxAgentInstall`, and `writeClaudeDesktopSkillBundle`, plus a + concatenation transform for the single-file targets. The MCP tool requires one + read-only handler and one skill pointer. +4. **The dependency is free.** The `ktx-analytics` skill already hard-depends on + the **ktx** MCP server — its entire workflow is calling `discover_data`, + `entity_details`, `sql_execution`, and so on. Wherever the server is down, the + skill is already non-functional; the tool adds **no new dependency**. +5. **Dropping Cursor/OpenCode does not change this.** Removing those targets would + make multi-file delivery *possible*, but it would not make it better: reasons + 2–4 stand, and the drop is a disproportionate cost (Cursor is a major target) + to neutralize a constraint the tool handles for free. Whether **ktx** supports + those targets is a separate product decision and is out of scope here. + +This is consistent with Anthropic's progressive-disclosure goal — load the +relevant material on demand, at zero context cost until needed — which the tool +satisfies (its output costs context only when called) while resolving *which* +dialect from state rather than from a model guess. Reference: +[Skill authoring best practices](https://platform.claude.com/docs/en/agents-and-tools/agent-skills/best-practices). + +### Scope derived from state, through the one existing resolver + +Which dialect's notes the agent sees is **derived** from the connection's +configured `driver`, via the resolver the rest of the system already uses — +`sqlAnalysisDialectForDriver(driver)` in +`packages/cli/src/context/sql-analysis/dialect.ts`. 
The same function already +selects the dialect for `sql_execution`, `sl_query`, and the Python SQL-analysis +daemon. This spec **must not** introduce a second driver→dialect map. The notes +are **keyed by the resolved `SqlAnalysisDialect`** (so the SQL Server entry is +keyed `tsql`, not `sqlserver`), tying the note key-space to the resolver's +codomain so the two cannot drift. + +### Authored per-engine notes are sanctioned static content + +Enumerating syntax notes per engine is **not** a rotting denylist of bad +specifics; FQTN form and identifier quoting are genuine, stable invariants of each +engine — the kind of universal fact **ktx**'s design rules explicitly permit as +static content. What must stay derived-from-state is note *selection* (the active +dialect) and note *coverage* (every configured driver must resolve to notes that +exist), both of which this spec ties to the connector registry. + +### The flat skill stays dialect-agnostic (spec 07 invariant preserved) + +This work adds a *separate* channel. It does **not** amend spec 07's `` +block or inline any dialect syntax into `SKILL.md`. Spec 07's acceptance criterion +— no `QUALIFY`/`strftime`/`julianday`/backtick-FQTN/etc. in the flat skill — stays +green. The only `SKILL.md` change is the pointer in requirement 3, which names the +tool and contains no dialect syntax. + +## Requirements + +### 1. A read-only `sql_dialect_notes` MCP tool + +Register a new tool beside the existing context tools +(`packages/cli/src/context/mcp/context-tools.ts`). The tool name is the +implementer's to finalize but should follow the existing snake_case convention +(`entity_details`, `sql_execution`); `sql_dialect_notes` is the suggested name. + +- **Input:** `{ connectionId }`, **required** — matching its siblings + `entity_details`/`sql_execution`, which always take an explicit connection. 
+- **Output:** `{ connectionId, dialect, notes }` where `dialect` is the resolved + `SqlAnalysisDialect` and `notes` is the markdown guidance for that dialect. +- **Resolution:** `connectionId → connection.driver → + sqlAnalysisDialectForDriver(driver) → notes[dialect]`, reusing the existing + resolver. Do not duplicate the driver→dialect map. +- **Guards:** + - A **non-SQL context-source** connection (driver `metabase`, `looker`, + `lookml`, `notion`, `dbt`, `metricflow`) returns a **clear "not a SQL + warehouse connection" error**, not postgres notes. Gate on the existing + `isDatabaseDriver()` (`packages/cli/src/connection-drivers.ts`). + - For any **SQL warehouse** connection the resolver always yields a dialect with + notes (all seven warehouse drivers are covered — requirement 2); its built-in + `postgres` default is a safety floor, so the tool never errors for a SQL + connection and never emits a single-engine dialect (e.g. Snowflake) by + accident. +- **Annotations:** read-only and idempotent, consistent with the other read + tools. +- **Description (docs-grade, third person, states what and when):** e.g. + *"Returns the SQL syntax conventions for a connection's dialect — FQTN form, + identifier quoting and case-folding, date/time functions, top-N idiom, and + semi-structured access. Use before authoring raw SQL against a connection so the + SQL matches that engine."* The description drives the agent's decision to call + the tool, so it must be specific. + +### 2. Per-dialect note content + +Author concise notes for each supported dialect against a **fixed rubric**, so +every dialect answers the same questions. Each facet is a line or two of timeless, +engine-true convention (no version-dated "as of vX" content), phrased as +guidance with the engine reason where it helps — inheriting spec 07's +heuristics-with-a-why tone. The rubric facets: + +1. **FQTN form** — how to fully-qualify a table on this engine. +2. 
**Identifier quoting & case-folding** — quote character and how unquoted + identifiers fold. +3. **Date/time** — the engine's date functions and common date-encoding idioms. +4. **Top-N / window-filtering idiom** — `QUALIFY` where supported; a CTE + + outer-filter form where it is not; `TOP` for `tsql`. +5. **Semi-structured / JSON access** — VARIANT colon-paths, `JSON_VALUE`/ + `JSON_EXTRACT`, `->`/`->>`, `json_extract`, as applicable. +6. **Sharded / partition idiom** where the engine has one (e.g. BigQuery + `_TABLE_SUFFIX`). + +Constraints on the content: + +- **Coverage = the reachable dialect set.** Every driver in the connector registry + must resolve to a dialect that has non-empty notes. The reachable set is + `postgres`, `mysql`, `snowflake`, `bigquery`, `sqlite`, `clickhouse`, and + `tsql` (from `sqlserver`). Do **not** author notes for `duckdb`/`databricks`: + they appear in the resolver map but no connector can produce them, so they are + unreachable — matching the draft's "don't author for nonexistent drivers." +- **Keyed by `SqlAnalysisDialect`** (see Model). +- **Storage is the implementer's choice.** The notes MAY live as per-dialect + markdown files inside the package (e.g. under the skill's directory) served by + the tool, or as a typed map. If files are used they are **package-internal** — + served by the tool, never installed onto an agent target — and already ship via + the recursive `src/skills → dist/skills` copy + (`packages/cli/scripts/copy-runtime-assets.mjs`); no `setup-agents.ts` change. +- **No benchmark, gold-answer, grader, or scoring references** anywhere in the + notes. + +The implementer must verify each engine's specifics against current official +documentation (the well-known anchors above are starting points, not a +substitute for checking the engine's docs). + +### 3. 
The `SKILL.md` pointer (completes spec 07's deferral) + +Add a **single one-line pointer** to the SQL-authoring step (step 4 "Plan" / step +5 "Query") of `packages/cli/src/skills/analytics/SKILL.md`, directing the agent to +call the tool before writing raw SQL against a connection — e.g. *"Before writing +raw `sql_execution` SQL, call `sql_dialect_notes` with the connection's id to get +that engine's syntax conventions."* This is the pointer spec 07 deliberately did +not add because the tool did not yet exist. + +- The pointer **names the tool only**; it contains **no dialect syntax**, so the + flat skill stays dialect-agnostic. +- Follow the skill's existing tool-reference convention. The skill currently names + MCP tools by **bare** name (`discover_data`, `sql_execution`). Anthropic's + guidance recommends **fully-qualified** `ServerName:tool` names to avoid + "tool not found" when multiple MCP servers are present. Whether to fully-qualify + the new pointer (and optionally retrofit the existing bare references) is a + small, separable decision flagged for the maintainer — **not** a rename sweep + this spec mandates. + +### 4. Coverage is enforced from state, not by hand + +A test must **derive** the required coverage from the connector registry rather +than hardcoding a dialect list: enumerate the configured warehouse drivers +(`warehouseDrivers` in `driver-schemas.ts` / `KTX_DATABASE_DRIVER_IDS` in +`connection-drivers.ts`), resolve each through `sqlAnalysisDialectForDriver`, and +assert each result has non-empty notes. Adding a connector later then **fails this +test** until its dialect gets notes — the allowlist-from-state discipline, not a +hand-maintained list. + +### 5. No dialect syntax leaks into the flat skill + +Spec 07's content assertion over `analytics/SKILL.md` stays green: the flat skill +(and its worked example) still contain no `QUALIFY`, `strftime`, `julianday`, +backtick/`DB.SCHEMA.TABLE` FQTN, or other single-engine construct. 
This spec adds +a tool and a tool-pointer; it does not move dialect syntax into the skill. + +### 6. Delivery is unchanged + +`setup-agents.ts` (`readAnalyticsSkillContent`, `installTarget`, +`writeClaudeDesktopSkillBundle`, `plannedKtxAgentFiles`) needs **no change**. The +skill still installs as a single `SKILL.md` per target. Confirm the channel works +on all six targets — Claude Code, Claude Desktop (zip), Codex, universal +`.agents`, Cursor (`.mdc`), OpenCode (`.md`) — by virtue of being a tool call, +including the single-file targets where multi-file delivery could not scope. + +### 7. Coordination with specs 07 and 03 + +- **Spec 07** owns the dialect-agnostic `` block. This spec must not + amend it; it adds the tool, the pointer, and the notes. +- **Spec 03** (`03-multi-connection-routing-in-analytics-skill`) threads + `connectionId` through the skill's tool calls. The `sql_dialect_notes` pointer + is `connectionId`-scoped and fits that routing; keep the pointer consistent with + spec 03's `connectionId` rules and do not rewrite the routing it owns. + +## Acceptance criteria + +- An agent querying a **sqlite** connection gets sqlite date idioms and **never** + sees Snowflake/BigQuery-only syntax; an agent querying **Snowflake** gets + FQTN / identifier / VARIANT guidance. +- The dialect shown is **derived from the connection's configured `driver`** via + the existing `sqlAnalysisDialectForDriver`, not hardcoded per project and not + guessed. No second driver→dialect map is introduced. +- **Every configured warehouse driver** (`postgres`, `mysql`, `snowflake`, + `bigquery`, `sqlite`, `clickhouse`, `sqlserver`) resolves to a dialect with + non-empty notes, and the coverage test derives this from the registry. +- A **non-SQL context-source** connection (e.g. `metabase`, `notion`) yields a + clear "not a SQL warehouse" response, **not** postgres notes. +- `analytics/SKILL.md` remains dialect-agnostic — spec 07's criteria are + unaffected. 
The new pointer references the tool only and adds no dialect syntax. +- The channel installs/serves correctly across **all six** agent targets, + including the single-file Cursor/OpenCode shape, with **no `setup-agents.ts` + change**. +- The notes contain **no** benchmark/gold/grader/scoring references and **no** + time-sensitive ("as of version X") content. + +## Implementation orientation + +Line numbers drift; treat these as anchors, not addresses. The implementer owns +the design. + +- **Dialect resolver (reuse, do not duplicate):** + `packages/cli/src/context/sql-analysis/dialect.ts` — + `sqlAnalysisDialectForDriver(driver)`, returning `SqlAnalysisDialect` + (`./ports.ts`), default `postgres`. +- **Connector registry (drives coverage):** + `packages/cli/src/connection-drivers.ts` (`KTX_DATABASE_DRIVER_IDS`, + `isDatabaseDriver`) and `packages/cli/src/context/project/driver-schemas.ts` + (`warehouseDrivers`, the per-driver `connectionConfigSchema`). +- **MCP tool registration:** `packages/cli/src/context/mcp/context-tools.ts` + (register beside `connection_list`, `entity_details`, `sql_execution`); the + `connectionId → driver → dialect` resolution already exists for `sql_execution` + in `packages/cli/src/context/mcp/local-project-ports.ts` — route the new tool + through the same path. +- **The skill (one-line pointer only):** + `packages/cli/src/skills/analytics/SKILL.md` — add the tool pointer in step 4/5; + leave ``/``/``/`` otherwise intact. +- **Note storage (if files):** under the skill directory, shipped by + `packages/cli/scripts/copy-runtime-assets.mjs`'s recursive copy; served by the + tool, never installed. +- **Delivery (confirm unchanged):** `packages/cli/src/setup-agents.ts`. 
+- **Tests:** unit tests for resolution (including `sqlserver → tsql`, unknown → + `postgres`, and non-warehouse rejection); a registry-derived coverage test + (requirement 4); a content test that each dialect's notes cover the rubric + facets and contain no banned tokens; and an extension of spec 07's + `analytics/SKILL.md` content test asserting the new pointer is present and the + flat skill is still dialect-clean. Rebuild and re-link the dev binary so the + playground picks up the change: `pnpm run build && pnpm run link:dev`. + +## Benchmark context (motivation only) + +The Spider 2.0-Lite v9 harnesses' only per-dialect content was Snowflake +(`DB.SCHEMA.TABLE` FQTNs, double-quoted lower-case columns, VARIANT colon-paths), +BigQuery (backtick FQTNs, `_TABLE_SUFFIX` for sharded tables), and sqlite +(`strftime`/`julianday`). That content is real and useful but engine-specific; +spec 07 kept it out of the flat skill and deferred it here so the dialect-agnostic +rules stay clean. Delivering it through a dialect-scoped **ktx** tool generalizes +the same correctness benefit to every multi-engine **ktx** project — improving the +benchmark score is a side effect, not the goal, and the shipped skill contains no +trace of the benchmark. + +## Implementation notes + +Implemented on branch `write-feature-spec-wiki`, alongside spec 07. The committed +decision (dynamic MCP delivery, not multi-file skill bundling) was implemented as +specified — no `setup-agents.ts` change. + +**What was built** +- Per-dialect notes are markdown files under + `packages/cli/src/context/sql-analysis/dialects/.md` (one each for + `postgres`, `mysql`, `snowflake`, `bigquery`, `sqlite`, `clickhouse`, `tsql`), + served by `sqlDialectNotes(dialect)` in `sql-analysis/dialect-notes.ts` (lazy + read + cache, `postgres` fallback floor; the authored set is the + `DIALECTS_WITH_NOTES` const). `duckdb`/`databricks` are intentionally unauthored + (unreachable from any connector). 
Each note answers the fixed rubric — FQTN, + identifier quoting/case-folding, date/time, top-N/window idiom, + JSON/semi-structured, plus a sharded-table line for BigQuery. Engine specifics + were verified against current docs via Context7 (Snowflake VARIANT colon-paths + and unquoted→UPPER case-folding; BigQuery `_TABLE_SUFFIX`, `QUALIFY`, + `JSON_VALUE`; ClickHouse `LIMIT n BY` and `JSONExtract*`, with no `QUALIFY`). The + files are package-internal — `copy-runtime-assets.mjs` ships them to `dist`; they + are never installed onto an agent target. +- New read-only MCP tool `sql_dialect_notes` (`context-tools.ts`): input + `{ connectionId }` (required), output `{ connectionId, dialect, notes }`, read-only + + idempotent annotations. It resolves through the **existing** + `connectionId → connection.driver → sqlAnalysisDialectForDriver` path (no second + driver→dialect map), implemented as the unconditional `dialectNotes` port in + `local-project-ports.ts` via an extracted `resolveDialectNotesForConnection`. A + non-SQL context source (gated by `isDatabaseDriver`) throws `KtxExpectedError` + ("not a SQL warehouse"), not postgres notes — so the expected agent mistake stays + out of Error Tracking. +- `connection-drivers.ts`: `KTX_DATABASE_DRIVER_IDS` is now an exported (`@internal`) + readonly tuple so the coverage test derives required coverage from the registry; + `isDatabaseDriver` behavior is unchanged. +- `skills/analytics/SKILL.md`: a single dialect-agnostic pointer in step 5 ("call + `sql_dialect_notes` … to get that engine's FQTN, identifier-quoting, date, top-N, + and JSON conventions"). It names the tool only; spec 07's `` block and + its dialect-clean content test are untouched. 
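The resolution path and its non-SQL guard described in the bullets above can be sketched as follows. This is a minimal illustration, not the shipped code: every `sketch*` / `Sketch*` name is a hypothetical stand-in defined locally, because the real signatures of `sqlAnalysisDialectForDriver`, `isDatabaseDriver`, and `KtxExpectedError` are not shown in this spec, and the note bodies are placeholders.

```typescript
// Hypothetical stand-ins for illustration only. The real helpers live in the
// ktx package; their exact signatures are an assumption here.
type SketchDialect =
  | "postgres" | "mysql" | "snowflake" | "bigquery"
  | "sqlite" | "clickhouse" | "tsql";

// Assumed warehouse-driver registry, copied from the driver list in this spec.
const SKETCH_DATABASE_DRIVERS: readonly string[] = [
  "postgres", "mysql", "snowflake", "bigquery", "sqlite", "clickhouse", "sqlserver",
];

// Placeholder bodies; the real notes are per-dialect markdown files.
const SKETCH_NOTES: Record<SketchDialect, string> = {
  postgres: "(postgres notes)",
  mysql: "(mysql notes)",
  snowflake: "(snowflake notes)",
  bigquery: "(bigquery notes)",
  sqlite: "(sqlite notes)",
  clickhouse: "(clickhouse notes)",
  tsql: "(tsql notes)",
};

class SketchExpectedError extends Error {}

interface SketchConnection {
  id: string;
  driver: string;
}

function sketchIsDatabaseDriver(driver: string): boolean {
  return SKETCH_DATABASE_DRIVERS.indexOf(driver) !== -1;
}

// The single driver -> dialect map, with a postgres floor for unknown drivers.
function sketchDialectForDriver(driver: string): SketchDialect {
  if (driver === "sqlserver") return "tsql";
  return Object.prototype.hasOwnProperty.call(SKETCH_NOTES, driver)
    ? (driver as SketchDialect)
    : "postgres";
}

// connectionId -> driver -> dialect -> notes, rejecting non-SQL sources first.
function sketchResolveDialectNotes(
  connections: SketchConnection[],
  connectionId: string,
): { connectionId: string; dialect: SketchDialect; notes: string } {
  const connection = connections.filter((c) => c.id === connectionId)[0];
  if (!connection) {
    throw new SketchExpectedError(`unknown connection: ${connectionId}`);
  }
  if (!sketchIsDatabaseDriver(connection.driver)) {
    throw new SketchExpectedError(
      `${connectionId} is not a SQL warehouse connection`,
    );
  }
  const dialect = sketchDialectForDriver(connection.driver);
  return { connectionId, dialect, notes: SKETCH_NOTES[dialect] };
}

// Registry-derived coverage: list registered drivers whose dialect has no notes.
function sketchDriversMissingNotes(): string[] {
  return SKETCH_DATABASE_DRIVERS.filter(
    (driver) => SKETCH_NOTES[sketchDialectForDriver(driver)].trim().length === 0,
  );
}
```

Because the coverage check walks the registry instead of a hand-written dialect list, adding a driver to the registry without a matching note is what makes it report a gap.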
+ +**Tests** +- `test/context/mcp/dialect-notes.test.ts`: registry-derived coverage (a future + connector fails the test until its dialect has notes), the full rubric per dialect, + leak isolation (sqlite shows `strftime` and never `VARIANT`/`_TABLE_SUFFIX`; + `QUALIFY` only on snowflake/bigquery; engine-exclusive markers stay put), no + benchmark/grader or version-dated content, the postgres fallback, and + `resolveDialectNotesForConnection` resolving sqlite / snowflake / `sqlserver→tsql` + and rejecting a non-SQL source / unknown connection with `KtxExpectedError`; plus a + guard that the `DIALECTS_WITH_NOTES` const and the `dialects/*.md` files stay in sync. +- `test/context/mcp/server.test.ts`: `sql_dialect_notes` added to the retained tool + set + annotations assertion + a handler-routing test, and the regenerated + `__snapshots__/mcp-tools-list.json`. +- `test/skills/analytics-skill-content.test.ts`: asserts the new pointer is present + and the flat skill stays dialect-clean. + +**Verification** — `tsc -p tsconfig.json` (src) clean; full default suite 393 files / +3001 passing; slow suite green (incl. `local-project-ports.test.ts`); all three +`dead-code` checks clean; the `dialects/*.md` files copy into `dist`. Rebuilt and +re-linked `ktx-dev`. + +**Deviations / notes** +- Notes are stored as per-dialect markdown files (not a typed map, and not bundled + `reference/*.md` skill files) — all sanctioned by the spec; plain markdown is the + most maintainable to edit. They are served by the tool and ship via a + `copy-runtime-assets.mjs` entry (`src/context/sql-analysis/dialects → dist/…`); no + `setup-agents.ts` change. +- `pnpm run type-check` still reports one pre-existing, unrelated error in + `test/mcp-server-factory.test.ts` (committed in-flight MCP work on this branch); + this change adds zero new type errors and does not touch that file. 
diff --git a/spider2-specs/specs/09-fan-out-safe-multi-hop-aggregation.md b/spider2-specs/specs/09-fan-out-safe-multi-hop-aggregation.md new file mode 100644 index 00000000..5c75150b --- /dev/null +++ b/spider2-specs/specs/09-fan-out-safe-multi-hop-aggregation.md @@ -0,0 +1,362 @@ +# Strengthen fan-out-join safety for multi-hop aggregation in the analytics skill + +> Refined spec. Intake draft: `todo/09-fan-out-safe-multi-hop-aggregation.md`. +> Extends spec 07 (`specs/07-analytics-skill-sql-craft.md`), which shipped the +> `` block. Additive, content-only. + +## Problem + +The shipped `ktx-analytics` skill +(`packages/cli/src/skills/analytics/SKILL.md`) already carries a single-hop +fan-out rule in `` → **Composition**: + +> **Avoid fan-out joins.** Add columns only from tables already at the target +> grain, or pre-aggregate to that grain before joining. A join that multiplies +> rows quietly inflates every downstream `SUM`/`COUNT`. + +In practice the agent honors that on a single join but still **silently +fans out on multi-hop join chains**, where the inflation is one or two joins +removed from the aggregate and therefore much harder to notice. + +The failure shape: a measure that lives at a *coarse* grain (one row per parent +record) is counted/summed *after* the parent has been joined down to a *finer* +grain (one row per child line). Every parent-level value is then duplicated by +its child fan-out, so `COUNT(*)` / `SUM(amount)` over-counts by a data-dependent +amount — runnable SQL, plausible-looking number, quietly wrong. + +The rule today is stated only as a **prohibition** ("Avoid…"). It needs two +upgrades: (a) generalize it so the danger is understood as *cumulative across a +whole join chain*, not a single join; and (b) pair it with an **affirmative +verification habit** the agent runs while composing, so a grain change is +detected and fixed rather than merely warned against. 
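To make the failure shape concrete, here is a small in-memory sketch of a parent table joined down to its child lines. The rows and the `demo*` names are invented for illustration and come from no real dataset; the nested loop stands in for a one-to-many SQL join.

```typescript
// Invented rows for illustration only.
interface DemoOrder {
  orderId: number;
  amount: number; // measure that lives at the order (coarse) grain
}
interface DemoOrderLine {
  orderId: number;
  sku: string;
}

const demoOrders: DemoOrder[] = [
  { orderId: 1, amount: 100 },
  { orderId: 2, amount: 50 },
];
const demoOrderLines: DemoOrderLine[] = [
  { orderId: 1, sku: "a" },
  { orderId: 1, sku: "b" },
  { orderId: 1, sku: "c" },
  { orderId: 2, sku: "d" },
];

// Stand-in for: orders JOIN order_lines ON order_id (one-to-many).
const demoJoined: DemoOrder[] = [];
for (const order of demoOrders) {
  for (const line of demoOrderLines) {
    if (line.orderId === order.orderId) demoJoined.push(order);
  }
}

const demoHonestCount = demoOrders.length; // 2
const demoInflatedCount = demoJoined.length; // 4
const demoHonestSum = demoOrders.reduce((t, o) => t + o.amount, 0); // 150
const demoInflatedSum = demoJoined.reduce((t, o) => t + o.amount, 0); // 350
```

The over-count depends entirely on how many lines each order happens to have, which is why the inflated number looks plausible rather than obviously broken.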
+ +## Generic use case (independent of any benchmark) + +An analyst on any production warehouse asks a counting/summing question whose +path runs through several one-to-many hops — e.g. *"how many orders per region +contain a returned item?"* where the path is `region → store → order → +order_line`. The honest answer counts each order once. The naïve join chain joins +`order_line` (to apply the line-level condition) and then counts orders, so an +order with three returned lines is counted three times. The inflation happens +**three joins below the `COUNT`**, where it is easy to miss. This is one of the +most common silently-wrong analytics mistakes on normalized schemas — not +specific to any dataset, dialect, or benchmark. + +## Model (invariants — the implementer owns the prose) + +These constrain the change; the exact wording is the implementer's. Each is +grounded in Anthropic's skill-authoring and prompt-engineering guidance so the +addition stays consistent with how spec 07 was written. + +### Additive, inline-only, dialect-agnostic (inherited from spec 07) + +The change is **additive content inside `skills/analytics/SKILL.md`** only — no +bundled `reference/*.md` file (the delivery path ships a single `SKILL.md` per +target; see spec 07 §Model "Inline-only delivery"). No new tool, flag, or config. +Every addition must read correctly on any dialect: **no** `QUALIFY`, +`strftime`/`julianday`, backtick/`DB.SCHEMA.TABLE` FQTNs, or other single-dialect +construct — including in the worked example. The existing ``, ``, +``, and the other four `` sub-headings are preserved +unchanged. + +### Heuristic-plus-*why*, because SQL authoring is a high-freedom task + +Anthropic's "set appropriate degrees of freedom" guidance classifies tasks with +many valid approaches where decisions depend on context as **high freedom → +text-based heuristics**, the "open field, many paths" case (versus low-freedom, +fragile operations that need an exact script). 
SQL authoring is squarely +high-freedom. So the new content is phrased as **heuristics with a one-line, +universal rationale**, never as bare `ALWAYS`/`NEVER` imperatives — matching the +existing `` style and Anthropic's "add context / explain why so Claude +generalizes" principle. + +### Affirmative framing for the verification step (do, not don't) + +Anthropic's prompt-engineering guidance is explicit: **"Tell Claude what to do +instead of what not to do."** The draft's requirement for "a detect-and-fix +*habit*, not just a prohibition" is the same principle. Therefore: + +- The **generalized rule keeps the established `Avoid fan-out joins` lead and the + term `fan-out`** — it is spec 07's consistent terminology and the existing + content test references that phrase; reframing it would churn shared vocabulary + for no gain. +- The **new verification step is phrased affirmatively** (e.g. *"Verify the grain + holds across each join"*) — an action the agent performs while composing, not a + warning. The two together satisfy both principles: a recognized anti-pattern + name *and* a positive habit. + +### One default with an escape hatch, not two equal options + +Anthropic: **"Avoid offering too many options… provide a default with an escape +hatch."** The fix for an inflated aggregate is presented as exactly that: + +- **Default: pre-aggregate the measure to its own grain in a CTE, then join the + already-aggregated result.** This is the single-hop fix generalized, and it is + the *only* correct fix for `SUM`/`AVG` — you cannot de-duplicate a summed + measure with `DISTINCT` (two legitimately-equal amounts would collapse). +- **Escape hatch: `COUNT(DISTINCT key)` — for a pure count only.** It rescues an + inflated count in one line, but must be stated as count-only, not as a general + remedy. + +This is the deepest correctness point in the spec and the easiest to get wrong; a +naïve blanket "just use `COUNT(DISTINCT)`" is silently wrong for sums. 
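The default-versus-escape-hatch distinction can be checked on a toy case where two orders legitimately share the same amount. The sketch below is an in-memory illustration with invented rows; the `fx*` helpers are hypothetical and simply mimic `SUM`, `DISTINCT`, and a pre-aggregating CTE.

```typescript
// Invented rows for illustration only.
interface FxOrder {
  orderId: number;
  amount: number;
}

const fxOrders: FxOrder[] = [
  { orderId: 1, amount: 100 },
  { orderId: 2, amount: 100 }, // legitimately equal to order 1's amount
];
// One entry per order line: order 1 has two lines, order 2 has one.
const fxLineOrderIds: number[] = [1, 1, 2];

function fxSum(values: number[]): number {
  return values.reduce((total, v) => total + v, 0);
}
function fxDistinct(values: number[]): number[] {
  return values.filter((v, i) => values.indexOf(v) === i);
}

// Fan-out join: one joined row per order line.
const fxJoined: FxOrder[] = [];
for (const lineOrderId of fxLineOrderIds) {
  for (const order of fxOrders) {
    if (order.orderId === lineOrderId) fxJoined.push(order);
  }
}

// SUM(amount) over the fanned-out rows: inflated.
const fxNaiveSum = fxSum(fxJoined.map((r) => r.amount)); // 300
// SUM(DISTINCT amount): collapses the two equal amounts into one.
const fxDistinctSum = fxSum(fxDistinct(fxJoined.map((r) => r.amount))); // 100
// Default fix: collapse lines to one row per order first, then join and sum.
const fxOrderIdsWithLines = fxDistinct(fxLineOrderIds);
const fxPreAggregatedSum = fxSum(
  fxOrders
    .filter((o) => fxOrderIdsWithLines.indexOf(o.orderId) !== -1)
    .map((o) => o.amount),
); // 200
// Escape hatch: COUNT(DISTINCT order_id) is safe for a pure count.
const fxDistinctCount = fxDistinct(fxJoined.map((r) => r.orderId)).length; // 2
```

Only the pre-aggregated path returns the honest sum of 200; the distinct count is right because keys are unique per order, while amounts are not.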
+ +### Consistent terminology + +Anthropic: **"Choose one term and use it throughout."** Reuse spec 07's existing +vocabulary verbatim — **`grain`**, **`fan-out`**, **`pre-aggregate`** — do not +introduce synonyms (e.g. do not rename the concept "row blow-up" or +"multiplication factor"). Prose may vary, but the named concepts stay fixed. + +### Concise — the addition must justify its token cost + +Anthropic: **"Concise is key… does this paragraph justify its token cost?"** and +"Claude is already very smart." The agent knows what a join and a `GROUP BY` are; +the addition explains only the non-obvious trap (cumulative grain inflation) and +shows the fix. Net addition is roughly one rewritten bullet, one new bullet, and +one worked example — the skill stays comfortably under the 500-line budget +(~117 lines today). + +### Examples over descriptions — exactly one + +Anthropic's "examples pattern": **"Examples help Claude understand the desired +style and level of detail more clearly than descriptions alone"** and +"examples are concrete, not abstract." The multishot guidance favors 3–5 examples +in general, but here **conciseness and spec 07's one-example-per-rule economy +win**: the skill already carries the window-then-filter example, so this adds +**exactly one** compact wrong-vs-right example. The wrong/right contrast inside +that single example supplies the diversity multishot calls for, at one example's +token cost. + +### Leak-safety (hard constraint) + +The worked example must be a **synthetic, generic schema invented for teaching** — +not the tables, column names, query, or numeric results of any Spider 2.0-Lite +question. It demonstrates the *pattern* (a coarse-grain measure aggregated after a +one-to-many join), which is universal and reconstructable from first principles. A +reviewer must find nothing in it that ties it to a specific benchmark instance. +See "Leak-safety" below. 
+ +## Requirements + +Requirements 1 through 4 all land in the **Composition** sub-heading of `` in +`packages/cli/src/skills/analytics/SKILL.md`. Structure (chosen design): rewrite +the existing fan-out bullet, add one affirmative verification bullet, add one +worked example. Do not touch the other four sub-headings or ``/``/ +``. + +### 1. Generalize the fan-out rule to multi-hop chains + +Rewrite the existing **`Avoid fan-out joins.`** bullet so it makes explicit that +the danger is **cumulative**: *any* one-to-many hop on the path between a measure's +owning table and the aggregate inflates that measure, **even when the offending +join is several hops away from the `SUM`/`COUNT`**. The fix is the same as the +single-hop case — **pre-aggregate the measure to its own grain in a CTE, then join +the already-aggregated result** — but the agent must apply it **per +measure-owning table along the whole chain**, not just at the final join. Keep the +`fan-out` term and the one-line *why*. + +### 2. Add an affirmative grain-verification habit + +Add a companion bullet, phrased as an action the agent performs **while +composing** (not a prohibition): + +- Confirm that a join intended to be one-to-one / many-to-one **did not change the + grain** it aggregates at — e.g. check that the row count (or the count of the + aggregate's key) is unchanged across that join. +- When a join is genuinely one-to-many, **reach for the default fix + (pre-aggregate to grain)**; for a **pure count**, `COUNT(DISTINCT key)` is an + acceptable escape hatch. +- State the caveat once: **`SUM`/`AVG` of a fanned-out measure must pre-aggregate** + — `DISTINCT` cannot de-duplicate a sum. + +This is spec 07's "build incrementally and check each layer" discipline pointed +specifically at grain preservation, in affirmative form. + +### 3. 
One concrete, generic multi-hop worked example + +Add **exactly one** compact wrong-vs-right `sql` example inside `` +demonstrating the multi-hop inflation and the pre-aggregate fix. 
It is the +**second** `sql` fence in the skill (the first is spec 07's window-then-filter +example). + +**Required properties** (these are the constraints; the SQL below is orientation): + +- **Multi-hop chain** where the inflating one-to-many hop is **≥1 join removed** + from the aggregate (not the single-hop case spec 07 already covers). +- **Unambiguous attribution**: each counted entity maps to **exactly one** group, + so the honest answer is well-defined. (This rules out "coarse measure attributed + to a fine dimension reached by descending," where one entity spans several + groups and the correct number is itself ambiguous — that would teach a murky + pattern.) +- **Motivated descent**: the finer-grain table is joined for a real reason (a + line-level filter or a needed line-level value), so the reader sees *why* the + fan-out join is there. +- **Plain `COUNT`/`SUM`**, not `AVG` — averaging collides with the existing + *Macro vs micro average* bullet and would muddy the fan-out lesson. +- The **RIGHT side demonstrates the default fix** (pre-aggregate to grain in a + CTE) and is **actually correct**, not merely runnable — its number must equal the + honest answer, not just avoid an error. +- Generic invented schema, standard dialect-agnostic SQL (no `QUALIFY`, no dialect + functions), no benchmark identifiers or values. + +**Recommended sketch** (implementer may adjust within the properties above): + +```sql +-- "How many orders per region contain a returned item?" +-- WRONG: joining order_lines to apply the line-level filter multiplies orders — +-- an order with two returned lines is counted twice, three joins below the COUNT. +SELECT r.region_id, COUNT(*) AS n_orders +FROM regions r +JOIN stores s ON s.region_id = r.region_id +JOIN orders o ON o.store_id = s.store_id +JOIN order_lines l ON l.order_id = o.order_id +WHERE l.status = 'returned' +GROUP BY r.region_id; + +-- RIGHT: collapse order_lines to one row per qualifying order first, then join up. 
+WITH returned_orders AS ( + SELECT order_id FROM order_lines WHERE status = 'returned' GROUP BY order_id +) +SELECT r.region_id, COUNT(*) AS n_orders +FROM regions r +JOIN stores s ON s.region_id = r.region_id +JOIN orders o ON o.store_id = s.store_id +JOIN returned_orders ro ON ro.order_id = o.order_id +GROUP BY r.region_id; +-- A pure count could also use COUNT(DISTINCT o.order_id); a SUM/AVG of an +-- order-level measure fanned out this way must pre-aggregate — DISTINCT can't +-- de-duplicate a sum. +``` + +### 4. Placement and structure + +- Both bullets live under the existing **Composition** sub-heading; the example + follows them. The five-sub-heading structure spec 07 established is unchanged. +- **State each rule once** (Anthropic "consistent terminology / don't repeat"): + do not also restate the multi-hop rule in `` steps 5/6 — those already + carry a one-line pointer into ``, which is sufficient. + +### 5. Coordination with spec 07 (supersession) + +Spec 07's requirement 3 and acceptance criteria say the skill contains **exactly +one** worked example and "Do not add a second example." **This spec supersedes +that constraint**: the skill now carries **two** `sql` worked examples +(window-then-filter from spec 07, plus this multi-hop fan-out example). Annotate +spec 07 at those two spots with a one-line "superseded by spec 09" note so the two +permanent specs do not contradict. No other spec 07 content changes. + +## Leak-safety (hard constraint on this spec and its example) + +The benchmark's gold answers must never appear in ktx. The worked example must be +a **synthetic, generic schema invented for teaching** — not the tables, column +names, query, or numeric results of any Spider 2.0-Lite question. The example +demonstrates the *pattern* (a coarse-grain measure counted after a one-to-many +join), which is universal; it must be reconstructable from first principles by +anyone, with zero reference to benchmark data. 
A reviewer should be able to read +the example and find nothing that ties it to a specific benchmark instance. + +## Acceptance criteria + +- The `` **Composition** section states the **multi-hop generalization** + of the fan-out rule (cumulative danger across the chain; pre-aggregate per + measure-owning table) and an **affirmative grain-verification habit**, inline and + dialect-agnostic. +- The fix is presented as **default (pre-aggregate to grain) + escape hatch + (`COUNT(DISTINCT key)`, count-only)**, with the explicit caveat that `SUM`/`AVG` + of a fanned-out measure must pre-aggregate. +- Exactly **one** new, **generic** worked example (wrong vs. pre-aggregated-right) + using an invented schema, with no benchmark-derived identifiers or values, whose + RIGHT side is actually correct (unambiguous attribution; honest number). +- The skill now contains **two** `sql` worked examples total; the existing content + test's fence-count assertion is updated `1 → 2` and new assertions cover the + multi-hop rule phrase and the grain-verification-habit phrase. +- Terminology is consistent with spec 07 (`grain`, `fan-out`, `pre-aggregate`); no + synonyms introduced. +- **No new tool, flag, or config.** Skill-content only; additive to spec 07. +- All spec 07 invariants still hold: the skill remains dialect-agnostic (no + `QUALIFY`/`strftime`/`julianday`, no backtick three-part FQTN, no relative-time + anchoring to a `MAX(...)` date) and free of any benchmark/grader/gold reference, + including in the new example; ``/``/`` and the other + four sub-headings are intact; frontmatter still parses through + `SkillsRegistryService.parseFrontmatter`; the skill stays under 500 lines. +- Spec 07's "exactly one example" constraint is annotated as superseded (no + contradiction between the two permanent specs). + +## Implementation orientation + +Line numbers drift; treat these as anchors, not addresses. The implementer owns +the prose. 
+ +- **The skill file:** `packages/cli/src/skills/analytics/SKILL.md` → + `` → **Composition**. Rewrite the `Avoid fan-out joins` bullet, add + the affirmative grain-verification bullet, add the one worked example after them. + Leave the other four sub-headings, ``, ``, and `` + unchanged. +- **Tests:** `packages/cli/test/skills/analytics-skill-content.test.ts`. Update the + "ships exactly one … worked example" test: `match(/```sql/g)` length `1 → 2`, + add an assertion for the new fan-out example's distinctive tokens (e.g. + `WITH returned_orders AS`), add the multi-hop-rule and grain-verification-habit + phrases to the behavior-presence list, and keep all banned-construct and + size-budget guards. This is a content assertion over the source `SKILL.md` — the + right level for prompt content. +- **Spec 07 annotation:** add a one-line "superseded by spec 09" note at spec 07's + requirement 3 and at its "Exactly one new worked example" acceptance bullet. +- **Rebuild/re-link** the dev binary so the playground picks up the change: + `pnpm run build && pnpm run link:dev` (provides `ktx-dev`). + +## Benchmark context (motivation only) + +Multi-hop aggregation questions (counting/averaging a coarse-grained measure +reached through several one-to-many joins) are a recurring source of +result-mismatch failures in the SQLite subset: the agent produces runnable SQL +with the right tables but a fan-out-inflated number. These are correctness +failures, not knowledge or schema-discovery failures (zero execution errors in the +latest run), so the fix belongs in the product's authoring craft — where it also +helps any real analyst — not in a benchmark-specific prompt. The skill itself must +contain no trace of the benchmark. + +## Implementation notes + +Shipped as specified — additive, content-only, no new tool/flag/config. 
+ +- **`packages/cli/src/skills/analytics/SKILL.md`** → `` → **Composition**: + - Rewrote the `Avoid fan-out joins` bullet to `**Avoid fan-out joins — the + danger is cumulative.**`, generalizing to multi-hop chains: any one-to-many + hop between a measure's owning table and the aggregate inflates that measure + even when several hops below the `SUM`/`COUNT`; fix is pre-aggregate per + measure-owning table along the whole chain. Kept the `fan-out` term and the + one-line *why*. + - Added the affirmative `**Verify the grain holds across each join.**` bullet: + confirm a one-to-one / many-to-one join did not change the grain (row/key + count unchanged); default fix is pre-aggregate to grain, escape hatch is + `COUNT(DISTINCT key)` for a pure count only; stated once that `SUM`/`AVG` of a + fanned-out measure must pre-aggregate because `DISTINCT` cannot de-duplicate a + sum. + - Added one generic wrong-vs-right worked example (orders→regions via + stores/order_lines, `WITH returned_orders AS …`) — the second `sql` fence in + the skill. The inflating hop is three joins below the `COUNT`; the RIGHT side + pre-aggregates `order_lines` to one row per qualifying order so each order is + counted once (honest answer), and the trailing comment names the count-only + `COUNT(DISTINCT o.order_id)` escape hatch plus the `SUM`/`AVG` caveat. Invented + schema, dialect-agnostic SQL, no benchmark identifiers/values. + - The other four sub-headings and ``/``/`` are + untouched. Skill is 147 lines (well under the 500-line budget). +- **`packages/cli/test/skills/analytics-skill-content.test.ts`**: sql-fence count + `1 → 2`; added the multi-hop phrase (`the danger is cumulative`) and the + grain-verification phrase (`Verify the grain holds across each join`) to the + behavior-presence list; added new-example token assertions + (`WITH returned_orders AS`, `COUNT(DISTINCT o.order_id)`). All banned-construct, + relative-time, and size-budget guards retained. Test file passes (9/9). 
+- **Spec 07** annotated as superseded at requirement 3 and at its "exactly one + worked example" acceptance bullet — no contradiction between the two permanent + specs. + +**Verification:** `vitest run test/skills/analytics-skill-content.test.ts` → 9 +passed. `pnpm run build` (src `tsc -p tsconfig.json`) succeeds and the built +`dist/skills/analytics/SKILL.md` carries the new content; `pnpm run link:dev` +re-linked `ktx-dev`. A pre-existing, unrelated type error in +`test/mcp-server-factory.test.ts` (`KtxMcpContextPorts`/`context_tool`, last +touched in commit `2677b3ef`) surfaces under the full `type-check`'s +`tsconfig.test.json` pass; it is outside this change's surface and not introduced +here. diff --git a/spider2-specs/specs/10-panel-completeness-spine.md b/spider2-specs/specs/10-panel-completeness-spine.md new file mode 100644 index 00000000..983f01b1 --- /dev/null +++ b/spider2-specs/specs/10-panel-completeness-spine.md @@ -0,0 +1,289 @@ +# Panel/period completeness — emit the full set of groups, not only the populated ones + +> Refined spec. Intake draft: `todo/10-panel-completeness-spine.md`. + +## Problem + +When a question asks for a result *per period* or *per category* ("orders for +each month of 2023", "revenue by region", "count per status"), a plain `GROUP BY` +only returns groups that actually have rows. Periods or categories with **zero** +activity silently vanish, so a "12 months" answer comes back with 9 rows and the +three that should read `0` are simply absent. The SQL is runnable and the +aggregate is right, but the **panel is incomplete** — and a monthly report with +missing months or a category breakdown missing its empty categories is wrong for +any analyst, on any database. + +The existing `` "Answer completeness / interpretation" group already +carries a *"For each X / per X / by X returns exactly one row per X"* rule, but +that rule only governs **grain** (don't collapse to a single value). 
It says +nothing about the **domain**: "one row per X" today means one row per *observed* +X, so empty groups still drop. This spec sharpens that rule from grain-only to +grain-and-completeness. + +## Generic use case (independent of any benchmark) + +"How many orders were placed in each month of 2023?" must return **12 rows** even +if March had no orders (March = 0), not 11. "Sales per region" should include +regions with no sales when the question asks for *each* region. Both are +bread-and-butter reporting for any analyst on any warehouse, with no benchmark in +sight. + +## Model + +The feature splits across **two surfaces**, each holding the half it is suited +for. This split is the central design decision and exists to satisfy spec 07's +hard dialect-agnostic invariant without weakening it. + +### Why two surfaces (the dialect-agnostic reconciliation) + +The draft asked for a *"recursive-CTE date spine"* worked example. But a real +date/number series is **inherently dialect-specific** — Postgres `generate_series`, +SQLite recursive `date(d,'+1 month')`, BigQuery `GENERATE_DATE_ARRAY`, Snowflake +`GENERATOR`+`DATEADD` — and spec 07 made `` strictly dialect-agnostic +(the analytics-skill content test bans single-dialect constructs). Inlining a date +spine would violate that invariant; carving out a test exception would erode it. + +ktx already has the canonical home for engine-specific syntax: the per-dialect +notes in `packages/cli/src/context/sql-analysis/dialects/.md`, served by +the `sql_dialect_notes` MCP tool (spec 08). Those files answer a fixed rubric +(FQTN / Identifiers / Date-time / Top-N / JSON) — but **series/spine generation is +not in that rubric yet**. So the date-spine syntax belongs *there*, alongside the +other per-dialect idioms, and the dialect-agnostic skill points to it. This +routes the dialect-specific half through the existing channel rather than +standing up a parallel dialect-specific recipe inside the skill. 
+ +Surface 1 (skill) carries the **pattern**; surface 2 (dialect notes) carries the +**concrete series syntax**. + +### Additive, inline, heuristic-with-a-why + +Consistent with spec 07: the skill change is **additive content in one Markdown +file** (`skills/analytics/SKILL.md`), inline (no bundled `reference/` file — the +delivery mechanism in `setup-agents.ts` ships only `SKILL.md`), dialect-agnostic, +and phrased as a **heuristic with a one-line generic rationale**, not a wall of +MUSTs. The dialect-notes change is additive content in the seven existing +`dialects/*.md` files. No new tool, flag, or config on either surface. + +## Requirements + +### 1. Skill surface — `` "Answer completeness / interpretation" + +Add the panel-completeness rule to the existing group (it extends, and should sit +adjacent to, the *"For each X / per X / by X"* bullet). It must cover: + +1. **Recognize the full-panel cue.** *each / every / all / per / for all + / by month* signals that the answer's row set should be the + **complete expected domain** of periods or categories in scope, not just those + present in the filtered fact rows. *Why:* a plain inner `GROUP BY` can only emit + groups that have at least one fact row. + +2. **Spine → LEFT JOIN → COALESCE.** Build the full set of expected groups (the + **spine**), then LEFT JOIN the aggregated facts onto it: + - **Category/dimension spine:** the distinct values from the **domain-defining + dimension/entity table** (e.g. all regions from a `regions` table), *not* + `SELECT DISTINCT region FROM facts` — the latter yields only categories that + already occur, so a zero-activity category still drops. When no dimension + table exists, the distinct values from the **unfiltered** fact table are the + best available domain (with the residual caveat that a category which never + occurs at all cannot surface). + - **Period/number spine:** generate the series for the question's stated range + (e.g. each month of 2023 → Jan..Dec 2023). 
The series bounds come from the + question's explicit range; when the range is "all periods present," derive + bounds from `MIN`/`MAX` over the **unfiltered** facts. The concrete + series-generation syntax is per-dialect — the rule points the author to + `sql_dialect_notes` (see requirement 2) and shows no inline series SQL. + +3. **COALESCE by measure additivity.** Default missing measures with + `COALESCE(metric, 0)` for **additive** measures (a `COUNT` or `SUM` of events + or amounts — "no activity" genuinely reads as 0). Leave **non-additive** + measures (`AVG`, a running balance, a price, a rate, a ratio) as **NULL** — + absence is "no data," and 0 would be a wrong reading. *Why:* 0 is a real value + only for additive measures. + +4. **Don't over-apply (the each-vs-which guard).** When the question asks only + about groups that exist ("*which* months had orders", "regions that made a + sale"), the spine is unnecessary and wrong — emit only observed groups. The cue + is *each / all / every* (complete domain) vs *which / that have* (observed + subset). + +5. **One worked example — the category spine, fully portable.** Add **exactly + one** compact before/after example demonstrating the pattern with a + **distinct-dimension spine**: the wrong shape (`GROUP BY` over facts, empty + groups missing) and the right shape (`SELECT DISTINCT` domain from the + dimension table → LEFT JOIN aggregated facts → `COALESCE(metric, 0)`). Generic + table/column names, standard SQL only — no series generation, no dialect + functions, so the example stays dialect-clean. The period-spine variant is + described in prose (requirement 2) and delegated to `sql_dialect_notes`; it + gets **no** inline example. This is the **third** worked `sql` example in the + skill (after spec 07's window-then-filter and spec 09's multi-hop fan-out). + +6. 
**Step pointer, no duplication.** The validate/explain step (and/or the query + step) already points into `` for answer-completeness; extend that + existing pointer's wording if needed, but state the rule **once** inside + ``. The step-5 pointer that lists what `sql_dialect_notes` provides + ("FQTN, identifier-quoting, date, top-N, and JSON conventions") should also + name the **series/calendar** convention now that it exists. + +### 2. Dialect-notes surface — `dialects/*.md` + +Add a **"Series"** (date/number range) line to **each** of the seven authored +dialect files, giving that engine's idiomatic way to generate a contiguous +date or integer series for use as a spine. Each note is engine-exclusive — a +SQLite analyst gets the SQLite idiom and never another engine's construct, per the +existing dialect-notes leak guards. Orientation (exact syntax is the +implementer's): + +- **postgres:** `generate_series('2023-01-01'::date, '2023-12-01'::date, interval '1 month')`. +- **sqlite:** recursive CTE — `WITH RECURSIVE m(d) AS (SELECT '2023-01-01' UNION ALL SELECT date(d,'+1 month') FROM m WHERE d < '2023-12-01')`. +- **bigquery:** `UNNEST(GENERATE_DATE_ARRAY('2023-01-01','2023-12-01', INTERVAL 1 MONTH))` (and `GENERATE_ARRAY` for integers). +- **snowflake:** `TABLE(GENERATOR(ROWCOUNT => n))` with `DATEADD('month', SEQ4(), start)`, or a recursive CTE. +- **mysql:** recursive CTE (8.0+) with `DATE_ADD(d, INTERVAL 1 MONTH)`. +- **clickhouse:** `numbers(n)` / `range(n)` with `addMonths(start, number)` (or `arrayJoin`). +- **tsql:** recursive CTE with `DATEADD(month, …)`, or a numbers/tally table. + +This line is what makes the period spine usable from the dialect-agnostic skill, +and it is also consumed by **spec 11** (rolling-window-over-gappy-dates needs the +same date spine) — so it is foundational, not scope creep. + +### 3. 
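
Illustration only, not part of the shipped skill or dialect notes: the period
spine from requirement 1 combined with the sqlite orientation bullet above can
be exercised end to end. This is a minimal sketch that assumes a hypothetical
`orders(order_id, order_date)` table with ISO-8601 text dates and an in-memory
SQLite database; the table, column, and CTE names are invented for the example.

```python
import sqlite3

# Hypothetical schema and data: orders in Jan, Feb and Apr 2023 only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, order_date TEXT);
    INSERT INTO orders (order_date) VALUES
        ('2023-01-15'), ('2023-01-20'), ('2023-02-03'), ('2023-04-09');
""")

# Wrong shape: a plain GROUP BY can only emit months that have fact rows.
observed = conn.execute("""
    SELECT substr(order_date, 1, 7) AS ym, COUNT(*) AS n_orders
    FROM orders
    WHERE order_date >= '2023-01-01' AND order_date < '2024-01-01'
    GROUP BY ym
    ORDER BY ym
""").fetchall()

# Right shape: month spine -> LEFT JOIN aggregated facts -> COALESCE to 0.
# The recursive CTE is the SQLite series idiom; other engines differ.
full_panel = conn.execute("""
    WITH RECURSIVE months(d) AS (
        SELECT '2023-01-01'
        UNION ALL
        SELECT date(d, '+1 month') FROM months WHERE d < '2023-12-01'
    ),
    monthly AS (
        SELECT substr(order_date, 1, 7) AS ym, COUNT(*) AS n_orders
        FROM orders
        WHERE order_date >= '2023-01-01' AND order_date < '2024-01-01'
        GROUP BY ym
    )
    SELECT substr(m.d, 1, 7) AS ym, COALESCE(mo.n_orders, 0) AS n_orders
    FROM months AS m
    LEFT JOIN monthly AS mo ON mo.ym = substr(m.d, 1, 7)
    ORDER BY ym
""").fetchall()

print(len(observed), "observed months;", len(full_panel), "months in the panel")
```

The count is an additive measure, so the empty months read as `0`; a
non-additive measure joined onto the same spine would be left as `NULL`.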
Coordination with spec 11 + +Spec 11 (time-series window recipes) explicitly depends on this date spine for the +gappy-rolling case ("build a complete date spine first (see spec 10)"). Spec 10 +establishes the spine concept in the Answer-completeness group and the +series syntax in the dialect notes; spec 11 reuses both from the Window-functions +group. Keep the two non-overlapping: spec 10 owns the spine; spec 11 references it. + +## Leak-safety (hard constraint) + +Any worked example or note must use a **synthetic generic schema** (e.g. an +`orders` table with an `order_date`, a `regions` dimension) and demonstrate only +the *pattern* (spine + LEFT JOIN + COALESCE). **No** benchmark table names, SQL, +or result values on either surface. The dialect-notes additions, like the existing +notes, carry no benchmark/grader/version-dated content. The behavior is +reconstructable from first principles and tied to no specific instance. + +## Acceptance criteria + +- `` "Answer completeness / interpretation" states: the full-panel cue, + the spine → LEFT JOIN → COALESCE recipe, the additive-vs-non-additive COALESCE + discriminator (0 vs NULL), and the each-vs-which over-application guard — + inline, dialect-agnostic, each with a generic *why*. +- Exactly **one** new worked `sql` example is present, a portable + distinct-dimension spine (`SELECT DISTINCT` domain → LEFT JOIN → `COALESCE`), + with no series generation and no dialect-specific syntax. The skill then carries + **three** `sql` worked examples total. +- Each of the seven `dialects/*.md` files gains a **Series** (date/number range) + line in its engine's own idiom; no engine leaks another engine's construct, and + the additions contain no benchmark/grader/version-dated content. +- The skill remains dialect-clean: no `QUALIFY`, `strftime`, `julianday`, + `generate_series`, `GENERATE_DATE_ARRAY`, backtick three-part FQTN, or other + single-dialect construct anywhere in `SKILL.md`, including the new example. 
+- The existing interactive guidance (``, ``, the other examples) + and the existing dialect-note rubric lines are intact and uncontradicted. +- No grader/benchmark reference, no output-shape contract, and no anchoring of + *relative* time ("recent" / "past N months") to a `MAX(date)` over the data + appears (period-spine bounds derive from the question's explicit range or, for + "all periods present," from `MIN`/`MAX` over the facts — which is range + derivation, not relative-time anchoring). +- The skill stays scannable and comfortably under the 500-line budget; frontmatter + still parses as `ktx-analytics`. + +## Implementation orientation + +Line numbers drift; treat these as anchors, not addresses. The implementer owns +the prose. + +- **Skill:** `packages/cli/src/skills/analytics/SKILL.md` — add the + panel-completeness bullets to the Answer-completeness group, the single category + spine example, and extend the existing step pointer / dialect-notes provision + list to name the series convention. Leave ``/``/other examples + intact. Delivery is unchanged (single `SKILL.md` per target via + `readAnalyticsSkillContent` in `setup-agents.ts`) — confirm, no change required. +- **Dialect notes:** the seven files under + `packages/cli/src/context/sql-analysis/dialects/`. The list is kept in sync with + `DIALECTS_WITH_NOTES` (`dialect-notes.ts`) and shipped to `dist` by + `copy-runtime-assets.mjs` — no plumbing change, content only. +- **Tests:** + - `packages/cli/test/skills/analytics-skill-content.test.ts` — add a + representative phrase for the completeness rule; bump the `sql`-fence count + assertion **2 → 3**; assert the spine + LEFT JOIN + `COALESCE` shape; the + existing dialect-clean guards already cover the no-inline-series requirement + (the example is `SELECT DISTINCT`, so they pass unchanged). 
+ - `packages/cli/test/context/mcp/dialect-notes.test.ts` — extend the rubric loop + (the "answers the full rubric for every dialect" test) so every dialect must + also answer a **Series** line, e.g. `expect(notes).toMatch(/\*\*Series/)`. + Coverage is derived from `DIALECTS_WITH_NOTES`, so the new assertion enforces + all seven without a hand-maintained list. +- Rebuild and re-link the dev binary so the playground picks up both surfaces: + `pnpm run build && pnpm run link:dev`. + +## Benchmark context (motivation only) + +Per-period / per-category questions where some periods are empty produce +short-row result mismatches in the SQLite subset, and the related rolling/cumulative +cluster (spec 11) needs a complete date spine to be correct at all. The fix is a +universal reporting habit (complete panels) plus the per-dialect series syntax +that makes it executable — both belong in the product, where they help real +analysts. Improving the benchmark score is a side effect; the skill and the +dialect notes contain no trace of the benchmark. + +## Implementation notes + +Shipped on branch `write-feature-spec-wiki`. Content-only across two surfaces, no +new tool/flag/config, no plumbing change. 
+ +**Surface 1 — skill (`packages/cli/src/skills/analytics/SKILL.md`):** +- Added a **"Complete the panel for 'each / every / all / per '"** bullet to the `` "Answer completeness / interpretation" + group, directly after the *"For each X / per X / by X"* bullet, with three + sub-bullets carrying the rest of the rule each with its generic *why*: **Spine + source** (distinct domain from the dimension/entity table — not `SELECT DISTINCT` + over the facts; period/number series across the question's stated range, bounds + from `MIN`/`MAX` over the *unfiltered* facts for "all periods present"; series + syntax delegated to `sql_dialect_notes`), **Default by additivity** + (`COALESCE(metric, 0)` for additive measures, `NULL` for non-additive), and + **Don't over-apply** (the each-vs-which guard). +- Added **one** worked `sql` example at the end of the Answer-completeness group: a + portable distinct-dimension spine (`SELECT DISTINCT region_id FROM regions` → + `LEFT JOIN` aggregated facts → `COALESCE(ro.n_orders, 0)`), wrong-vs-right, + standard SQL only, no series generation, no dialect functions. The skill now + carries **three** `sql` worked examples. +- Extended the step-5 dialect-notes pointer to name the **series/calendar** + convention alongside FQTN / identifier-quoting / date / top-N / JSON. +- Delivery unchanged: `readAnalyticsSkillContent` in `setup-agents.ts` ships the + single `SKILL.md` per target — confirmed, no change. + +**Surface 2 — dialect notes (`packages/cli/src/context/sql-analysis/dialects/*.md`):** +- Added a `- **Series:**` line to all seven authored files (postgres, sqlite, + bigquery, snowflake, mysql, clickhouse, tsql), each in that engine's own idiom + (`generate_series`; recursive CTE with `date(d,'+1 month')`; + `UNNEST(GENERATE_DATE_ARRAY(...))`; `GENERATOR`/`SEQ4`/`DATEADD`; recursive CTE + with `DATE_ADD`; `numbers(n)`/`addMonths`; recursive CTE with `DATEADD` + + `MAXRECURSION`), placed right after each file's Date/time line. 
No cross-engine + leak, no version-dated/benchmark content. Shipped to `dist` unchanged by + `copy-runtime-assets.mjs`; coverage stays derived from `DIALECTS_WITH_NOTES`. + +**Tests:** +- `test/skills/analytics-skill-content.test.ts`: added the `Complete the panel` + and `Default by additivity` phrases; renamed the worked-examples test and bumped + the `sql`-fence count **2 → 3**; asserted the spine + `LEFT JOIN` + `COALESCE` + shape. Also added `generate_series` and `GENERATE_DATE_ARRAY` to the + dialect-clean banned list — a deliberate **strengthening** beyond the spec's + test orientation so the "no inline series" acceptance criterion is *enforced*, + not merely incidentally true of a `SELECT DISTINCT` example. +- `test/context/mcp/dialect-notes.test.ts`: extended the "answers the full rubric + for every dialect" loop with `expect(notes).toMatch(/\*\*Series/)`, so all seven + dialects are required to answer a Series line (coverage derived from + `DIALECTS_WITH_NOTES`, no hand-maintained list). + +**Verification:** both affected test files pass (19 tests). `src` type-check and +`pnpm run build` are clean, and `copy-runtime-assets.mjs` placed the Series line in +all seven `dist` dialect files; `pnpm run link:dev` re-linked `ktx-dev`. Note: an +unrelated, pre-existing `tsconfig.test.json` type error in +`test/mcp-server-factory.test.ts` exists on this branch — untouched by this work +and outside its scope. + +**Coordination with spec 11:** the per-dialect Series line is the foundational +date spine that spec 11 (rolling/cumulative windows over gappy dates) references. +Spec 10 owns the spine (Answer-completeness group + dialect Series notes); spec 11 +will reference it from the Window-functions group. No overlap introduced. 
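
For readers of this spec, the category-spine shape and the additivity rule can
be sketched as follows. This is a hypothetical reconstruction on an invented
schema (`regions`, `orders`, `avg_amount`), run through Python's `sqlite3` only
so it is executable; it is not the text of the shipped worked example, and the
SQL itself uses no engine-specific functions.

```python
import sqlite3

# Hypothetical schema: three regions, orders in only two of them.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE regions (region_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (
        order_id INTEGER PRIMARY KEY, region_id INTEGER, amount REAL
    );
    INSERT INTO regions VALUES (1, 'north'), (2, 'south'), (3, 'west');
    INSERT INTO orders (region_id, amount) VALUES (1, 10.0), (1, 30.0), (2, 5.0);
""")

# Wrong shape: GROUP BY over the facts drops the region with no orders.
wrong = conn.execute("""
    SELECT region_id, COUNT(*) AS n_orders
    FROM orders
    GROUP BY region_id
    ORDER BY region_id
""").fetchall()

# Right shape: spine from the dimension table, LEFT JOIN the aggregated facts.
# The additive count defaults to 0; the non-additive average stays NULL.
right = conn.execute("""
    WITH spine AS (
        SELECT DISTINCT region_id FROM regions
    ),
    ro AS (
        SELECT region_id, COUNT(*) AS n_orders, AVG(amount) AS avg_amount
        FROM orders
        GROUP BY region_id
    )
    SELECT s.region_id,
           COALESCE(ro.n_orders, 0) AS n_orders,
           ro.avg_amount
    FROM spine AS s
    LEFT JOIN ro ON ro.region_id = s.region_id
    ORDER BY s.region_id
""").fetchall()

print(wrong)
print(right)
```

Region 3 appears only in the second result, with a count of `0` and a `NULL`
average, which is the each-vs-which distinction in miniature.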
diff --git a/spider2-specs/specs/11-time-series-window-recipes.md b/spider2-specs/specs/11-time-series-window-recipes.md new file mode 100644 index 00000000..95bf3811 --- /dev/null +++ b/spider2-specs/specs/11-time-series-window-recipes.md @@ -0,0 +1,391 @@ +# Time-series window craft — running totals, rolling-over-time (min-periods), period-over-period + +> Refined spec. Intake draft: `todo/11-time-series-window-recipes.md`. + +## Problem + +A large share of analytics questions are time-series shaped: a **running / +cumulative balance**, a **rolling N-day average**, or **period-over-period +growth**. The agent already knows window functions exist — spec 07 gave the +`` "Window functions" group its determinism and window-then-filter +rules, and spec 10 added panel/period completeness — but it still gets the +*time-series specifics* wrong: + +- a cumulative balance computed **without an explicit unbounded-preceding + frame**, or with the implicit frame misbehaving when there are **ties on the + order key**; +- "rolling 30 days" implemented as `ROWS BETWEEN 29 PRECEDING` over **gappy** + daily data, so the window spans the wrong calendar span when days are missing; +- no **minimum-periods** handling — a rolling average reported before the window + is actually full; +- "growth vs the previous period" written **without `LAG`** (or against the wrong + neighbor), with an **unguarded** `(cur - prev) / prev` that breaks on a zero or + absent prior. + +These are runnable-but-wrong: the structure is close, the edge case diverges. +It is the same failure shape spec 07 addressed at the general level; this spec +adds the time-series specifics to the **same Window-functions group**, building +on the rules already there rather than restating them. + +## Generic use case (independent of any benchmark) + +- "Each account's month-end running balance over 2023" — a cumulative sum of + monthly net over an ordered window. 
+- "30-day rolling average of daily revenue, only once 30 days of history exist." +- "Month-over-month revenue growth rate." + +All three are bread-and-butter for any analyst on any time-series table, with no +benchmark in sight. The methodology is universal analyst craft, so it belongs in +the shipped skill — it transfers to every ktx user querying a live database. + +## Model + +The change is **additive content across two surfaces** — the same split spec 10 +made, and for the same reason. The split is the central design decision; it +satisfies spec 07's hard dialect-agnostic invariant for `` without +weakening it. + +### Why two surfaces (the dialect-agnostic reconciliation) + +Two of the three recipes are **pure standard SQL** and stay entirely in the +dialect-agnostic skill: + +- **Cumulative / running total** — `SUM(x) OVER (... ROWS BETWEEN UNBOUNDED + PRECEDING AND CURRENT ROW)` is standard on every engine. +- **Period-over-period** — `LAG(metric) OVER (...)`, the growth ratio, and a + `NULLIF`-style divide-by-zero guard are standard on every engine. + +The third recipe — a **rolling window over calendar time** — has one piece that +is genuinely dialect-divergent: the **calendar-range window frame**. A native +range frame such as `RANGE BETWEEN INTERVAL '29 days' PRECEDING AND CURRENT ROW` +exists on some engines (e.g. postgres, mysql 8) but **not others** — sqlite has +no date-interval range frame, and SQL Server has **no offset `RANGE` frames at +all**; bigquery's `RANGE` frames are numeric-only. So a portable skill cannot +inline a range frame any more than it could inline a date-series generator. + +ktx already routes that kind of engine-specific syntax through the per-dialect +notes in `packages/cli/src/context/sql-analysis/dialects/.md`, served by +the `sql_dialect_notes` MCP tool (spec 08). 
Spec 10 established the precedent +exactly: series/spine generation was not in the dialect rubric, so it was added +there (the **Series** line) and the dialect-agnostic skill points to it. +Rolling-window framing is the next construct in that same position — not in the +rubric yet, dialect-specific — so the **rolling-window idiom belongs in the +dialect notes**, and the skill points to it. + +Surface 1 (skill) carries the **pattern** (calendar range, not a row count; the +min-periods guard; the spine-or-range choice). Surface 2 (dialect notes) carries +the **concrete rolling-window frame syntax** per engine. + +### Additive, inline, heuristic-with-a-why + +Consistent with specs 07 and 10: the skill change is **additive content in one +Markdown file** (`skills/analytics/SKILL.md`), inline (no bundled `reference/` +file — `setup-agents.ts` ships only `SKILL.md`), dialect-agnostic, and phrased as +**heuristics with a one-line generic rationale**, not a wall of MUSTs. The +dialect-notes change is additive content in the seven existing `dialects/*.md` +files. No new tool, flag, or config on either surface. + +### Build on the rules already present; do not restate them + +The Window-functions group already carries **"Make the ordering deterministic"** +(complete tie-breaker) from spec 07, and the Numeric-precision group carries +**"Round only at the end."** The cumulative and period-over-period recipes +**reference** these rather than repeat them (state each rule once — Anthropic's +"consistent terminology / don't repeat" guidance, already followed in spec 07). +Spec 10's **Series** dialect line is likewise **referenced** by the rolling +recipe's spine fallback, not duplicated. + +## Requirements + +### 1. Skill surface — `` "Window functions" group (three recipes) + +Add three recipes to the **existing** "Window functions" group, after its two +current bullets (deterministic ordering; filter-after-the-window). Each is a +heuristic with a generic *why*, dialect-agnostic. 
+ +1. **Cumulative / running total.** Use an **explicit frame** — `SUM(x) OVER + (PARTITION BY k ORDER BY t ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)` — + with a **complete tie-breaker** on the `ORDER BY` (per the group's existing + deterministic-ordering rule; reference it, do not restate). *Why:* a bare + `ORDER BY` defaults to a `RANGE … CURRENT ROW` frame, which on **ties in the + order key** folds every tied peer into the same cumulative value — it runs and + looks plausible, but the running total jumps at each tie boundary. + +2. **Rolling window over calendar time, plus minimum periods.** "Rolling N + days/months" must span a **calendar range**, not a fixed row count: a `ROWS + BETWEEN n-1 PRECEDING` frame silently measures the wrong span when days are + missing. Two sanctioned techniques: + - **Spine + `ROWS` (portable).** Build a gap-free date spine first (spec 10's + **Series**, via `sql_dialect_notes`) so the data has one row per calendar + unit; then a `ROWS BETWEEN n-1 PRECEDING AND CURRENT ROW` frame equals the + intended calendar span. This path is fully dialect-agnostic. + - **Native range frame or date-keyed self-join (engine-specific).** Where the + engine supports it, a calendar **range frame** expresses the window directly; + otherwise a self-join keyed on the date does. Both use engine-specific + syntax — get the **rolling-window** idiom from `sql_dialect_notes` (see + requirement 3); show no inline range frame in the skill. + + **Minimum periods.** When the question says "only after N periods of data" (or + a rolling metric implies it), emit `NULL` / skip until the window is actually + full — guard on a window count, e.g. `COUNT(*) OVER () = N`. On a + gap-free spine, `COUNT(*)` counts calendar slots; count the **non-null + observations** instead when "N periods" means N data points rather than N + calendar units. 
*Why:* a row-count frame over missing dates measures the wrong + span, and a partial early window is not the requested metric. + +3. **Period-over-period.** Use `LAG(metric) OVER (PARTITION BY k ORDER BY period)` + for the prior-period comparison; compute growth as `(cur - prev) / prev` at + **full precision**, rounding only in the final projection (per the existing + "Round only at the end" rule), and **guard divide-by-zero / NULL prev** + (e.g. divide by `NULLIF(prev, 0)`). *Why:* without `LAG` — or ordered against + the wrong neighbor — the comparison lands on the wrong period, and an unguarded + ratio errors or returns garbage when the prior period is zero or absent. + +**Step pointer (no duplication).** The step-5 `sql_dialect_notes` provision list +(currently "FQTN, identifier-quoting, date, top-N, series/calendar, and JSON +conventions") should also name the **rolling-window** convention now that it +exists. State each rule once inside ``; the workflow steps only point +to it. + +### 2. One worked example — cumulative running total (dialect-agnostic) + +Add **exactly one** new compact before/after `sql` example, demonstrating the +**cumulative running total** — the subtlest of the three (the implicit-frame trap +runs fine and is wrong only at tie boundaries) and the highest-value to show. +Use a synthetic generic schema (e.g. `account_txns(account_id, txn_date, net)`): + +- **Wrong:** `SUM(net) OVER (PARTITION BY account_id ORDER BY txn_date)` — the + implicit `RANGE` frame makes two txns on the same date share one inflated + running balance. +- **Right:** the same with an explicit `ROWS BETWEEN UNBOUNDED PRECEDING AND + CURRENT ROW` frame and a complete tie-breaker (`ORDER BY txn_date, txn_id`). + +Standard SQL only — no `QUALIFY`, no dialect functions, no series generation, no +`RANGE … INTERVAL`. Keep it ~10–14 lines. 
The **rolling-over-time** recipe gets +**no** inline example (its correct form needs the engine-specific frame/spine, +delegated to `sql_dialect_notes`, exactly as spec 10's period-spine variant was +prose-only); the **period-over-period** recipe is self-evident from its bullet +and also gets no example. This is the **fourth** worked `sql` example in the +skill, after spec 07 (window-then-filter), spec 09 (multi-hop fan-out), and +spec 10 (panel-completeness spine). + +### 3. Dialect-notes surface — `dialects/*.md` (rolling window) + +Add a **rolling-window-over-time** idiom line to **each** of the seven authored +dialect files, parallel to spec 10's **Series** line. Each note is +engine-exclusive — a SQLite analyst gets the SQLite idiom and never another +engine's construct, per the existing dialect-notes leak guards. Each note either +gives the engine's native calendar-range frame **or** references its own +**Series** line for the spine + `ROWS` fallback (a cross-reference within the +file, not a duplicate of the Series line). + +Orientation only — **`RANGE`-frame support genuinely varies by engine and +version, so the implementer must verify each engine's current support against +authoritative docs (context7 / the engine's manual) rather than assert it from +memory.** Starting points: + +- **postgres:** native — `... OVER (ORDER BY day RANGE BETWEEN INTERVAL '29 days' + PRECEDING AND CURRENT ROW)`. +- **mysql (8.0+):** native — `RANGE BETWEEN INTERVAL 29 DAY PRECEDING AND CURRENT + ROW` over a temporal order key. +- **bigquery:** `RANGE` frames are **numeric** — range over an integer day key + (e.g. `UNIX_DATE(day)`) with `RANGE BETWEEN 29 PRECEDING AND CURRENT ROW`, or + build a spine (see **Series**) and use a `ROWS` frame. +- **sqlite:** **no** date-interval range frame — build a date spine (see + **Series**) and use a `ROWS` frame. 
+- **tsql (SQL Server):** **no** offset `RANGE` frames at all — build a spine (see + **Series**) and use a `ROWS` frame, or a date-keyed self-join. +- **snowflake / clickhouse:** range-frame support over dates is limited — verify; + default to a spine (see **Series**) + `ROWS` frame where a native calendar range + frame is unavailable. + +This line is what makes the rolling-over-time recipe executable from the +dialect-agnostic skill. It is **distinct** from spec 10's Series line (Series = +how to *generate* a spine; Rolling window = how to compute a *moving +calendar-range aggregate*, natively or via that spine), and it cross-references +the Series line rather than overlapping it. + +### 4. Explicit constraints / exclusions + +None of the following may appear (consistent with specs 07 and 10): + +- **No inline dialect-specific range-frame syntax in the skill** — no + `RANGE … INTERVAL` frame, no series generator, no dialect function. The skill + stays dialect-clean; the range frame lives only in the dialect notes. +- **No anchoring of relative time to `MAX(date)`.** "Recent" / "past N months" + means relative to *now* on a live database. A range *bound* may be derived from + the question's explicit range or, for "all periods present," from `MIN`/`MAX` + over the **unfiltered** facts (range derivation, per spec 10) — but the metric + must never silently redefine "recent" as the data's maximum date. +- **No grader / gold-answer / benchmark reference**, and no output-shape contract + (the skill is for interactive analysis). + +### 5. Coordination with specs 07 and 10 + +All three recipes live in the **existing** `` "Window functions" +group; the two current bullets and the spec-07 window-then-filter example must +stay intact and uncontradicted. + +- **Spec 07** owns the deterministic-ordering rule (Window functions) and the + round-at-the-end rule (Numeric precision). Spec 11 **builds on** both — + references them, never restates them. 
+- **Spec 10** owns the spine concept and the dialect **Series** line. Spec 11 + **references** the spine for the gappy-rolling fallback and adds the **distinct** + rolling-window dialect line. Keep them non-overlapping: spec 10 = how to make a + spine; spec 11 = how to compute a moving calendar-range aggregate (native frame + or spine + `ROWS`). + +## Leak-safety (hard constraint) + +Every worked example or note uses a **synthetic generic schema** (e.g. +`daily_revenue(day, amount)` or `account_txns(account_id, txn_date, net)`) and +shows only the *pattern*. **No** benchmark table names, SQL, or result values on +either surface. The dialect-notes additions, like the existing notes, carry no +benchmark / grader / version-dated content. The behavior is reconstructable from +first principles and tied to no specific instance. + +## Acceptance criteria + +- The `` "Window functions" group states the three recipes — inline, + dialect-agnostic, each with a generic *why*, and each **building on** (not + restating) the deterministic-ordering and round-at-the-end rules: + - **cumulative / running total** with an explicit `ROWS BETWEEN UNBOUNDED + PRECEDING AND CURRENT ROW` frame and a complete tie-breaker; + - **rolling window over calendar time + minimum periods** — calendar range not + row count, the spine-or-range choice, the min-periods `COUNT(*) OVER (...)` + guard — delegating the engine's range-frame syntax to `sql_dialect_notes`; + - **period-over-period** via `LAG`, with full-precision growth and a + divide-by-zero / NULL-prev guard. +- Exactly **one** new worked `sql` example: the cumulative running total, + wrong-vs-right, with the explicit `ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT + ROW` frame and a complete tie-breaker, in standard dialect-agnostic SQL. The + skill then carries **four** `sql` worked examples total. 
+- Each of the seven `dialects/*.md` files gains a **rolling-window-over-time** + idiom line in its engine's own idiom (native calendar-range frame where + supported, otherwise a spine + `ROWS` fallback that references its **Series** + line); no engine leaks another engine's construct, and the additions contain no + benchmark / grader / version-dated content. +- The skill remains **dialect-clean:** no `QUALIFY`, `strftime`, `julianday`, + `generate_series`, `GENERATE_DATE_ARRAY`, backtick three-part FQTN, **and no + inline `RANGE … INTERVAL` frame**, anywhere in `SKILL.md` including the new + example. +- The step-5 `sql_dialect_notes` provision list names the **rolling-window** + convention alongside FQTN / identifier-quoting / date / top-N / series/calendar / + JSON. +- The existing interactive guidance (``, ``, the other + examples), the two existing Window-functions bullets, the window-then-filter + example, and the existing dialect-note rubric lines (including **Series**) are + intact and uncontradicted. +- No grader / benchmark reference, no output-shape contract, and no anchoring of + *relative* time ("recent" / "past N months") to a `MAX(date)` over the data. +- The skill stays scannable and comfortably under the 500-line budget; frontmatter + still parses as `ktx-analytics`. + +## Implementation orientation + +Line numbers drift; treat these as anchors, not addresses. The implementer owns +the prose. + +- **Skill:** `packages/cli/src/skills/analytics/SKILL.md` — add the three recipes + to the "Window functions" group (after its two existing bullets), the single + cumulative worked example, and extend the step-5 dialect-notes provision list to + name the rolling-window convention. Leave `` / `` / the other + examples and the two existing window bullets intact. Delivery is unchanged + (single `SKILL.md` per target via `readAnalyticsSkillContent` in + `setup-agents.ts`) — confirm, no change required. 
+- **Dialect notes:** the seven files under + `packages/cli/src/context/sql-analysis/dialects/`. The list is kept in sync with + `DIALECTS_WITH_NOTES` (`dialect-notes.ts`) and shipped to `dist` by + `copy-runtime-assets.mjs` — no plumbing change, content only. **Verify each + engine's actual `RANGE`-frame support against authoritative docs before writing + the idiom; do not assert from memory.** +- **Tests:** + - `packages/cli/test/skills/analytics-skill-content.test.ts` — add a + representative phrase for each of the three recipes; bump the `sql`-fence count + assertion **3 → 4**; assert the cumulative example shape (e.g. `ROWS BETWEEN + UNBOUNDED PRECEDING AND CURRENT ROW`); and **strengthen** the dialect-clean + guard with a no-inline-`RANGE … INTERVAL` assertion (mirroring spec 10 adding + `generate_series` / `GENERATE_DATE_ARRAY` to the banned list, so the + "range frame lives only in the dialect notes" criterion is *enforced*, not + incidentally true). + - `packages/cli/test/context/mcp/dialect-notes.test.ts` — extend the "answers the + full rubric for every dialect" loop with the rolling-window assertion, e.g. + `expect(notes).toMatch(/\*\*Rolling/)`, so every dialect must answer it. + Coverage is derived from `DIALECTS_WITH_NOTES`, so the new assertion enforces + all seven without a hand-maintained list. +- Rebuild and re-link the dev binary so the playground picks up both surfaces: + `pnpm run build && pnpm run link:dev`. + +## Benchmark context (motivation only) + +Running-balance / rolling / period-over-period questions are the single largest +result-mismatch cluster in the SQLite subset (financial-transactions-style DBs): +cumulative balances with the wrong frame on ties, rolling windows that mis-span +gappy dates, partial early windows, and unguarded period-over-period ratios. 
The +methodology is universal analyst craft, so it belongs in the product's skill +(where it helps every real user) plus the per-dialect rolling-window syntax that +makes it executable — not in a benchmark-specific prompt. Depends on spec 10 (the +date spine) for the gappy-rolling fallback. Improving the benchmark score is a +side effect; the skill and the dialect notes contain no trace of the benchmark. + +## Implementation notes + +Shipped as additive content across the two surfaces the spec specified — no new +tool, flag, or config. + +**Skill (`packages/cli/src/skills/analytics/SKILL.md`).** Added the three recipes +to the existing `` "Window functions" group, after its two bullets and +the spec-07 window-then-filter example: **Cumulative / running total** (explicit +`ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW` + a tie-breaker, referencing +the deterministic-ordering rule), **Rolling window over calendar time, plus +minimum periods** (calendar range not row count; spine-or-native-range choice +delegated to `sql_dialect_notes`; the `COUNT(*) OVER () = N` +min-periods guard), and **Period-over-period** (`LAG` + full-precision growth + +`NULLIF` divide guard, referencing the round-at-the-end rule). Added one worked +`sql` example — the cumulative running total, wrong-vs-right, using +`account_txns(account_id, txn_id, txn_date, net)` — bringing the skill to four +worked examples. Extended the step-5 `sql_dialect_notes` provision list to name +the rolling-window convention. No inline `RANGE … INTERVAL` frame anywhere in the +skill; it stays dialect-clean. + +**Dialect notes (`packages/cli/src/context/sql-analysis/dialects/*.md`).** Added a +**Rolling window over time** line to all seven files, parallel to the spec-10 +**Series** line and cross-referencing it for the spine fallback. 
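The cumulative recipe described above can be sketched as follows. This is a minimal illustration, not the shipped skill example: it reuses the synthetic `account_txns(account_id, txn_id, txn_date, net)` schema named in the notes, the row values are made up, and it assumes Python's bundled SQLite is 3.25 or newer (window-function support).

```python
import sqlite3

# Synthetic data: two transactions share the same txn_date (a tie).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE account_txns (account_id INTEGER, txn_id INTEGER, txn_date TEXT, net INTEGER);
INSERT INTO account_txns VALUES
  (1, 1, '2024-01-01', 100),
  (1, 2, '2024-01-02', 50),
  (1, 3, '2024-01-02', -30),
  (1, 4, '2024-01-03', 20);
""")

# Wrong: no explicit frame and no tie-breaker. The default frame is
# RANGE ... CURRENT ROW, so tied rows are summed together as peers.
wrong = conn.execute("""
SELECT txn_id,
       SUM(net) OVER (PARTITION BY account_id ORDER BY txn_date) AS balance
FROM account_txns
ORDER BY txn_id
""").fetchall()

# Right: explicit ROWS frame plus a complete tie-breaker (txn_date, txn_id).
right = conn.execute("""
SELECT txn_id,
       SUM(net) OVER (
         PARTITION BY account_id
         ORDER BY txn_date, txn_id
         ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
       ) AS balance
FROM account_txns
ORDER BY txn_id
""").fetchall()

print(wrong)  # both tied rows show the same balance
print(right)  # each row shows its own running balance
```

With the default frame, the two transactions on `2024-01-02` both report the balance after the whole day, which is why the recipe asks for an explicit `ROWS` frame and a tie-breaker that makes the ordering total.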
+ +**Deviation — `RANGE`-frame support verified against authoritative docs (the +spec's hard requirement), which corrected two of its starting points:** + +- **postgres** — native interval frame: `RANGE BETWEEN INTERVAL '29 days' + PRECEDING AND CURRENT ROW` (as the spec guessed). +- **mysql** — native interval frame over a temporal key: `RANGE BETWEEN INTERVAL + 29 DAY PRECEDING AND CURRENT ROW` (as guessed). +- **bigquery** — `RANGE` is numeric-only: range over `UNIX_DATE(day)` with + `RANGE BETWEEN 29 PRECEDING AND CURRENT ROW`, or spine + `ROWS` (as guessed). +- **snowflake** — **corrected:** the spec said "limited; default to a spine," but + Snowflake *does* support a native interval `RANGE` frame over a date/timestamp + key and it is gap-tolerant, so the note gives the native frame + (`RANGE BETWEEN INTERVAL '29 days' PRECEDING AND CURRENT ROW`), no spine needed. +- **clickhouse** — **corrected:** the spec said "limited; default to a spine," but + ClickHouse supports a numeric `RANGE` offset over a `Date` column (counts in + days, gap-tolerant); the `INTERVAL` form is unsupported (use seconds for + `DateTime`). The note gives the numeric `RANGE` frame, with spine + `ROWS` as + the fallback. +- **sqlite** — no date-interval range frame (no native date type): spine + `ROWS` + (as guessed). +- **tsql** — `RANGE` supports only `UNBOUNDED`/`CURRENT ROW` (no offset frame): + spine + `ROWS`, or a date-keyed self-join (as guessed). + +**Tests.** `test/skills/analytics-skill-content.test.ts` — added a representative +phrase per recipe (plus `minimum periods`), bumped the `sql`-fence count 3 → 4, +asserted the cumulative example shape (`ROWS BETWEEN UNBOUNDED PRECEDING AND +CURRENT ROW` and the `ORDER BY txn_date, txn_id` tie-breaker), and strengthened +the dialect-clean guard with a no-inline-`RANGE … INTERVAL` regex. 
+`test/context/mcp/dialect-notes.test.ts` — extended the per-dialect rubric loop +with `expect(notes).toMatch(/\*\*Rolling/)`, so every dialect (derived from +`DIALECTS_WITH_NOTES`) must answer the rolling-window rubric. + +**Verification.** Full `@kaelio/ktx` vitest suite green (3001 passed, 1 skipped); +`pnpm run build` mirrors both surfaces into `dist`; `pnpm run link:dev` refreshed +`ktx-dev`. Pre-existing, unrelated note: `tsc -p tsconfig.test.json` reports one +error in `test/mcp-server-factory.test.ts` (a `KtxMcpContextPorts` cast) that is +present in committed branch code and untouched by this work. diff --git a/spider2-specs/specs/12-parse-text-encoded-numbers.md b/spider2-specs/specs/12-parse-text-encoded-numbers.md new file mode 100644 index 00000000..68139ca3 --- /dev/null +++ b/spider2-specs/specs/12-parse-text-encoded-numbers.md @@ -0,0 +1,405 @@ +# Parse text-encoded numeric columns before doing math on them + +> Refined spec. Intake draft: `todo/12-parse-text-encoded-numbers.md`. + +## Problem + +Numeric measures are often stored as **text** with human formatting: unit +suffixes (`"1.2K"`, `"3M"`, `"4B"`), currency symbols and thousands separators +(`"$1,200"`), percent signs (`"12%"`), or non-numeric sentinels for missing/zero +(`"-"`, `"N/A"`, `""`). Aggregating or comparing such a column directly is +**silently wrong**: a string comparison orders `"100" < "9"`, and a naive +`CAST(x AS REAL)` yields `0`/NULL/partial on the formatted values rather than the +intended number. The query runs, the shape looks right, the number is garbage. + +The agent already samples schemas before composing — spec 07 gave the +`` "Schema discovery before writing SQL" group its *"Sample before you +compose"* and *"Cast to the real type before comparing"* rules. But those rules +guard **encoding** (date format, nullability) and **type-mismatch in `WHERE`**; +they say nothing about a column whose declared/affinity type is text yet whose +*meaning* is numeric. 
When the agent sees a "numeric-looking" column it tends to +assume a real number type and skips the parse, so the arithmetic runs on the raw +strings. This spec adds the detect → parse/scale → verify habit to that same +group, building on the two rules already there rather than restating them. + +## Generic use case (independent of any benchmark) + +- A `trade_volume` column stored as `"1.2K" / "3M" / "-"` must become + `1200 / 3000000 / 0` before you can sum it or compute a daily change. +- A `price` stored as `"$1,299.00"` must become `1299.00` before averaging. +- A `conversion_rate` stored as `"12%"` must become `0.12` before weighting it. + +This is routine data hygiene on real, messy production tables — every analyst +hits text-encoded measures on some warehouse, with no benchmark in sight. The +methodology is universal craft, so it belongs in the shipped skill; it transfers +to every ktx user querying a live database. + +## Model + +The change is **additive content across two surfaces** — the same split specs 10 +and 11 made, and for the same reason. The split is the central design decision; +it satisfies spec 07's hard dialect-agnostic invariant for `` without +weakening it. + +### Why two surfaces (the dialect-agnostic reconciliation) + +The **detect → parse → scale** half is **pure portable SQL** and stays entirely +in the dialect-agnostic skill: + +- Stripping `$` / `,` / `%` is a portable chained `REPLACE` over a small, known + set of literal characters — no regex needed. +- Suffix scaling (K=10³, M=10⁶, B=10⁹) is a portable `LIKE`/`CASE` expression. +- Sentinel mapping (`-` / `N/A` / empty → `0` or `NULL`) is a portable `CASE`. +- The final cast to a numeric type is `CAST(... AS DECIMAL)`, broadly portable. + +The **verify** half has one piece that is genuinely dialect-divergent: a +**failure-detecting numeric cast** — a cast that signals (rather than silently +swallows) a value that did not parse. 
This is exactly what requirement 3 +("confirm coverage") needs, and it cannot be written portably: + +- **bigquery:** `SAFE_CAST(x AS FLOAT64)` → `NULL` on failure. +- **snowflake:** `TRY_TO_NUMBER(x)` / `TRY_CAST` → `NULL` on failure. +- **tsql (SQL Server):** `TRY_CAST(x AS DECIMAL(...))` / `TRY_CONVERT` → `NULL`. +- **clickhouse:** `toFloat64OrNull(x)` / `toDecimalOrNull(...)` → `NULL`. +- **postgres / mysql:** no `TRY_CAST` — guard with a numeric pattern test before + casting (e.g. `CASE WHEN x ~ '^-?[0-9.]+$' THEN x::numeric END`). +- **sqlite (the gotcha):** a plain `CAST('abc' AS REAL)` returns **`0.0`** and + `CAST('12abc' AS REAL)` returns **`12.0`** — it neither errors nor NULLs, so an + `IS NULL` coverage check is **silently broken**. Detecting a failed parse needs + a `GLOB`/`typeof` pattern guard. + +So a portable skill cannot inline a safe cast any more than spec 10 could inline a +date-series generator or spec 11 a calendar range frame. ktx already routes that +kind of engine-specific syntax through the per-dialect notes in +`packages/cli/src/context/sql-analysis/dialects/.md`, served by the +`sql_dialect_notes` MCP tool (spec 08). Specs 10 and 11 set the exact precedent: +a construct not yet in the dialect rubric, genuinely engine-specific, was added +there (the **Series** line; the **Rolling window** line) and the dialect-agnostic +skill points to it. The failure-detecting cast is the next construct in that same +position, so the **safe-cast idiom belongs in the dialect notes**, and the skill +points to it. + +Surface 1 (skill) carries the **pattern** (detect the text encoding; parse/scale +in an early CTE; verify with a failure-detecting cast). Surface 2 (dialect notes) +carries the **concrete safe-cast syntax** per engine, including the sqlite +`CAST`-returns-0 gotcha. 
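The sqlite gotcha above is easy to reproduce. The sketch below assumes Python's stdlib `sqlite3`; the helper names `cast_real` and `guarded_cast` are hypothetical, and the `GLOB` guard is one possible pattern for unsigned decimals only (it does not handle signs or exponents), not the wording of the shipped dialect note.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

def cast_real(text):
    """Plain CAST: never errors, never returns NULL for unparseable text."""
    return conn.execute("SELECT CAST(? AS REAL)", (text,)).fetchone()[0]

# Hypothetical guard: at least one digit, only digits and dots, at most one dot.
GUARDED = """
SELECT CASE
         WHEN :v GLOB '*[0-9]*'
          AND :v NOT GLOB '*[^0-9.]*'
          AND :v NOT GLOB '*.*.*'
         THEN CAST(:v AS REAL)
       END
"""

def guarded_cast(text):
    """Failure-detecting cast: NULL (None) when the text is not a plain number."""
    return conn.execute(GUARDED, {"v": text}).fetchone()[0]

samples = ["1200", "abc", "12abc", "3.5"]
plain = [cast_real(v) for v in samples]
guarded = [guarded_cast(v) for v in samples]

print(plain)    # 'abc' and '12abc' silently become numbers
print(guarded)  # the same inputs surface as None
```

Only the guarded form makes an `IS NULL` coverage count meaningful on SQLite, which is the reason the safe-cast idiom is routed through the per-dialect notes.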
+ +The regex character-*strip* is deliberately **not** promoted to the dialect +notes: a portable chained `REPLACE` over a known character set is the opinionated +default, so there is no need for a per-dialect strip line (derive from need; one +default). The dialect surface gains exactly one thing — the safe cast — because +that is the only piece the portable path genuinely cannot express. + +### Additive, inline, heuristic-with-a-why + +Consistent with specs 07, 10, and 11: the skill change is **additive content in +one Markdown file** (`skills/analytics/SKILL.md`), inline (no bundled +`reference/` file — `setup-agents.ts` ships only `SKILL.md`), dialect-agnostic, +and phrased as **heuristics with a one-line generic rationale**, not a wall of +MUSTs. The dialect-notes change is additive content in the seven existing +`dialects/*.md` files. No new tool, flag, or config on either surface. + +### Build on the rules already present; do not restate them + +- The Schema-discovery group already carries **"Sample before you compose"** and + **"Cast to the real type before comparing"** (spec 07). The detect rule + **extends** the first (distinct-value sampling to learn the encoding) and the + parse rule **complements** the second (text-meaning-numeric, not just + text-vs-numeric literal mismatch) — reference them, do not repeat them. +- The sentinel **0-vs-NULL** choice is the **same additive-vs-non-additive + judgment** spec 10 established in its *"Default by additivity"* rule (0 only + when "no value" genuinely reads as 0; NULL otherwise). **Reference** that rule + rather than restating the discriminator (state each rule once). + +## Requirements + +### 1. Skill surface — `` "Schema discovery before writing SQL" + +Add the text-encoded-numeric guidance to the **existing** group, after its two +current bullets. Phrase as heuristics, each with a generic *why*, dialect-agnostic. +It must cover: + +1. 
**Detect text-encoded numerics during sampling.** When a column the question + treats as a number is stored as text, sample its **distinct** values to learn + the encodings actually present — unit suffixes (`K`/`M`/`B`), currency + symbols, thousands separators, percent signs, and non-numeric sentinels + (`-`, `N/A`, empty) — **before** composing. Never infer the format from the + column name. *Why:* compared/aggregated as-is, the text sorts lexically + (`'100' < '9'`) and a naive cast collapses formatted values to `0`/NULL — + producing a silently wrong result instead of an error. + +2. **Parse and scale in an early CTE.** Strip currency/separator/percent + characters, multiply by the suffix scale (K=10³, M=10⁶, B=10⁹), map sentinels + to `0` **or** `NULL` per the question's intent, then cast to a numeric type — + all in **one early CTE**, so every downstream layer sees clean numbers. The + `0`-vs-`NULL` choice for sentinels follows spec 10's **additive-vs-non-additive** + rule (reference it; do not restate). *Why:* a string column aggregated as-is + sorts lexically and casts to 0, so the math is silently wrong. + +3. **Confirm coverage (verify).** After parsing, sanity-check that **no + intended-numeric value silently failed to parse** — a failed parse should + surface as `NULL`, which is only visible with a **failure-detecting cast**. + Note the divergence: a plain `CAST` errors on some engines and, on sqlite, + returns `0`/partial rather than NULL — so use the engine's safe-cast idiom from + `sql_dialect_notes` (requirement 3), then count residual NULLs among + non-sentinel rows. *Why:* an encoding the sample missed would otherwise vanish + as `0`/NULL instead of being caught. + +### 2. One worked example — parse/scale, fully portable + +Add **exactly one** new compact before/after `sql` example demonstrating the +parse-and-scale pattern on a synthetic generic schema +(e.g. 
`metrics(label, value_text)` with values like `'1.2K'`, `'$1,200'`, `'-'`): + +- **Wrong:** `SUM(CAST(value_text AS REAL))` (or summing the raw strings) — the + formatted values collapse to `0`/partial, so the total is silently wrong. +- **Right:** an early CTE that strips symbols with chained `REPLACE`, applies a + `CASE` for the K/M/B suffix scale, maps `'-'`/`'N/A'`/`''` to `0`, casts to + `DECIMAL`, then `SUM`s the parsed column. + +**Standard, portable SQL only** — no `REGEXP_REPLACE`, `SAFE_CAST`, `TRY_CAST`, +`TRY_TO_NUMBER`, `toFloat64OrNull`, `GLOB`, or any dialect function — so the +example stays dialect-clean. Keep it ~12–16 lines. The **verify** step gets **no** +inline example (its correct form needs the engine-specific safe cast, delegated to +`sql_dialect_notes`, exactly as spec 10's period-spine and spec 11's +rolling-window variants were prose-only). + +This adds **one** worked `sql` example to the skill. Spec 11 independently adds +one as well; **do not hardcode the resulting total** — increment from the current +state. As of this writing the skill carries **three** examples (spec 07 +window-then-filter, spec 09 multi-hop fan-out, spec 10 panel spine), so this is +the **fourth**; if spec 11 ships first it is the **fifth**. The fence-count test +assertion is incremented by one from its current value (see Acceptance criteria). + +### 3. Dialect-notes surface — `dialects/*.md` (safe cast) + +Add a **"Safe cast"** idiom line to **each** of the seven authored dialect files, +parallel to spec 10's **Series** line and spec 11's **Rolling window** line. Each +line gives that engine's **failure-detecting numeric cast** — a cast that returns +`NULL` (or is detectably invalid) on a non-numeric input — which is what makes the +verify step correct on that engine. Each note is engine-exclusive (a SQLite +analyst gets the SQLite idiom and never another engine's construct, per the +existing dialect-notes leak guards). 
Orientation only — exact syntax is the +implementer's; verify against authoritative docs (context7 / the engine manual) +rather than asserting from memory: + +- **postgres:** no `TRY_CAST` — guard with a numeric pattern before casting, + e.g. `CASE WHEN x ~ '^-?[0-9.]+$' THEN x::numeric END`. (`regexp_replace` is + available for the strip, but chained `REPLACE` is the portable default.) +- **mysql (8.0+):** no `TRY_CAST` — guard with `x REGEXP '^-?[0-9.]+$'` before + `CAST(... AS DECIMAL)`; `REGEXP_REPLACE` is available for the strip. +- **bigquery:** `SAFE_CAST(x AS FLOAT64)` (or `SAFE_CAST(... AS NUMERIC)`) → + `NULL` on failure. +- **snowflake:** `TRY_TO_NUMBER(x)` / `TRY_TO_DECIMAL(x, p, s)` / `TRY_CAST` → + `NULL` on failure. +- **clickhouse:** `toFloat64OrNull(x)` / `toDecimalOrNull(...)` → `NULL`. +- **tsql (SQL Server):** `TRY_CAST(x AS DECIMAL(18,4))` / `TRY_CONVERT` → `NULL`. +- **sqlite (the gotcha):** a plain `CAST` returns `0`/partial, **not** NULL or an + error, so a coverage check must use a pattern guard such as + `CASE WHEN cleaned GLOB '...' THEN CAST(cleaned AS REAL) END` (or a `typeof` + check) to detect a value that did not parse. + +This line is what makes the verify step executable from the dialect-agnostic +skill. It is **distinct** from the Series and Rolling-window lines (those generate +or window over a calendar; this detects a failed numeric parse). Phrase any +version note as `8.0+`-style, **not** "as of version …" (the dialect-notes test +bans version-dated wording). + +### 4. Explicit constraints / exclusions + +None of the following may appear (consistent with specs 07, 10, and 11): + +- **No inline dialect-specific cast/regex syntax in the skill** — no `SAFE_CAST`, + `TRY_CAST`, `TRY_TO_NUMBER`, `REGEXP_REPLACE`, `toFloat64OrNull`, + `replaceRegexpAll`, or `GLOB` anywhere in `SKILL.md`. The portable strip is + chained `REPLACE`; the failure-detecting cast lives only in the dialect notes. 
+- **No regex-strip dialect line.** The character strip stays the portable + chained-`REPLACE` default; the dialect notes gain only the **safe cast**. +- **No grader / gold-answer / benchmark reference**, and no output-shape contract + (the skill is for interactive analysis). + +### 5. Coordination with specs 07, 08, 10, and 11 + +- **Spec 07** owns the Schema-discovery group and its two existing bullets + (*"Sample before you compose"*, *"Cast to the real type before comparing"*). + Spec 12 **extends** that group and **builds on** both bullets — references them, + never restates them; they must stay intact and uncontradicted. +- **Spec 08** owns the dialect-notes channel and its leak guards. Spec 12 adds one + rubric line through that channel; the engine-exclusivity guards apply unchanged. +- **Spec 10** owns the additive-vs-non-additive discriminator (Answer + completeness) and the dialect **Series** line. Spec 12 **references** the + additivity rule for the sentinel `0`-vs-`NULL` choice; do not duplicate it. +- **Spec 11** independently adds the dialect **Rolling window** line, one `sql` + example, and the **rolling-window** entry to the step-5 provision list. Spec 12 + touches the **same** three places (the dialect-notes rubric loop, the example + count, and the step-5 list). Both are independent and additive — **add to the + current state, do not assume an order**: name **safe-cast** in the step-5 list + without removing rolling-window/series; increment the example count by one from + whatever it is; add `/\*\*Safe cast/` to the rubric loop alongside any + `/\*\*Rolling/` assertion. + +### 6. Step pointer (no duplication) + +The step-5 `sql_dialect_notes` provision list (currently "FQTN, +identifier-quoting, date, top-N, series/calendar, and JSON conventions"; spec 11 +also names rolling-window) should additionally name the **safe-cast** convention +now that it exists. State each rule once inside ``; the workflow steps +only point to it. 
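The parse-and-scale pattern from requirement 2 can be sketched against the synthetic `metrics(label, value_text)` schema. This is an illustrative draft under stated assumptions, not the shipped worked example: the labels and values are made up, percent handling is omitted, and SQLite (via Python's `sqlite3`) is used only as a convenient engine to run otherwise portable SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE metrics (label TEXT, value_text TEXT);
INSERT INTO metrics VALUES
  ('a', '1.2K'), ('b', '3M'), ('c', '$1,200'), ('d', '-');
""")

# Wrong: a naive cast keeps only the leading numeric prefix (or 0).
wrong_total = conn.execute(
    "SELECT SUM(CAST(value_text AS REAL)) FROM metrics"
).fetchone()[0]

# Right: strip, scale, map sentinels, and cast in one early CTE.
right_total = conn.execute("""
WITH parsed AS (
  SELECT label,
         CASE
           WHEN value_text IN ('-', 'N/A', '') THEN 0
           ELSE CAST(
                  REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(
                    value_text, '$', ''), ',', ''), 'K', ''), 'M', ''), 'B', '')
                  AS DECIMAL(18,4))
                * CASE
                    WHEN value_text LIKE '%K' THEN 1000
                    WHEN value_text LIKE '%M' THEN 1000000
                    WHEN value_text LIKE '%B' THEN 1000000000
                    ELSE 1
                  END
         END AS value_num
  FROM metrics
)
SELECT SUM(value_num) FROM parsed
""").fetchone()[0]

print(wrong_total)  # roughly 4.2: only '1.2' and '3' survive the cast
print(right_total)  # roughly 3002400: 1200 + 3000000 + 1200 + 0
```

Mapping `'-'` to `0` here assumes the measure is additive; for a non-additive measure the sentinel branch would return `NULL` instead, per the additivity rule the spec references.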
+ +## Leak-safety (hard constraint) + +Every worked example or note uses a **synthetic generic schema** (e.g. +`metrics(label, value_text)`) and made-up values (`'1.2K'`, `'$1,200'`, `'-'`), +showing only the *pattern*. **No** benchmark table names, SQL, or result values on +either surface. The dialect-notes additions, like the existing notes, carry no +benchmark / grader / version-dated content. The behavior is reconstructable from +first principles and tied to no specific instance. + +## Acceptance criteria + +- The `` "Schema discovery before writing SQL" group states the three + heuristics — inline, dialect-agnostic, each with a generic *why*, and each + **building on** (not restating) the existing *"Sample before you compose"* and + *"Cast to the real type before comparing"* bullets and spec 10's additivity rule: + - **detect** text-encoded numerics by sampling distinct values (suffixes, + symbols, separators, sentinels) — never from the column name; + - **parse and scale** in an early CTE (strip → suffix-scale → sentinel map → + cast), sentinel `0`-vs-`NULL` per spec 10's additivity rule; + - **confirm coverage** with a failure-detecting cast, delegating the engine's + safe-cast syntax to `sql_dialect_notes`. +- Exactly **one** new worked `sql` example: parse-and-scale, wrong-vs-right, using + chained `REPLACE` + `CASE` suffix scale + sentinel `CASE` + `CAST(... AS + DECIMAL)`, in standard portable SQL. The `sql`-fence count assertion is + incremented by **one** from its current value (3 today → 4; or 5 if spec 11 + shipped first). +- Each of the seven `dialects/*.md` files gains a **"Safe cast"** idiom line in its + engine's own failure-detecting numeric-cast idiom (including the sqlite + `CAST`-returns-0 gotcha); no engine leaks another engine's construct, and the + additions contain no benchmark / grader / version-dated content. 
+- The skill remains **dialect-clean:** no `QUALIFY`, `strftime`, `julianday`, + `generate_series`, `GENERATE_DATE_ARRAY`, backtick three-part FQTN, inline + `RANGE … INTERVAL` frame, **and no `SAFE_CAST` / `TRY_CAST` / `TRY_TO_NUMBER` / + `REGEXP_REPLACE` / `toFloat64OrNull` / `GLOB`**, anywhere in `SKILL.md` + including the new example. +- The step-5 `sql_dialect_notes` provision list names the **safe-cast** convention + alongside FQTN / identifier-quoting / date / top-N / series-calendar / + rolling-window / JSON. +- The existing interactive guidance (``, ``, the other examples), + the two existing Schema-discovery bullets, and the existing dialect-note rubric + lines (including **Series** and, if present, **Rolling window**) are intact and + uncontradicted. +- No grader / benchmark reference, and no output-shape contract. +- The skill stays scannable and comfortably under the 500-line budget; frontmatter + still parses as `ktx-analytics`. + +## Implementation orientation + +Line numbers drift; treat these as anchors, not addresses. The implementer owns +the prose. + +- **Skill:** `packages/cli/src/skills/analytics/SKILL.md` — add the three + heuristics to the "Schema discovery before writing SQL" group (after its two + existing bullets), the single parse-and-scale worked example, and extend the + step-5 dialect-notes provision list to name the safe-cast convention. Leave + `` / `` / the other examples and the two existing + schema-discovery bullets intact. Delivery is unchanged (single `SKILL.md` per + target via `readAnalyticsSkillContent` in `setup-agents.ts`) — confirm, no + change required. +- **Dialect notes:** the seven files under + `packages/cli/src/context/sql-analysis/dialects/`. The list is kept in sync with + `DIALECTS_WITH_NOTES` (`dialect-notes.ts`) and shipped to `dist` by + `copy-runtime-assets.mjs` — no plumbing change, content only. 
**Verify each + engine's actual safe-cast / try-cast support against authoritative docs before + writing the idiom; do not assert from memory** (in particular the sqlite + `CAST`-returns-0 behavior, which is the motivating gotcha). +- **Tests:** + - `packages/cli/test/skills/analytics-skill-content.test.ts` — add a + representative phrase for each of the three heuristics (e.g. a *detect*, a + *parse/scale*, and a *confirm-coverage* phrase) to the `represents every craft + behavior` list; bump the `sql`-fence count assertion **by one** from its + current value; assert the example shape (e.g. `REPLACE(` and `CAST(` and a + suffix-scale multiplier); and **strengthen** the dialect-clean guard by adding + `SAFE_CAST`, `TRY_CAST`, `TRY_TO_NUMBER`, `REGEXP_REPLACE`, `toFloat64OrNull`, + and `GLOB` to the banned list (mirroring spec 10 adding `generate_series` / + `GENERATE_DATE_ARRAY` and spec 11 adding the no-inline-`RANGE … INTERVAL` + guard, so the "safe cast lives only in the dialect notes" criterion is + *enforced*, not incidentally true). + - `packages/cli/test/context/mcp/dialect-notes.test.ts` — extend the "answers + the full rubric for every dialect" loop with the safe-cast assertion, + `expect(notes).toMatch(/\*\*Safe cast/)`, so every dialect must answer it. + Coverage is derived from `DIALECTS_WITH_NOTES`, so the new assertion enforces + all seven without a hand-maintained list. Do **not** add a false-exclusivity + assertion for `TRY_CAST` (it is shared by snowflake and tsql); requiring the + line per dialect is sufficient. +- Rebuild and re-link the dev binary so the playground picks up both surfaces: + `pnpm run build && pnpm run link:dev`. + +## Benchmark context (motivation only) + +At least one SQLite-subset question stores trading volume as suffix-encoded text +(`"K"`/`"M"`, `"-"` for zero) and fails because the agent aggregates the raw +strings — runnable, plausible, wrong. 
The sqlite `CAST`-returns-0 behavior makes +the failure especially insidious: there is no error to alert the agent, and a +naive `IS NULL` coverage check would not catch it either, which is precisely why +the safe-cast idiom belongs in the dialect notes. The fix — parse messy encodings +before math, then verify coverage with a failure-detecting cast — is universal +data hygiene that helps any analyst on any warehouse, so it belongs in the +product's craft (skill) plus the per-dialect safe-cast syntax that makes the +verify step executable, not in a benchmark-specific prompt. Improving the +benchmark score is a side effect; the skill and the dialect notes contain no trace +of the benchmark. + +## Implementation notes + +Shipped on branch `write-feature-spec-wiki`, on top of specs 10 and 11 (both already +applied in the working tree). Built from the current state per the "do not assume an +order" guidance — there were **four** worked examples (specs 07 window-then-filter, +09 multi-hop fan-out, 10 panel spine, 11 cumulative running total), so this is the +**fifth**, and step 5 already named `series/calendar, rolling-window`. + +**Skill — `packages/cli/src/skills/analytics/SKILL.md`:** +- Added the three heuristics to the **"Schema discovery before writing SQL"** group, + after the two existing bullets: *Parse text-encoded numerics before doing math on + them* (detect by sampling distinct values, extending *Sample before you compose*, + never inferring from the column name), *Strip, scale, and cast in one early CTE* + (the *meaning-is-numeric* complement to *Cast to the real type before comparing*, + with the sentinel `0`-vs-`NULL` choice deferred to spec 10's *Default by + additivity* rule), and *Confirm the parse covered every value* (failure-detecting + cast from `sql_dialect_notes`). Each carries a one-line generic *why*; the existing + bullets and the additivity rule are referenced, not restated. 
+- Added **one** portable worked example (`metrics(label, value_text)` with `'1.2K'`, + `'3M'`, `'$1,200'`, `'-'`): wrong = `SUM(CAST(value_text AS REAL))`; right = an + early `parsed` CTE that strips with chained `REPLACE`, scales the K/M/B suffix with + a `CASE`, maps sentinels to `0`, casts to `DECIMAL(18,4)`, then `SUM`s. Standard + portable SQL only — no dialect functions, no inline safe cast. +- Step 5 dialect-notes provision list now names **safe-cast** alongside the others. + +**Dialect notes — `packages/cli/src/context/sql-analysis/dialects/*.md`:** added a +**Safe cast** line to all seven files (after the *Rolling window* line), each giving +that engine's failure-detecting numeric cast: postgres/mysql use a numeric pattern +guard before casting (no `TRY_CAST`; mysql's bare `CAST` returns `0` with a warning); +bigquery `SAFE_CAST`; snowflake `TRY_TO_NUMBER`/`TRY_TO_DECIMAL`/`TRY_CAST`; tsql +`TRY_CAST`/`TRY_CONVERT`; clickhouse `toFloat64OrNull`/`toDecimal64OrNull` (the +`...OrZero` variants return `0`); sqlite documents the `CAST`-returns-`0.0`/partial +gotcha and a `GLOB` pattern guard. ClickHouse function names were verified against +the official docs via context7 (the spec's loose `toDecimalOrNull` is not a real +name — the `toOrNull` family requires a bit width, hence `toDecimal64OrNull`). +No version-dated wording. + +**Tests:** `analytics-skill-content.test.ts` — added the three representative +phrases, bumped the `sql`-fence count 4 → 5 (and the test title), asserted the +example shape (`WITH parsed AS`, `REPLACE(`, `AS DECIMAL(`, `LIKE '%K' THEN 1000`), +and strengthened the dialect-clean banned list with `SAFE_CAST`, `TRY_CAST`, +`TRY_TO_NUMBER`, `REGEXP_REPLACE`, `toFloat64OrNull`, and `GLOB` (mirroring spec 10's +`generate_series` / spec 11's inline-`RANGE … INTERVAL` guards). 
`dialect-notes.test.ts` +— added `expect(notes).toMatch(/\*\*Safe cast/)` to the per-dialect rubric loop, so +all seven (derived from `DIALECTS_WITH_NOTES`) must answer it; no false-exclusivity +assertion for the shared `TRY_CAST`. + +**Verification:** both affected test files pass (19 tests); broader `test/skills` + +`test/context/mcp` pass (65 tests); production type-check (`tsc -p tsconfig.json`) +is clean; `pnpm run build` copies both surfaces into `dist` (7 dialect files carry +*Safe cast*, the built `SKILL.md` carries the parse example) and `pnpm run link:dev` +relinks `ktx-dev`. One **pre-existing, unrelated** type error remains in the +test-only config (`test/mcp-server-factory.test.ts:152`, byte-identical to HEAD, +untouched here) — out of scope for this spec. diff --git a/spider2-specs/specs/14-output-completeness-final-check.md b/spider2-specs/specs/14-output-completeness-final-check.md new file mode 100644 index 00000000..c5b18e43 --- /dev/null +++ b/spider2-specs/specs/14-output-completeness-final-check.md @@ -0,0 +1,336 @@ +# Output completeness — answer every requested part, enforced by a final pre-emit check + +> Refined spec. Intake draft: `todo/14-output-completeness-final-check.md`. + +## Problem + +The single largest correctness failure mode for the analytics skill is +**incomplete output**: the query runs and the methodology is roughly right, but +the projection is missing columns the question asked for. The SQL is runnable and +the aggregate is correct — the answer is simply *short by columns*. Three +recurring shapes: + +1. **Multi-part questions answered partially.** A question that asks for several + things ("report the highest *and* the lowest month, each with its count and + average, *and* the difference") comes back with only the first clause — one + column where several were requested. +2. 
**Identity dropped.** Grouping by a human-readable name but not projecting the + entity's identifier (a product name without its product id, a customer name + without its customer id). +3. **Inputs to a derived value dropped.** Returning a ratio / percentage / + difference but not the underlying counts the question also asked for. + +Shapes 2 and 3 are **already covered** by shipped `` rules — spec 07's +*"Expose identity, not just the label"* and *"Keep the inputs to a derived +value"* — yet they are frequently **not applied**. So the gap is not missing +knowledge: these rules sit as passive heuristics in a list, and nothing makes the +agent reliably check them before finalizing. The fix is twofold: (a) add the +missing **multi-part-completeness** rule that generalizes shapes 1–3, and (b) +turn output-completeness into an **explicit final verification step** the agent +performs before emitting SQL, so the existing identity/inputs rules are actually +enforced rather than merely listed. + +The failure is **model-independent**: a markedly stronger model produced the same +incomplete-output mistakes on these questions, which means it is a +craft/enforcement gap, not a capability gap — exactly the kind of universal +analyst craft that belongs in the shipped skill. + +## Generic use case (independent of any benchmark) + +An analyst is asked: *"For each region, report the highest and the lowest monthly +order count, and the difference between them."* A complete answer has a column for +the region's id and name, the highest count, the lowest count, and the difference +— five columns. Returning just the region and a single number answers only part +of the request. This is a universal expectation on any database: answer **every** +part of a multi-part request, identify the entities, and show the inputs behind +any derived figure — and answer *exactly* that, without padding the result with +columns the question never asked for. 
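
The use case above can be made concrete with a small sketch. It treats the expected projection as a set (requested outputs, entity identifiers, and inputs to derived values) and reports what a candidate projection is missing or adds. Every name here (`ProjectionRequest`, `auditProjection`, the column labels) is hypothetical and chosen for illustration; nothing in it is part of ktx or the skill.

```typescript
// Hypothetical illustration: these types and functions are assumptions, not ktx code.
interface ProjectionRequest {
  requested: string[]; // metrics/attributes the question names
  identifiers: string[]; // identifier of each grouped or named entity
  inputs: string[]; // inputs behind each derived value
}

interface ProjectionAudit {
  missing: string[]; // expected columns that the projection lacks
  extra: string[]; // projected columns the question never asked for
}

function auditProjection(request: ProjectionRequest, projected: string[]): ProjectionAudit {
  // The expected projection is the union of the three groups.
  const expected = new Set<string>([
    ...request.requested,
    ...request.identifiers,
    ...request.inputs,
  ]);
  const actual = new Set<string>(projected);
  return {
    missing: Array.from(expected).filter((column) => !actual.has(column)),
    extra: Array.from(actual).filter((column) => !expected.has(column)),
  };
}

// "For each region, report the highest and the lowest monthly order count,
// and the difference between them."
const regionQuestion: ProjectionRequest = {
  requested: ["region_name", "highest", "lowest", "difference"],
  identifiers: ["region_id"],
  inputs: ["highest", "lowest"], // the difference is derived from these two
};

// Answers only the first clause: short by three columns.
const partial = auditProjection(regionQuestion, ["region_name", "highest"]);

// The five-column answer described above: nothing missing, nothing extra.
const complete = auditProjection(regionQuestion, [
  "region_id",
  "region_name",
  "highest",
  "lowest",
  "difference",
]);
```

In this sketch the inputs to the difference are already among the requested outputs, so they add no columns of their own; a question asking only for a ratio would be the case where the inputs group contributes extra members.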
+ +## Model + +The change is **additive content in one Markdown file** +(`skills/analytics/SKILL.md`), governed by the same invariants spec 07 +established. They constrain the implementer; the exact prose is theirs. + +### Additive, inline, heuristic-with-a-why + +Consistent with specs 07 and 10: the change is additive content in +`skills/analytics/SKILL.md`, **inline** (no bundled `reference/` file — the +`setup-agents.ts` delivery ships only `SKILL.md` per target), dialect-agnostic, +and phrased as **heuristics with a one-line generic rationale**, not a wall of +MUSTs. The new rule extends the existing `` "Answer completeness / +interpretation" group; the shipped bullets in that group (including the *identity* +and *inputs* rules this spec builds on) are preserved unchanged. No new tool, +flag, or config. + +### The over-projection guard carries a *universal* why, not a grader reference + +The intake draft frames "don't pad the result with extra columns" as +*grader-gaming*. The skill forbids **any** reference to a grader, gold answer, or +benchmark (spec 07's hard invariant; the content test bans the words). So the +guard must ship with a **universal analytics rationale** instead: columns the +question did not ask for add noise, mislead the reader into thinking they matter, +and make the result harder to consume — match the request exactly, neither short +nor padded. This is the same reconciliation spec 07 applied to the draft's +"behavior only, no rationale" instruction: generic *why* is required; only +grader/gold/benchmark rationale is banned. + +### Completeness is a closed set — identity and inputs are *inside* it + +"Expose identity" and "keep the inputs" tell the agent to add columns; the +over-projection guard tells it not to. These only contradict if the target is +left fuzzy, so this spec pins it down. 
A **complete projection** is exactly: + +> {every requested metric/attribute} ∪ {the identifier of each grouped/named +> entity} ∪ {the inputs to each derived value}, at the grain the question +> specifies. + +Identity and inputs are **members of that set** — part of completeness, never +"padding." **Under-projection** is any member missing (the failure this spec +attacks); **over-projection** is any column *outside* the set (what the guard +forbids). The implementer must phrase the rule and guard against this single +definition so they read as one coherent notion, not two competing instructions. + +### Dialect-agnostic, additive-only, exclusions intact + +Every addition reads correctly on any dialect — no dialect-specific syntax in the +rule text or the worked example. The existing ``, ``, and the +other `` bullets and examples (specs 07/09/10/11/12) are preserved and +uncontradicted. Spec 07's exclusions still hold: no output-shape contract, no +`MAX(date)` anchoring of relative time, no grader-driven advice, no dialect +syntax. + +## Requirements + +### 1. Multi-part / multi-output completeness — a new umbrella rule + +Add a bullet to the `` "Answer completeness / interpretation" group: +when a question requests several outputs — a **list** ("A, B, and C"), **paired +extremes** ("the highest *and* the lowest"), or a **value plus its components** +("X, Y, and their ratio") — the final projection must contain a column for +**each** requested output. *Why:* answering only the first clause is the most +common way a runnable query is still wrong; the grain and methodology can be +perfect yet the answer is short by columns. + +This rule is the **umbrella** over the two shipped completeness rules: the +*inputs* rule (*"Keep the inputs to a derived value"*) is its "value + components" +instance, and the *identity* rule (*"Expose identity, not just the label"*) is its +"entity identity" instance. 
The new bullet should **name that relationship** +(so the three read as one notion) rather than restating either rule. + +Keep this distinct from the row-selection rules in the same group: *"Top / +highest / most / lowest"* and *"For each X / per X / by X"* govern **which rows** +appear; multi-part completeness governs **which columns** appear. They compose +(e.g. "highest and lowest per region" needs one row per region *and* a column per +clause). + +### 2. Final completeness check — the enforcement mechanism + +The rule content lives **once** in ``; the trigger is promoted to a +first-class line in `` step 6. + +- **Capstone bullet in ``** (closing the "Answer completeness / + interpretation" group): *before emitting the final SQL, re-read the question and + confirm the projection covers* — + 1. every named **metric / attribute** the question asks for (→ the multi-part + rule); + 2. the **identifier** of every grouped or named entity (→ the *identity* rule); + 3. every **input** to each derived value (→ the *inputs* rule); + 4. all at the **grain** the question specifies (→ the *for each X* / panel + rules). + + Each facet cross-references the rule it enforces, so the check is what makes + those passive rules active. Phrase it as a short, concrete "confirm the + projection covers…" checklist, not a wall of MUSTs. + +- **Over-projection guard** (attached to the check): do **not** add columns the + question did not ask for "to be safe" — extra columns add noise, mislead, and + make the result harder to consume; match the request exactly. Carries the + **universal** why from the Model, **never** a grader/gold/benchmark reference. 
+ +- **`` step 6 line** (the explicit ritual): step 6 ("Validate and + explain") gains a mandatory line directing the agent to **always** run the final + completeness check before emitting — re-read the question and verify every + requested output, each entity's identity, each derived value's inputs, and the + grain are all projected — pointing into the `` capstone for the + detail. This **replaces the current conditional pointer's role** ("If a result + is unexpectedly empty or its grain looks wrong, work through the … rules"): the + empty/grain diagnostic stays available (it maps to the existing *"Diagnose empty + results"* and grain rules), but the completeness check fires **unconditionally**, + on every SQL-authoring turn, not only when a result looks off. The workflow line + names the ritual and the four facets; the rationale, guard, and example are + stated once in ``, not duplicated into the workflow. + +### 3. One worked example (dialect-agnostic) + +Add **exactly one** compact before/after example to the "Answer completeness / +interpretation" group, demonstrating multi-part completeness on a **synthetic** +schema (`regions`, `region_monthly`): + +- **WRONG:** answers only the first clause — `SELECT region_name, + MAX(monthly_orders) AS highest … GROUP BY region_name` — with no region id, no + lowest, no difference. +- **RIGHT:** one column per requested output plus the entity's identity, at the + region grain — `region_id, region_name`, the highest, the lowest, and the + difference, with `regions` joined to `region_monthly` and grouped by the region + id and name. + +Standard dialect-clean SQL only (no `QUALIFY`, no dialect functions; `MAX`/`MIN` +are portable aggregates). Keep it tight. It teaches multi-clause coverage + +identity + derived-value inputs in one capstone, and is **distinct** from the +spec-10 `regions` panel example: that one is about missing **rows** (LEFT-JOIN +spine + `COALESCE`); this one is about missing **columns**. 
This is the **sixth** +worked `sql` example in the skill (after specs 07/09/10/11/12). + +### 4. Coordination with specs 03 and 07/09/10/11/12 + +- **Spec 03** (multi-connection routing) owns `` step 0 and the + `connectionId` threading/scoping. Spec 14 touches `` only to add the + completeness-check line to **step 6** — it must not rewrite the routing or the + `` `connectionId` scoping. If both land, step 6 reads coherently: validate + + the completeness ritual. +- **Specs 07/09/10/11/12** own their own bullets and worked examples in + ``. Spec 14 is **additive** to the same "Answer completeness / + interpretation" group and adds one example; it must not remove or contradict + theirs. + +## Leak-safety (hard constraint) + +The example uses an **invented, generic schema** (`regions`, `region_monthly`) and +made-up columns — **no benchmark table names, SQL, or result values.** It teaches +the *pattern* (cover every requested output + identity + inputs, at grain, without +padding), which is universal and tied to no specific instance. The over-projection +guard's rationale is **universal** (noise/clarity/consumability), never +"grader-gaming" or any other scoring reference. No part of the addition mentions a +benchmark, gold answer, grader, or scoring comparator. + +## Acceptance criteria + +- `` "Answer completeness / interpretation" states the **multi-part / + multi-output completeness** rule (a column per requested output; list / paired + extremes / value-plus-components), named as the umbrella over the shipped + *identity* and *inputs* rules — inline, dialect-agnostic, with a generic *why*. +- `` states a concrete **final completeness check** (re-read the + question → confirm metrics + entity identity + derived-value inputs + grain are + projected), cross-referencing the existing identity/inputs/grain rules so they + are enforced, not merely listed. 
+- The check carries the **over-projection guard** with a **universal** rationale + (don't pad with unrequested columns — noise / misleading / harder to consume), + and the skill contains **zero** grader/gold/benchmark references anywhere. +- `` **step 6** carries a mandatory line that runs the completeness + check **unconditionally** before emitting and points into the `` + capstone; the rule content is **stated once** in `` (no duplicated + rationale/guard in the workflow). The empty/grain diagnostic remains available. +- Exactly **one** new worked `sql` example is present (synthetic + `regions`/`region_monthly`, wrong vs complete), in standard dialect-agnostic SQL; + the skill then carries **six** `sql` worked examples total. +- The existing interactive guidance (`` steps, ``, the other + `` bullets and the five prior examples) is intact and uncontradicted; + the additive-only and dialect-clean invariants from specs 07/10 still hold. +- None of spec 07's excluded items appear (output-shape contract, `MAX(date)` + anchoring of "recent"/"past N", grader-driven advice, dialect syntax). +- The skill stays scannable and comfortably under the 500-line budget; the + frontmatter still parses as `ktx-analytics`. +- The analytics-skill **content test is updated** to cover the new rule and check + (see Implementation orientation). + +## Implementation orientation + +Line numbers drift; treat these as anchors, not addresses. The implementer owns +the prose. + +- **Skill:** `packages/cli/src/skills/analytics/SKILL.md`. + - Add the multi-part-completeness bullet and the final-completeness-check + capstone (with the over-projection guard) to the `` "Answer + completeness / interpretation" group; add the single + `regions`/`region_monthly` worked example. + - In `` step 6, replace the current conditional answer-completeness + pointer with the mandatory completeness-check line (unconditional, names the + four facets, points into ``); keep the empty/grain diagnostic. 
+ - Leave `` steps 0–5, ``, and the other `` + bullets/examples intact. Delivery is unchanged (single `SKILL.md` per target + via `readAnalyticsSkillContent` in `setup-agents.ts`) — confirm, no change + required. +- **Tests:** `packages/cli/test/skills/analytics-skill-content.test.ts`. + - Add representative phrases to the "represents every craft behavior" list for + the multi-part rule, the final completeness check, and the over-projection + guard. + - Bump the worked-example `sql`-fence count assertion **5 → 6** (and update the + test name/comment), and assert the new example's shape (e.g. `region_monthly`, + `MAX(`, `MIN(`, the difference expression, `region_id`). + - The existing dialect-clean, grader/benchmark-clean, and relative-time + (`MAX(...)` anchoring) guards must still pass — the new example's `MAX`/`MIN` + lines carry no "recent"/"past N" wording, so the phrase-level guard is + unaffected. The `SkillsRegistryService` frontmatter test must still pass. +- Rebuild and re-link the dev binary so the playground picks up the updated skill: + `pnpm run build && pnpm run link:dev`. + +## Benchmark context (motivation only) + +On the latest SQLite-subset run, **incomplete output was the single largest +failure bucket (~13 of 51 voted failures)**: multi-part questions answered +partially, plus dropped identity / derived-value inputs — the latter two being +spec-07 rules that already exist but weren't applied. A probe with a much stronger +model reproduced the *same* incomplete-output failures, confirming this is a +craft-enforcement gap rather than a model-capability one. The fix — answer every +requested part, identify the entities, keep the inputs, and don't pad — is +universal analyst craft, so it belongs in the product skill (and transfers to real +users), enforced as a final pre-emit check rather than left as a passive hint. +Improving the benchmark score is a side effect; the skill contains no trace of the +benchmark. 
+ +## Implementation notes + +Implemented as additive content in one Markdown file plus a test update. + +- **Skill — `packages/cli/src/skills/analytics/SKILL.md`** (`` "Answer + completeness / interpretation" group): + - Added the **"Answer every requested output"** umbrella bullet (list / paired + extremes / value-plus-components → a column per requested output, with a generic + *why*). It names *keep the inputs* and *expose identity* as its "value + + components" and "entity identity" instances, pins the closed-set definition of a + complete projection, and marks itself as governing *which columns* appear — + distinct from the *Top …* / *For each X* row-selection rules, with which it + composes. The two shipped instance rules are preserved verbatim. + - Added the **"Final completeness check"** capstone bullet: a four-facet + "before emitting, re-read the question and confirm the projection covers…" + checklist (metric/attribute → multi-part rule; identifier → *expose identity*; + inputs → *keep the inputs*; grain → *for each X* / *complete the panel*), run on + every query. It carries the **over-projection guard** with a universal rationale + (unrequested columns add noise, mislead, and are harder to consume — match the + request exactly), with **no** grader/gold/benchmark reference. + - Added one worked `sql` example (synthetic `regions` / `region_monthly`): WRONG + answers only the first clause (`SELECT region_name, MAX(monthly_orders) …`), + dropping the region id, the lowest, and the difference; RIGHT projects + `r.region_id, r.region_name`, `MAX` highest, `MIN` lowest, and the + `MAX − MIN` difference, joining `regions` to `region_monthly` and grouping by id + + name. This is the **sixth** `sql` example, dialect-clean (portable `MAX`/`MIN`). 
+ - `` **step 6**: replaced the conditional answer-completeness pointer + with an unconditional *"Always run the final completeness check before emitting"* + line that names the four facets and points into the `` capstone; the + empty/grain diagnostic is retained for diagnosis. Steps 0–5, ``, and the + other `` bullets/examples are untouched. + - Delivery is unchanged: `readAnalyticsSkillContent` in + `packages/cli/src/setup-agents.ts` still ships the single `SKILL.md` per target + (confirmed, no change required). +- **Tests — `packages/cli/test/skills/analytics-skill-content.test.ts`:** added the + three representative phrases (`Answer every requested output`, `Final completeness + check`, `Don't over-project`); bumped the `sql`-fence count assertion 5 → 6 and + renamed that test; asserted the new example's shape (`region_monthly`, + `MAX(rm.monthly_orders)`, `MIN(rm.monthly_orders)`, the `MAX − MIN` difference, and + `r.region_id, r.region_name`). The dialect-clean, grader/benchmark-clean, + relative-time, and frontmatter guards still pass. +- **Verification:** `analytics-skill-content` 9/9 and `setup-agents` 46/46 pass; + production type-check (`tsconfig.json`, src) is clean; `pnpm run build` copied the + updated skill into `dist/skills/analytics/SKILL.md` (6 fences, all new content + present) and `pnpm -w run link:dev` re-linked `ktx-dev` so the playground picks it + up. The skill is 244 lines (< 500 budget) and the frontmatter still parses as + `ktx-analytics`. +- **Deviation (cosmetic):** the worked example uses alias `rm` and a difference + column named `order_count_range`; the intake draft sketched alias `m` and + `AS difference`. The spec leaves prose to the implementer, so the change is purely + naming. +- **Unrelated pre-existing issue:** `tsconfig.test.json` reports one type error in + `packages/cli/test/mcp-server-factory.test.ts` (a `KtxMcpContextPorts`/`contextTools` + mismatch introduced by the earlier connection-scoped-wiki commit `2677b3ef`). 
It is + untouched by this work and out of scope here. diff --git a/spider2-specs/specs/15-mcp-server-structured-logging.md b/spider2-specs/specs/15-mcp-server-structured-logging.md new file mode 100644 index 00000000..5ad31d18 --- /dev/null +++ b/spider2-specs/specs/15-mcp-server-structured-logging.md @@ -0,0 +1,405 @@ +# Structured, leveled logging for the ktx MCP server + +> Refined spec. Intake draft: `todo/15-mcp-server-structured-logging.md`. +> +> **Scope: observability only.** This spec is about *seeing* what the MCP server +> does (which tool, what params, when, how long, outcome). *Preventing* a runaway +> query from blocking the server (off-event-loop / interruptible execution) is a +> separate concern — see "Non-goals". + +## Problem + +The ktx MCP server (`mcp-http-server.ts` + `mcp-stdio-server.ts`, both built +through `mcp-server-factory.ts` on raw `node:http` + the +`@modelcontextprotocol/sdk` transports) emits almost no operational logs. There +is no server-side record of **which MCP tool was called, with what parameters, +when, how long it took, or whether it succeeded** — nor of session open/close or +transport errors. When a tool call is slow, hangs, or a client connection drops +("Transport channel closed"), an operator has no trail to diagnose it and must +resort to process sampling / `lsof` / guesswork — and the offending input +(e.g. the exact SQL) is typically unrecoverable. + +The hook to fix this already exists but is half-built: `instrumentMcpServer` +(`context/mcp/context-tools.ts`) wraps every tool handler and already times it, +but it emits **only on completion** (a sampled `mcp_request_completed` telemetry +event) and **never writes a start line and never writes to the server log**. A +call that never returns therefore leaves no trace at all. 
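
To illustrate why a completion-only hook leaves a hung call invisible, here is a minimal sketch of a handler wrapper that writes a start record before it invokes the handler. It is not the existing `instrumentMcpServer` code: `LineSink`, `withStartAndEndLines`, and the record fields are assumptions made for this example, and the `tool.start` / `tool.end` labels simply borrow the event names this spec uses for start and end records.

```typescript
// Hypothetical sketch: these names are illustrative assumptions, not ktx APIs.
interface LineSink {
  write(line: string): void;
}

type ToolHandler<I, O> = (input: I) => Promise<O>;

function withStartAndEndLines<I, O>(
  tool: string,
  handler: ToolHandler<I, O>,
  sink: LineSink,
  nextCallId: () => string,
): ToolHandler<I, O> {
  return async (input: I): Promise<O> => {
    const callId = nextCallId();
    const startedAt = Date.now();
    // Written before the handler runs, so a call that never returns still leaves a record.
    sink.write(JSON.stringify({ event: "tool.start", tool, callId, params: input }) + "\n");
    try {
      const result = await handler(input);
      const durationMs = Date.now() - startedAt;
      sink.write(JSON.stringify({ event: "tool.end", tool, callId, outcome: "ok", durationMs }) + "\n");
      return result;
    } catch (error) {
      const durationMs = Date.now() - startedAt;
      const message = error instanceof Error ? error.message : String(error);
      sink.write(
        JSON.stringify({ event: "tool.end", tool, callId, outcome: "error", durationMs, message }) + "\n",
      );
      throw error;
    }
  };
}

// Demo: an in-memory sink and a handler whose promise never settles.
const lines: string[] = [];
const memorySink: LineSink = {
  write: (line) => {
    lines.push(line);
  },
};
let counter = 0;
const hungQuery = withStartAndEndLines(
  "sql_execution",
  (_input: { sql: string }) => new Promise<never>(() => {}), // simulates a hang
  memorySink,
  () => `call-${++counter}`,
);
void hungQuery({ sql: "SELECT 1" });
```

After the call, `lines` holds a single start record carrying the SQL and no end record for the same `callId`, which is the trail the current completion-only emission cannot provide. The sketch only shows ordering; whether the record reaches disk before a blocking handler runs depends on the sink writing synchronously.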
+ +## Generic use case (independent of any benchmark) + +Anyone running a long-lived ktx MCP server — a developer's local instance +(stdio, launched by Claude Desktop / Cursor), a foreground HTTP server, or a +shared/hosted HTTP daemon — needs observability into tool-call activity to: + +- diagnose slow or hung tool calls (which `sql_execution` ran, against which + connection, with what SQL, for how long); +- explain client-visible connection failures from the server side (session + lifecycle, transport-closed events); +- audit what agents asked the server to do; +- spot patterns (hot tools, slow connections, error rates). + +This is standard production-server hygiene; the server currently provides none. + +## Design decisions (resolved during refinement) + +These resolve ambiguities the intake draft left open. They constrain the +implementer; the exact code is theirs. + +### One `pino` logger, synchronous, written to **stderr** + +Use `pino` — the de-facto standard structured-JSON logger for Node servers — as +a single shared instance. Two corrections to the draft's sketch: + +- **stderr, not stdout.** The stdio transport reserves **stdout** for the + JSON-RPC protocol (`mcp-stdio-server.ts` deliberately no-ops `stdout.write`); + writing logs there would corrupt the protocol stream. The HTTP daemon already + redirects **both** child fds to `.ktx/logs/mcp.log` + (`managed-mcp-daemon.ts`: `stdio: ['ignore', log.fd, log.fd]`), so stderr lands + in the same log file (surfaced by `ktx mcp logs`). **stderr is therefore the + one universally-correct sink** for both transports. +- **Synchronous, no worker-thread transport.** `pino` writes through a + `DestinationStream` (`{ write(msg) }`) — the server's existing + `KtxCliIo.stderr` sink satisfies that interface directly. Configure pino with a + **synchronous** destination (`pino.destination({ sync: true })`, or the + pino-pretty stream below with `sync: true`). 
This is load-bearing: the + `tool.start` line **must** be flushed to the fd *before* the (possibly + blocking) handler runs, so a runaway synchronous `better-sqlite3` query that + pegs the event loop still leaves the start line on disk. A worker-thread + transport (`transport: { target: ... }`) buffers and can lose that exact line + on a hard crash — **do not use transport mode.** + +### Format is derived from `stderr.isTTY`, not a config flag + +One logger, two serializations chosen by the environment (the "behavior follows +from inputs" rule — not a user-visible knob): + +- **TTY** (`ktx mcp start --foreground` or `ktx mcp stdio` run in a terminal) → + **`pino-pretty` as a synchronous in-process stream** (`pretty({ sync: true, + destination: })`, colorized). A readable live dev view. +- **Not a TTY** (the detached daemon, whose stderr is the `.ktx/logs/mcp.log` + file fd) → **plain JSON line** via the synchronous pino destination. The log + *file* stays structured JSON so the incident workflow ("recover the hung query + with a one-line `grep` / `jq`") works — colorized ANSI in a file would defeat + it. + +`KtxCliIo.stderr` has no `isTTY` field (`cli-runtime.ts`), so detect the terminal +from the underlying stream (`process.stderr.isTTY`) at logger construction, while +still writing *through* the `io.stderr` sink so tests can capture emitted lines. + +### Single hook: extend `instrumentMcpServer`, do not fork a second wrapper + +Tool-call logging is added to the existing `instrumentMcpServer` +(`context-tools.ts`), which already wraps `registerTool` and measures duration. +It receives the **raw** tool input (it wraps the schema-parsing handler from +`registerParsedTool`), so the params it logs include `sql` for `sql_execution`. +The existing telemetry emission stays unchanged; logging is **additive** beside +it. 
Because both transports build their server through `mcp-server-factory.ts` → +`registerKtxContextTools`, this single change gives **both HTTP and stdio** +tool-call logging for free. + +### `sessionId` / `callId` provenance + +- **`sessionId`** comes from the SDK's per-call handler context + (`RequestHandlerExtra.sessionId`; confirmed present in `@modelcontextprotocol/sdk` + `1.29.0`). It is populated for the HTTP StreamableHTTP transport and absent for + stdio (single session) — log it when present, omit otherwise. Add + `sessionId?: string` to `KtxMcpToolHandlerContext` (`context/mcp/types.ts`). +- **`callId`** is generated per invocation with `randomUUID()` (already imported + in `context-tools.ts`). It correlates a `tool.start` with its `tool.end`. + +### No redaction in v1 (explicit) + +v1 ships **no log redaction**. Rationale recorded here so it is a deliberate +choice, not an oversight: these logs are **local** (stderr → `.ktx/logs/mcp.log`), +**never transmitted off-box**, and sit at the **same trust boundary** as the +`ktx.yaml` / environment that already hold the connection credentials. Concretely: + +- Request **headers are never logged** at all, so the bearer token + (`KTX_MCP_TOKEN`) simply isn't collected — this is "not logged," not "redacted." +- Errors are logged with their **full message and stack** via pino's standard + `err` serializer. +- SQL text and tool params are logged **verbatim** (they are not secrets). + +Credential redaction (e.g. a DB URL embedded in a driver error string) is an +explicit **v1 non-goal**; revisit only if these logs are ever shipped off-box. +This drops the draft's "light redaction" requirement and the +`collectTelemetryRedactionSecrets` / scrubber reuse it implied. + +## Requirements + +### 1. One shared pino logger + +- A single `pino` instance per server process, constructed once and threaded to + both the transport layer (for lifecycle events) and the tool layer (for + tool-call events). 
Level set from env (Requirement 7), default `info`. +- Synchronous destination bound to the server's stderr sink (see Design + decisions). Pretty (`pino-pretty`, sync stream) when `process.stderr.isTTY`, + otherwise plain JSON. Each line carries pino's standard `time` and `level`. +- No new dependency beyond `pino` and `pino-pretty`. No OpenTelemetry / metrics + stack, no async/worker transport, no in-app file rotation. + +### 2. Per-session / per-call context via child loggers + +Use pino child loggers so every line carries the relevant correlation fields: +a per-call child binds `{ tool, callId }` plus `sessionId` when present, so one +session's or one call's activity can be grepped from the log. + +### 3. Tool-call logging — START before execute, END after + +In `instrumentMcpServer`, for **every** MCP tool invocation: + +- **On entry, before invoking the handler**, write `tool.start` with + `{ tool, callId, sessionId?, params }` at **`info`**. `params` is the raw tool + input; for `sql_execution` this includes the full **SQL text** (the single most + useful field). The write is synchronous so the line exists even if the handler + never returns. +- **On normal completion**, write `tool.end` with + `{ tool, callId, sessionId?, durationMs, outcome: "ok", resultSize }` at + **`info`** — *unless* it is a slow call (Requirement 4). `resultSize` is a + tool-agnostic size measure (byte length of the serialized result text content). +- **On error**, write `tool.end` with + `{ tool, callId, sessionId?, durationMs, outcome: "error", err }` at **`error`**, + where `err` is the serialized error (message + stack) per Requirement 6. + +`tool.start` and `tool.end` share the **same correlation fields and the same +`info` level** (for the non-slow, non-error case) so that an **unmatched +`tool.start`** — a start with no `tool.end` for the same `callId` — is an +unambiguous "this call hung" signal. 
This is the property that makes a runaway +`sql_execution` identifiable from the log alone, with its exact SQL and +timestamp, no process sampling. + +> **Deliberate change from the intake draft.** The draft put `tool.start` / +> `tool.end` at `debug` (suppressed at the default `info`). That defeats the +> motivating incident: a hang is unpredictable, so debug would have to be enabled +> *before* it occurs, which never happens. v1 logs start/end at **`info`** — an +> always-on access log — so the offending query is recoverable at the default +> level. `debug` is reserved for heavier detail (Requirement 7). + +### 4. Slow-call warning + +When a call **completes** with `durationMs` greater than the configured slow +threshold (Requirement 7), emit its `tool.end` at **`warn`** (carrying the same +fields plus the duration) instead of `info`. This makes a completed-but-slow call +stand out and keeps it visible even when the level is raised to `warn`. + +### 5. Connection / session lifecycle and transport errors + +- **HTTP** (`mcp-http-server.ts`, in `newTransport`): log `session.open` from + `onsessioninitialized` and `session.close` from `onsessionclosed` / + `transport.onclose`, each with `sessionId`, at `info`. **Wire the currently + unused `transport.onerror`** to log `transport.error` (the SDK's + closed-channel / "Transport channel closed" events) at `error`, so a + client-visible connection failure has a server-side counterpart. +- **stdio** (`mcp-stdio-server.ts`): route the existing raw + `transport.onerror` stderr string (it currently writes a plain string) through + the logger as a `transport.error` line at `error`. A single `session.open` / + `session.close` pair for the one stdio connection MAY be logged at `info`. + +### 6. Structured error logging + +Errors are logged as structured objects via pino's standard `err` serializer +(`pino.stdSerializers.err` or equivalent), carrying error class, message, and +stack — never a bare interpolated string. 
The existing telemetry exception +reporting in `instrumentMcpServer` / `registerParsedTool` is unchanged. + +### 7. Configuration surface + +- **`KTX_MCP_LOG_LEVEL`** — pino level (`error` | `warn` | `info` | `debug` | + …), default **`info`**. MCP-scoped name because the MCP server is the only + emitter today; naming it global (`KTX_LOG_LEVEL`) would imply a logging system + that does not exist. +- **`KTX_MCP_SLOW_TOOL_MS`** — slow-call threshold in milliseconds (Requirement + 4), default **`10000`**. Justified as a real ops knob: "slow" differs sharply + between a local SQLite file and a remote warehouse. +- Level ladder that results from Requirements 3–5: + - `debug`: everything below **plus** heavier detail (e.g. result bodies, + progress notifications) — implementer's discretion on what extra to attach. + - `info` (default): `tool.start` / `tool.end`, session lifecycle, slow `warn`s, + errors. + - `warn`: slow-call `tool.end`s, `transport.error`, errored `tool.end`s — but + not routine tool traffic. + - `error`: errored `tool.end`s and `transport.error` only. + +## Acceptance criteria + +- At default level (`info`), invoking any MCP tool produces a `tool.start` + (`tool`, `callId`, `sessionId` when HTTP, `params`) and a matching `tool.end` + (`durationMs`, `outcome`, `resultSize`) line, as **JSON to stderr** when stderr + is not a TTY. +- A tool call that never returns (e.g. a runaway `sql_execution`) leaves a + `tool.start` line carrying its **exact SQL and timestamp** and **no** matching + `tool.end` for that `callId` — so the offending query is recoverable from the + log alone, with no process sampling. +- A completed call slower than `KTX_MCP_SLOW_TOOL_MS` emits its `tool.end` at + `warn` with its `durationMs`. +- Session open/close and transport-closed (`transport.error`) events are logged + with the `sessionId` (HTTP); the stdio transport error path goes through the + logger, not a raw `stderr.write`. 
+- At level `warn`, routine `tool.start` / `tool.end` are suppressed but + slow-call warnings, transport errors, and errored calls are present. +- When stderr is a TTY (`ktx mcp start --foreground` / `ktx mcp stdio` in a + terminal), output is human-readable colorized `pino-pretty`; the daemon log + file (`.ktx/logs/mcp.log`) is plain JSON. Both paths are synchronous. +- The bearer token never appears in any log line (headers are not logged); SQL + and tool params do appear. +- No worker-thread / async log transport is introduced; no OpenTelemetry / + metrics stack; the only new dependencies are `pino` and `pino-pretty`. +- The existing `mcp_request_completed` telemetry and exception reporting still + work unchanged. + +## Non-goals + +- **Preventing / interrupting runaway queries** (off-event-loop execution, query + timeouts, worker-thread isolation). A single synchronous query that fans out + into a massive nested-loop join can peg the single-threaded server for hours + and break new connections — observability surfaces *which* query, but the fix + is execution-model work in a separate spec. (This logging is also the + prerequisite for a future watchdog that detects a `tool.start` with no + `tool.end` past a threshold and recycles the server.) +- **Log redaction** (see Design decisions) — explicit v1 non-goal. +- **Pretty output as a worker-thread transport** — the TTY path uses pino-pretty + as a synchronous in-process stream only. +- Metrics / tracing / OpenTelemetry exporters. +- Forwarding logs to the MCP *client* via the protocol logging capability + (`notifications/message`, `logging/setLevel`) — a possible later enhancement, + distinct from operational stderr logging. +- A global `KTX_LOG_LEVEL` spanning non-MCP commands — out of scope until other + surfaces emit structured logs. + +## Implementation orientation + +Line numbers drift; treat these as anchors, not addresses. The implementer owns +the design. + +- **New module** — a small logger factory, e.g. 
+ `packages/cli/src/context/mcp/logger.ts`: builds the shared pino instance from + the stderr sink + `KTX_MCP_LOG_LEVEL`, choosing the pino-pretty (sync) stream + when `process.stderr.isTTY` else `pino.destination({ sync: true })`, and + exposes a `slow-threshold` read from `KTX_MCP_SLOW_TOOL_MS`. +- **Tool-call logging** — `packages/cli/src/context/mcp/context-tools.ts`: + extend `instrumentMcpServer` (~line 585) to write `tool.start` before + `handler(...)` and `tool.end` after (ok / slow-`warn` / `error`); generate + `callId` via the already-imported `randomUUID`; read `sessionId` from the + handler `context`. Thread the logger via `RegisterKtxContextToolsDeps` + (~line 26) and `registerKtxContextTools` (~line 650). Leave `registerParsedTool` + and the existing telemetry emission intact. +- **Context type** — `packages/cli/src/context/mcp/types.ts`: add + `sessionId?: string` to `KtxMcpToolHandlerContext`; add the logger to + `KtxMcpServerDeps` / the register deps. +- **Server wiring** — `packages/cli/src/context/mcp/server.ts` + (`createDefaultKtxMcpServer` / `createKtxMcpServer`) and + `packages/cli/src/mcp-server-factory.ts` (`createKtxMcpServerFactory`): accept + and pass the logger down to `registerKtxContextTools`. +- **HTTP lifecycle** — `packages/cli/src/mcp-http-server.ts`: construct (or + receive) the logger; in `newTransport` (~line 186) log `session.open` / + `session.close` and add `transport.onerror` → `transport.error`. +- **stdio lifecycle** — `packages/cli/src/mcp-stdio-server.ts`: construct (or + receive) the logger; route the existing `transport.onerror` (~line 54) through + it. +- **Log destination is already captured** — `packages/cli/src/managed-mcp-daemon.ts` + redirects child stdout+stderr to `.ktx/logs/mcp.log`; `ktx mcp logs` + (`commands/mcp-commands.ts`) tails it. No change needed there. +- **Dependencies** — add `pino` and `pino-pretty` to + `packages/cli/package.json`. Verify Knip/Biome dead-code and bundle checks + still pass. 
+- **Tests** — extend `packages/cli/test/mcp-http-server.test.ts`, + `mcp-server-factory.test.ts`, `context/mcp/server.test.ts`, and + `commands/mcp-commands.test.ts`: assert (a) a `tool.start` JSON line is written + before a (mock) handler runs and carries `params`/`sql`; (b) a matching + `tool.end` with `durationMs`/`outcome`; (c) a hung-handler scenario yields a + `tool.start` with no `tool.end` for that `callId`; (d) a slow completion emits + `warn`; (e) session lifecycle + `transport.error` lines; (f) the bearer token + never appears. Inject a capturing `io.stderr` and parse the JSON lines. + *Note:* `mcp-server-factory.test.ts` carries a pre-existing + `KtxMcpContextPorts`/`contextTools` type error (from commit `2677b3ef`, + unrelated to this work) — do not let it mask new failures. +- After implementing, rebuild and re-link so the playground picks it up: + `pnpm run build && pnpm run link:dev`. + +## Benchmark context (motivation, not a requirement) + +Running Spider 2.0-Lite against the MCP server at concurrency, an +adversarial-reviewer-generated query degenerated into a massive nested-loop join; +synchronous `better-sqlite3` executed it on the event loop, pegging a server at +~100% CPU for hours and breaking new MCP connections ("Transport channel +closed"). We could not determine *which* query, because the server logs nothing +about tool calls — diagnosis required `sample` / `lsof` on the live process and +the exact SQL was never recovered. Structured tool-call logging — especially +`tool.start` written synchronously *before* execution, at the default level — +would have turned this into a one-line `grep` of the server log. Improving the +benchmark is a side effect; the logging is generic production-server hygiene. + +## Implementation notes + +Implemented on branch `write-feature-spec-wiki`. All requirements and acceptance +criteria are satisfied. 
+ +**What was built / where** + +- **New module `packages/cli/src/context/mcp/logger.ts`** — `createMcpLogger(io, + { isTTY? })` builds one synchronous `pino` (v10) instance written through the + `io.stderr` sink: plain JSON when stderr is not a TTY, a `pino-pretty` (v13) + synchronous in-process stream (`{ colorize: true, sync: true }`, wrapping the + sink in a `node:stream.Writable`) when it is. Also exports `mcpLogLevel` + (`KTX_MCP_LOG_LEVEL`, validated against pino levels, default `info`), + `mcpSlowToolMs` (`KTX_MCP_SLOW_TOOL_MS`, default `10000`), and + `serializeMcpError`. No worker/async transport; no global `KTX_LOG_LEVEL`. +- **Tool-call logging — `instrumentMcpServer` (`context/mcp/context-tools.ts`)** — + per invocation: `callId = randomUUID()`, a child logger bound to + `{ tool, callId, sessionId? }`, `tool.start { params }` written at `info` + **before** awaiting the handler (synchronous, so a runaway query still leaves it + on disk), and `tool.end` after: `info { durationMs, outcome:"ok", resultSize }`, + `warn` when `durationMs > KTX_MCP_SLOW_TOOL_MS`, or `error { outcome:"error", + err }`. `resultSize` is the UTF-8 byte length of the serialized text content. + The existing `mcp_request_completed` telemetry + `reportException` are unchanged + (`durationMs` is now computed once and shared); `registerParsedTool` is intact. +- **`sessionId` / logger plumbing** — `sessionId?: string` added to + `KtxMcpToolHandlerContext`; a single per-process logger threads from each + transport entrypoint through `createKtxMcpServerFactory` → + `createDefaultKtxMcpServer` → `createKtxMcpServer` → `registerKtxContextTools` + (`KtxMcpServerDeps.logger`, `RegisterKtxContextToolsDeps.logger`). +- **HTTP lifecycle (`mcp-http-server.ts`)** — `session.open` from + `onsessioninitialized`, `session.close` from `transport.onclose`, and the + previously-unused `transport.onerror` wired to `transport.error` at `error`. 
+- **stdio lifecycle (`mcp-stdio-server.ts`)** — the raw `transport.onerror` + string write is replaced by a `transport.error` log line; `session.open` / + `session.close` are logged for the single stdio session. +- **Deps** — `pino ^10.3.1`, `pino-pretty ^13.1.3` added to + `packages/cli/package.json`. +- **Tests** — `test/context/mcp/logger.test.ts` (factory, level/threshold env + parsing, error serializer, TTY vs JSON), a "MCP tool-call logging" block in + `test/context/mcp/server.test.ts` (start-before-handler, matching end with + `resultSize`, hung-handler leaves an unmatched start, slow→`warn`, `warn`-level + suppression with errored end still present, no-logger no-op), session lifecycle + + bearer-token-never-logged in `test/mcp-http-server.test.ts`, and + `test/mcp-stdio-server.test.ts` for `transport.error`. + +**Deviations / decisions** + +- **In-band errors carry no stack (inherent).** `registerParsedTool` converts a + thrown handler error into an `{ isError: true }` result (and reports the full + error via telemetry) before it reaches `instrumentMcpServer`, so the original + stack is already gone. `tool.end` for such a result logs `outcome:"error"` with + `err.message` only; a genuine throw that escapes gets the full pino `err` + serialization (type + message + stack). The field is always `err` for + consistency. This honours "leave `registerParsedTool` intact." +- **`session.close` is logged from `transport.onclose`** (the universal close + signal for both clean DELETE and dropped connections) rather than + `onsessionclosed`, to avoid duplicate lines; `onsessionclosed` keeps its + session-map cleanup role. +- **The logger is optional throughout.** Production always wires one per process; + when absent (programmatic/test callers that inject `createMcpServer`), tool-call + logging is simply off — which keeps existing tests unchanged. 
+- `createMcpLogger` accepts an optional `isTTY` purely as a test seam; production + derives format from `process.stderr.isTTY`. + +**Verification** + +`pnpm --filter @kaelio/ktx exec vitest run` for the four touched/added MCP test +files: 57 passed. Full default `pnpm run test`: 3018 passed, 1 skipped — the only +2 failures are in `test/skills/analytics-skill-content.test.ts`, pre-existing and +unrelated to this change (in-progress analytics-skill work on this branch). +`pnpm run dead-code` (Biome + Knip default + Knip production) clean. `pnpm run +build` and `pnpm run link:dev` succeed. `pnpm run type-check` reports only the +one pre-existing, test-only error in `test/mcp-server-factory.test.ts` from commit +`2677b3ef` (documented above); all source and the new tests type-check clean. diff --git a/spider2-specs/specs/16-bounded-query-execution-timeout.md b/spider2-specs/specs/16-bounded-query-execution-timeout.md new file mode 100644 index 00000000..597968ef --- /dev/null +++ b/spider2-specs/specs/16-bounded-query-execution-timeout.md @@ -0,0 +1,493 @@ +# Bounded query execution (deadline + non-blocking) for read SQL + +> Refined spec. Intake draft: `todo/16-bounded-query-execution-timeout.md`. +> +> **Scope: bound and cancel a read query that runs too long.** This is the +> execution-model companion to spec 15 (MCP structured logging). Spec 15 +> *surfaces* a runaway query in the log; it explicitly defers *preventing* one — +> "off-event-loop execution, query timeouts, worker-thread isolation … is +> execution-model work in a separate spec." This is that spec. + +## Problem + +Two compounding gaps on the read-query path (`executeReadOnly`), confirmed in the +current code: + +1. 
**No execution deadline, handled divergently per connector.** A single + expensive query runs unbounded, and whether it is bounded at all depends + entirely on which driver the caller hit: + - **BigQuery** is the only connector with a real statement timeout — it sets + `jobTimeoutMs` on the query job from a per-connection config field + `job_timeout_ms` (`connectors/bigquery/connector.ts`, `query(...)` ~491–512). + - **ClickHouse** sets a hardcoded 30s *HTTP* `request_timeout` at client + creation (`connectors/clickhouse/connector.ts:602`) — a client-side give-up, + not a server-side `max_execution_time`; the server keeps working. + - **Snowflake, Postgres, MySQL, SQL Server** bound only pool/connection + *acquisition* (Snowflake `acquireTimeoutMillis: 60_000`; Postgres + `connectionTimeoutMillis: 10_000`; SQL Server `idleTimeoutMillis: 30000`; + MySQL pool size only) — nothing bounds statement *execution*. + - **SQLite** has nothing. + +2. **In-process SQLite blocks the event loop and cannot be cancelled.** The + SQLite connector executes on the main thread via synchronous + `better-sqlite3 .prepare().all()` (`connectors/sqlite/connector.ts`, + `query(...)` 311–318, used by `executeReadOnly` 247–251). A slow query freezes + the whole MCP server — it cannot serve other requests, send progress, or write + `tool.end` — and there is no in-thread way to interrupt it: better-sqlite3 (v12) + exposes no interrupt/cancel API. Its documented mechanism for slow queries is a + **worker thread**, and the only way to stop a runaway synchronous query is to + **terminate the thread** executing it (context7 `/wiselibs/better-sqlite3`, + `docs/threads.md`). 
+ +The observed failure (Spider2-lite sqlite run, 2026-06-18): a single +`sql_execution` MCP call — +`SELECT MIN(time_id), MAX(time_id), COUNT(*) FROM profits` on `complex_oracle`, +where `profits` is a VIEW (`costs ⋈ sales`, 918,843 × 82,112 rows, joined on a +4-column key with no composite index) — degraded to an O(N×M) nested-loop scan, +pegged a worker at 100% CPU for 13+ minutes, never returned, produced a +`tool.start` with no matching `tool.end`, and stalled an eval shard until the +worker was killed by hand. A row cap (`maxRows`) does not help: it bounds returned +rows, not scan work, and the failing query returned a single aggregate row. + +## Generic use case (independent of any benchmark) + +Any data agent that lets an LLM author SQL will eventually issue an +accidentally-expensive query — an unindexed or cartesian join, an expensive VIEW, +a wide aggregate over a large fact table. A general-purpose context layer must +bound that and return a clean, fast "query exceeded Ns" error so the agent can +revise (add filters, query base tables, narrow the range) instead of hanging the +tool and the server. This matters for embedded/local warehouses (SQLite, and any +future DuckDB-style in-process driver) and remote ones alike, and is wholly +independent of any benchmark. + +## Design decisions (resolved during refinement) + +These resolve ambiguities the intake draft left open. They constrain the +implementer; the exact code is theirs. + +### One canonical deadline, applied uniformly at the contract + +The deadline is enforced for **every** `executeReadOnly` caller, not only the MCP +`sql_execution` path. `executeReadOnly` has 13 call sites beyond MCP (ingest query +executor, relationship profiling and composite-candidate probes, relationship +validation, historic-SQL probes, `ktx sql`); the contract is the single place to +bound all of them. 
A heavy ingest profiling probe over a giant unindexed join is +exactly as worth abandoning as an interactive one — those call sites are +best-effort and degrade gracefully, so a deadline `KtxQueryError` becomes "skip +this probe / mark unprofiled," not "fail the source." (Requirement 8 covers the +call sites that must treat the timeout as recoverable.) + +> Rejected alternative: a caller-resolved deadline (short on the interactive path, +> longer/none for ingest). That introduces a second value source and the open +> question "what is the ingest budget," for no real gain — the 30s default already +> clears any normal profiling probe, and a probe that exceeds it is one to drop. + +### Default 30s, configurable per-connection via one shared field + +- **Default `30_000` ms.** Fast enough that an LLM agent gets a clean + "exceeded 30s" and revises within the same turn; generous headroom over any + indexed aggregate or normal profiling probe; a genuine pathological nested-loop + scan blows past it immediately. +- **One shared per-connection override**, honored by every connector: + `query_timeout_ms` in `ktx.yaml` (`queryTimeoutMs` in TS), a positive integer + in **milliseconds**. Milliseconds matches the BigQuery SDK and the field it + replaces; the user-facing error still reads in seconds. +- **BigQuery's `job_timeout_ms` config key is removed**, not kept alongside the + new field. BigQuery reads the shared `query_timeout_ms` and maps the resolved + value onto its SDK's `jobTimeoutMs`. ktx keeps no backward compatibility, so + there is exactly one way to set a query timeout — no parallel knob (intake + requirement 1). +- **Granularity is per-connection only.** No global all-connections override — + different warehouses have different performance envelopes, and a second + (global) knob would double the configuration surface for no stated need. 
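
A minimal sketch of the default-plus-override rule described above, assuming the function and constant names this spec uses for its shared deadline module. The `KtxQueryError` class here is a hypothetical stand-in for the real one, the `ConnectionConfigSketch` type is invented for illustration, and the validation error wording is an assumption; only the default value, the field name, and the seconds-based message come from the spec.

```typescript
// Hypothetical stand-in for ktx's KtxQueryError; the real class lives elsewhere.
class KtxQueryError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "KtxQueryError";
  }
}

// Default from the spec: 30 seconds, expressed in milliseconds.
const DEFAULT_QUERY_TIMEOUT_MS = 30_000;

// Illustrative shape only: the one field this sketch reads from a connection.
interface ConnectionConfigSketch {
  query_timeout_ms?: unknown;
}

// Returns the validated per-connection override, else the default.
// Zero, negative, and non-integer values are rejected rather than clamped.
function resolveQueryDeadlineMs(connection: ConnectionConfigSketch): number {
  const override = connection.query_timeout_ms;
  if (override === undefined) return DEFAULT_QUERY_TIMEOUT_MS;
  if (typeof override !== "number" || !Number.isInteger(override) || override <= 0) {
    // Assumed wording; the real config validation message may differ.
    throw new Error(
      `query_timeout_ms must be a positive integer (milliseconds), got ${String(override)}`,
    );
  }
  return override;
}

// The config value is in milliseconds; the user-facing message reads in seconds.
function queryDeadlineExceededError(deadlineMs: number): KtxQueryError {
  return new KtxQueryError(`query exceeded ${Math.round(deadlineMs / 1000)}s`);
}
```

Keeping the default and the precedence in one function means a connector only ever asks "what is my deadline" and never re-derives it.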
+ +### The shared contract is a value + an error, not a base class + +There is **no shared connector base class or factory** — each connector is +constructed independently; the only shared registry is the *dialect* factory +(`context/connections/dialects.ts:47–55`). So "defined once" (intake requirement +3) means a single shared module that owns: + +- `DEFAULT_QUERY_TIMEOUT_MS = 30_000`; +- `resolveQueryDeadlineMs(connectionConfig)` → the validated `query_timeout_ms` + override, else the default — so the default and the override precedence live in + exactly one place; +- `queryDeadlineExceededError(deadlineMs)` → a `KtxQueryError` with the canonical + message `query exceeded ${Math.round(deadlineMs / 1000)}s`. + +Each connector calls the resolver once (at construction; connectors already +receive their connection config) and stores `this.deadlineMs`. **Enforcement is +necessarily per-connector** — different engines cancel differently — but the +*value* and the *error message* are shared, so the agent sees one consistent, +actionable error regardless of driver. + +### Real cancellation, not client-side give-up + +Per intake requirement 5, the deadline must *stop the work*, not merely abandon +the promise while the query keeps running (which on a pooled driver also risks +returning a still-busy connection to the pool). So: + +- **In-process (SQLite, and any future embedded driver):** run the query off the + main thread and enforce the deadline by **terminating the worker thread**. There + is no generic `Promise.race` outer wrapper — a `Promise.race` against a + synchronous in-thread `.all()` can never fire (the loop is blocked), and against + a pooled remote query it would poison the pool. Thread termination *is* the + cancellation. +- **Remote engines:** set the engine's **server-side statement timeout** so the + server itself aborts the query and frees the connection cleanly. 
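
The claim that an in-thread timer cannot bound a synchronous query can be illustrated with a small sketch. It assumes nothing about ktx itself: `blockingScan` is a hypothetical stand-in for a synchronous `better-sqlite3` `.all()` call, and the durations are arbitrary.

```typescript
// Hypothetical stand-in for a synchronous, CPU-bound query: it spins on the
// calling thread and never yields to the event loop until it returns.
function blockingScan(durationMs: number): number {
  const end = Date.now() + durationMs;
  let steps = 0;
  while (Date.now() < end) steps += 1;
  return steps;
}

// Arm a "deadline" well inside the scan's duration.
let deadlineFired = false;
const deadlineTimer = setTimeout(() => {
  deadlineFired = true;
}, 10);

// The scan runs for about 100 ms on the same thread that owns the timer.
blockingScan(100);

// The 10 ms deadline has long elapsed, but its callback could not run while
// the thread was busy, so the flag is still false at this point.
const firedDuringScan = deadlineFired;
clearTimeout(deadlineTimer);
```

A `Promise.race` built on such a timer has the same limitation, which is why the deadline has to be enforced from outside the thread that runs the query.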
+ +### Logging routes through spec 15's pino path — no second logger + +The deadline cases are logged through the **existing** MCP tool-call logger +(spec 15's `instrumentMcpServer`, `context/mcp/context-tools.ts:644–730`), not a +new logging path threaded into the connector. Verified flow for a timeout: +`executeReadOnly` throws `queryDeadlineExceededError` (a `KtxQueryError`) → +`local-project-ports.ts` preserves it → `registerParsedTool` (:552) reports it +(`reportException` skips `$exception` for `KtxExpectedError`) and returns an +in-band `isError` result → `instrumentMcpServer` writes `tool.end` at **`error`** +with `outcome:"error"`, `err.message = "query exceeded {N}s"`, and the **same +`callId`** as the `tool.start`. + +This is the central observability win and it requires **no new MCP logging code**: +spec 15 made a hang show up as a `tool.start` with *no* matching `tool.end`; this +spec turns it into a **matched `tool.start` → `tool.end(error)` pair** whose +`tool.end` names the deadline. The worker-termination (SQLite) and server-side +abort (remote) are internal enforcement mechanisms; their single observable signal +is that `tool.end`, so the connector does **not** get its own logger threaded +through `KtxScanContext` — that would fork a second path for one capability. The +"worker was actually reaped, not left spinning" guarantee is asserted by the +worker's `exit` event in tests (Requirement 3), not by a log line. + +## Requirements + +### 1. Shared deadline contract, defined once + +A single new module (e.g. `packages/cli/src/context/connections/query-deadline.ts`) +exports `DEFAULT_QUERY_TIMEOUT_MS` (30_000), `resolveQueryDeadlineMs(connectionConfig)`, +and `queryDeadlineExceededError(deadlineMs)`. Every connector resolves its +deadline through this resolver; no connector hardcodes its own default or +duplicates the override-precedence logic. + +### 2. 
Shared per-connection config field; BigQuery's removed + +`query_timeout_ms` is added to the **shared** connection config schema (validated +as an optional positive integer, milliseconds) so every driver accepts it. The +BigQuery-specific `job_timeout_ms` config field and its dedicated reader +(`bigQueryJobTimeoutMsFromConnection`) are removed; BigQuery sources its timeout +from the shared field and applies it as `jobTimeoutMs`. A bad `query_timeout_ms` +(zero, negative, non-integer) is a clear config validation error, consistent with +how ktx validates `ktx.yaml`. + +### 3. SQLite executes off the main thread, terminated on deadline + +`executeReadOnly` on the SQLite connector MUST NOT block the MCP server event +loop: + +- Read-only validation and the row-limit wrapper (`assertReadOnlySql` + + `limitSqlForExecution`) run **on the main thread** before dispatch — invalid SQL + fails instantly without spawning a worker, and read-only enforcement stays at + the boundary (Requirement 7). +- The validated, row-limited SQL (and any params) is dispatched to a **worker + thread** that opens the database `{ readonly: true, fileMustExist: true }`, runs + the query, and posts back `{ headers, rows, totalRows }` (all values are + structured-cloneable — primitives, `Buffer`, `BigInt`). +- The main thread arms a timer for `this.deadlineMs`; on expiry it calls + `worker.terminate()` and rejects with `queryDeadlineExceededError`. On a normal + message it clears the timer and resolves. On a worker error (SQLite rejected the + SQL) it rejects with that error, message preserved. A provided + `ctx.signal` (`KtxScanContext.signal`, already on the contract) also terminates + the worker, for external cancellation. +- **One short-lived worker per call**, terminated on completion or deadline — not + a persistent worker or pool. 
Terminate-on-deadline destroys the worker, so a + pool would need respawn/job-tracking for no benefit: `executeReadOnly` is + low-frequency (LLM-issued, serial per agent turn) and worker spawn cost is + negligible against query latency. The other SQLite paths (introspect, sample, + stats, distinct-values, row-count) stay on the main thread — they are + ktx-authored, bounded, and not on the `executeReadOnly` contract. +- The event loop stays responsive throughout, so `tool.end` is always written and + concurrent requests on the same port are served. + +### 4. Remote engines set a real server-side statement timeout + +Each remote connector applies `this.deadlineMs` as its engine's server-side +statement timeout, so the deadline stops server work rather than abandoning the +promise: + +| Connector | Mechanism | Unit | +|------------|--------------------------------------------------------|---------------| +| BigQuery | `jobTimeoutMs` on the query job (replaces `job_timeout_ms`) | ms | +| Postgres | `statement_timeout` | ms | +| MySQL | session `max_execution_time` (applies to read-only SELECT — the only kind on this path) | ms | +| Snowflake | `STATEMENT_TIMEOUT_IN_SECONDS` (ALTER SESSION) | s (ceil) | +| ClickHouse | `max_execution_time` setting, with `request_timeout` aligned to the deadline so the HTTP client does not give up before the server aborts | s (ceil) | +| SQL Server | `mssql` `requestTimeout` (TDS attention cancels server-side) | ms | + +ClickHouse's existing hardcoded 30s `request_timeout` is brought under this +contract (derived from the resolved deadline), not left as a parallel mechanism. + +### 5. Timeout resolves as a `KtxQueryError` with the canonical message + +On exceeding the deadline, the path resolves with a `KtxQueryError` +(`query exceeded {N}s`) — a finite, decision-reaching outcome, never an unbounded +hang. For SQLite the worker-termination path throws `queryDeadlineExceededError` +directly. 
For remote engines, each connector recognizes **its own** engine's +timeout signal (Postgres `57014`; MySQL errno `3024`; ClickHouse code `159`; +SQL Server `ETIMEOUT`; Snowflake and BigQuery timeout errors) and re-wraps it as +`queryDeadlineExceededError`, keeping the driver error as `cause`. Each connector +owns its driver's signal — there is no central denylist of error codes to +maintain. + +### 6. MCP surfacing and logging via the existing pino path + +The MCP `sql_execution` path already (a) maps any non-native driver error to +`KtxQueryError` (`context/mcp/local-project-ports.ts:78–88`, guarded by +`isNativeProgrammingFault`), (b) reports it through `reportException`, which skips +`$exception` Error Tracking for `KtxExpectedError`, and (c) writes `tool.start` +synchronously before the handler and `tool.end` in `instrumentMcpServer` +(`context/mcp/context-tools.ts:644–730`). The deadline cases MUST surface through +this path — the implementer verifies and tests them, but adds **no parallel +classification or logging path**: + +- **Query exceeds the deadline (any driver):** a `tool.end` at **`error`** with + `outcome:"error"` and `err.message = "query exceeded {N}s"`, carrying the same + `callId` as the `tool.start`. Classified as an expected error, so it is absent + from `$exception` Error Tracking. The reason `tool.end` was previously missing + is solely the blocked event loop (Requirement 3); once the loop stays free and + the deadline throws, the existing instrumentation logs the matched pair — closing + spec 15's "`tool.start` with no `tool.end` = hang" gap for this case. +- **Completed-but-slow query (under the deadline, over `KTX_MCP_SLOW_TOOL_MS`):** + unchanged from spec 15 — its `tool.end` is emitted at **`warn`**. The deadline + (default 30s) and the slow threshold (default 10s) are independent knobs; a query + between 10s and 30s completes with a slow `warn`, one past 30s is killed with the + `error` above. + +### 7. 
Read-only enforcement and `maxRows` unchanged + +`assertReadOnlySql` and the `maxRows` row cap (`limitSqlForExecution`) behave +exactly as today. The deadline is additive. `maxRows` is not a substitute for it +(it bounds returned rows, not scan work). + +### 8. Best-effort callers treat a deadline timeout as recoverable + +The non-interactive `executeReadOnly` call sites that are best-effort — +relationship profiling, composite-candidate probes, relationship validation, +historic-SQL probes — MUST treat a deadline `KtxQueryError` as "skip this +probe / mark unprofiled" and continue, never as a source-fatal error. The +implementer confirms each such site already swallows query errors into a +graceful-skip and adds that handling where it does not, so the uniform deadline +(Requirement 1, applied to all callers) cannot abort an ingest run. A skipped +probe is logged at the skip site through that path's existing scan/ingest logger +(`KtxScanContext.logger`, `warn`/`debug`), never silently dropped — these callers +are off the MCP tool-call path, so their visibility comes from the logger they +already use. + +## Acceptance criteria + +- A read query that exceeds the deadline returns a `KtxQueryError` + (`query exceeded {N}s`) within roughly the deadline; the MCP worker stays + responsive (a concurrent tool call on the same server completes while the slow + query is still pending) and writes a matching `tool.end` with a non-ok outcome. +- **Logging:** a timed-out `sql_execution` produces a `tool.start` and a matching + `tool.end` (same `callId`) at `error` with `outcome:"error"` and + `err.message = "query exceeded {N}s"` — no unmatched `tool.start` remains. The + timeout does not raise a `$exception` Error Tracking event (it is a + `KtxExpectedError`). A completed query slower than `KTX_MCP_SLOW_TOOL_MS` but + under the deadline still emits its `tool.end` at `warn`. No new logger is + introduced — the lines come from the existing `instrumentMcpServer`. 
+- **SQLite specifically:** executing a deliberately pathological query (an + expensive VIEW or an unindexed cross join) on a fixture does not block the event + loop, is terminated at the deadline, and the worker exits (the off-main-thread + executor is killed, not left spinning) so CPU returns to idle. +- **One server-side-timeout driver (Postgres):** the connector applies + `statement_timeout` equal to the resolved deadline, and a `57014` cancellation + is mapped to the canonical `KtxQueryError`. +- `resolveQueryDeadlineMs` returns 30_000 by default, honors a `query_timeout_ms` + override, and rejects an invalid value (zero / negative / non-integer). +- **No regression:** normal fast queries return identical results; read-only + rejection still works; `maxRows` still bounds returned rows. +- The shared `query_timeout_ms` field is accepted by every connector; BigQuery's + former `job_timeout_ms` key is gone and BigQuery's timeout is driven by the + shared field. + +## Non-goals + +- **A row/byte/cost budget on returned data.** This spec bounds *time*, not result + size — `maxRows` already bounds rows, and BigQuery's `maximumBytesBilled` is a + separate, retained concern. +- **A global `KTX_QUERY_TIMEOUT_MS` or per-call user flag.** One opinionated + default plus a per-connection override; no per-call knob, no global knob. +- **A server watchdog that recycles the process on an unmatched `tool.start`.** + Spec 15 names this as a possible future mitigation; this spec prevents the hang + at the source, so the watchdog is out of scope here. +- **Moving SQLite introspection / sampling / stats off the main thread.** Only the + `executeReadOnly` (LLM-SQL) path needs worker isolation; the rest are bounded + ktx-authored queries. +- **Per-connection retry / backoff on timeout.** A timeout returns a clean error + for the agent to revise; ktx does not auto-retry. 
+- **A second logger threaded into the connector.** The deadline cases are logged + through spec 15's existing MCP tool-call logger; the connector gets no separate + pino instance and `KtxScanContext` gets no MCP-logger thread (see "Logging routes + through spec 15's pino path"). + +## Implementation orientation + +Line numbers drift; treat these as anchors, not addresses. The implementer owns +the design. + +- **Shared contract** — new `packages/cli/src/context/connections/query-deadline.ts`: + `DEFAULT_QUERY_TIMEOUT_MS`, `resolveQueryDeadlineMs`, `queryDeadlineExceededError`. + Error class is `KtxQueryError` (`packages/cli/src/errors.ts:25`). +- **Contract anchor** — `KtxScanConnector.executeReadOnly` + (`context/scan/types.ts:343`), `KtxReadOnlyQueryInput` (`types.ts:285`), + `KtxScanContext.signal` (`types.ts:176`, already present, currently unused on the + MCP path). +- **Config schema** — add `query_timeout_ms` to the shared connection config + (`context/project/config.ts`, `KtxProjectConnectionConfig` and its zod schema); + remove BigQuery's `job_timeout_ms` reader. +- **SQLite worker** — new `packages/cli/src/connectors/sqlite/read-query-worker.ts` + (constructed by path via `new URL('./read-query-worker.js', import.meta.url)`); + rework `connectors/sqlite/connector.ts` `executeReadOnly` (247–251) to validate + on the main thread then dispatch to the worker with a terminate-on-deadline + timer. Reuse `normalizeQueryRows` (`context/connections/query-executor.ts`) in + the worker. Register the worker as a dynamic entry in `knip.json` (it is + referenced by path, not import) and confirm the build copies it into `dist`. 
+- **Remote connectors** — apply the resolved deadline and recognize the engine's + timeout signal in each `executeReadOnly` / `query(...)`: + `connectors/bigquery/connector.ts` (~491–512, `jobTimeoutMs`), + `connectors/clickhouse/connector.ts` (~602/629–644, `max_execution_time` + + `request_timeout`), `connectors/snowflake/connector.ts` (~354–371/510–534, + `STATEMENT_TIMEOUT_IN_SECONDS`), `connectors/postgres/connector.ts` (~822–838, + `statement_timeout`), `connectors/mysql/connector.ts` (~774–793, + `max_execution_time`), `connectors/sqlserver/connector.ts` (~812–832, + `requestTimeout`). +- **MCP path + logging (verify only)** — `context/mcp/local-project-ports.ts:69–88` + (error mapping), the `sql_execution` registration (~915–943), and the logging in + `instrumentMcpServer` (`context/mcp/context-tools.ts:644–730`, which writes + `tool.start`/`tool.end` via the spec-15 pino logger `context/mcp/logger.ts`). No + new classification or logging code; confirm the timeout flows through as an + expected error producing a matching `tool.end(error)` with the canonical message. +- **Best-effort callers** — `context/scan/relationship-profiling.ts` (~227, 275), + `context/scan/relationship-composite-candidates.ts` (~365, 440), + `context/scan/relationship-validation.ts` (~259), + `context/ingest/historic-sql-probes/bigquery-runner.ts` (~97), and the + historic-sql clients: confirm a deadline `KtxQueryError` is swallowed into a + graceful skip. +- **Tests** — a SQLite fixture with a pathological query (tiny `query_timeout_ms` + as the test seam) asserting terminate-on-deadline, event-loop responsiveness + (a concurrent promise resolves while the query is pending), and worker exit; a + Postgres test asserting `statement_timeout` is set to the resolved deadline and + a `57014` error maps to `KtxQueryError`; resolver unit tests (default / + override / invalid); regression tests for normal results, read-only rejection, + and `maxRows`. 
Extend the MCP logging tests (alongside spec 15's, e.g. + `test/context/mcp/server.test.ts`) to assert a timed-out `sql_execution` yields a + matched `tool.start`/`tool.end(error)` pair carrying `query exceeded {N}s`. +- After implementing, rebuild and re-link so the playground picks it up: + `pnpm run build && pnpm run link:dev`. + +## Benchmark context (motivation, not a requirement) + +The Spider2-lite local set loads several warehouses into SQLite, some with +expensive VIEWs over large fact tables — e.g. `complex_oracle.profits` = +`costs ⋈ sales` on `(prod_id, time_id, channel_id, promo_id)`, 918,843 × 82,112 +rows, no composite index, with `promo_id` (the index the optimizer picks) being +95.5% a single value. LLM-authored profiling queries (MIN/MAX/COUNT over such a +view) trigger O(N×M) nested-loop scans. Without a deadline these hang an eval +shard for 10+ minutes; with one, the agent gets a fast error and can scope the +query instead. Improving the benchmark is a side effect; the deadline is generic +production hygiene for any agent that lets an LLM author SQL. + +## Implementation notes + +Implemented on branch `write-feature-spec-wiki` (ktx worktree `tallinn-v2`). All +acceptance criteria are met; tests, type-check, dead-code, and build are green +for the changed surface. + +### What was built, and where + +- **Shared contract** — new `packages/cli/src/context/connections/query-deadline.ts`: + `DEFAULT_QUERY_TIMEOUT_MS = 30_000`, `resolveQueryDeadlineMs(connection)` (returns + the validated `query_timeout_ms` override else the default; throws on + zero/negative/non-integer), and `queryDeadlineExceededError(deadlineMs, options?)` + (a `KtxQueryError` reading `query exceeded ${round(ms/1000)}s`, carrying the + driver error as `cause`). Unit-tested in `test/context/connections/query-deadline.test.ts`. +- **Config field** — `query_timeout_ms` (optional positive integer, ms) added to + the **shared warehouse** schema. 
NOTE (spec drift): that schema lives in + `context/project/driver-schemas.ts` (`warehouseConnectionSchema`), not + `config.ts`. The warehouse schemas use `z.looseObject`, so the field had to be + declared explicitly to be *validated* (otherwise it would pass through + unvalidated). BigQuery's `job_timeout_ms` field and `bigQueryJobTimeoutMsFromConnection` + reader were removed; BigQuery now resolves the shared field. Every connector + resolves its deadline once at construction via `resolveQueryDeadlineMs`. + +### Deviation from the spec's SQLite mechanism (worker thread → child process) + +The spec mandated running SQLite read queries on a **worker thread** and enforcing +the deadline by `worker.terminate()`. This was **empirically disproven**: +`Worker.terminate()` cannot interrupt a CPU-bound synchronous `better-sqlite3` +scan — the native `sqlite3_step` loop never yields to V8, so terminate's promise +never even resolves (an 8s probe of the exact failing query shape confirmed the +thread keeps spinning). better-sqlite3 v12 exposes no `interrupt`/progress-handler +API, and `.iterate()` does not help because the failing query is a single +aggregate row produced only *after* the full scan. + +The implemented mechanism is therefore **`child_process.fork` + `SIGKILL`** +(`packages/cli/src/connectors/sqlite/read-query-child.ts`, spawned from +`connector.ts`). SIGKILL lets the OS reclaim the whole process — a probe confirmed +the scan is interrupted in ~2 ms and CPU returns to idle. This satisfies *both* +SQLite requirements better than a thread (event loop stays free **and** the query +is genuinely cancellable). The child is self-contained (imports only +`better-sqlite3` + node builtins); validation/row-limiting (`limitSqlForExecution`) +and `normalizeQueryRows` stay on the main thread. One short-lived child per call, +killed on completion, deadline, or `ctx.signal` abort. 
Node v24's native +TS type-stripping lets the `.ts` child load under vitest; a `.js`-if-exists-else-`.ts` +URL resolver picks the compiled child in `dist`. Registered as a dynamic entry in +`knip.json`; `tsc` emits it to `dist` (verified, plus a dist-level end-to-end smoke). + +### Remote connectors (server-side timeouts + own-signal mapping) + +Each applies the resolved deadline server-side and re-wraps its own timeout signal +as `queryDeadlineExceededError(deadlineMs, { cause })`: + +- **BigQuery** — `jobTimeoutMs` on the query job; maps a "Job timed out" / timeout-reason error. +- **Postgres** — `statement_timeout` via pool `options` (`-c statement_timeout=`); maps `57014`. +- **MySQL** — `SET SESSION max_execution_time = ` before the read; maps errno `3024`. +- **Snowflake** — `ALTER SESSION SET STATEMENT_TIMEOUT_IN_SECONDS = ` in the pooled connection; maps code `604` / "reached its … timeout". +- **ClickHouse** — `max_execution_time` (ceil seconds) setting, with `request_timeout` set to `deadline + 5s` so the HTTP client outlasts the server abort (replaces the old hardcoded 30s); maps code `159`. +- **SQL Server** — `requestTimeout` on the `mssql` pool config (TDS attention cancels server-side); maps `ETIMEOUT`. + +Each connector has a focused test asserting the timeout is applied and its signal +maps to `KtxQueryError` (Postgres is the spec's required acceptance test). + +### Best-effort callers (Requirement 8) + +Confirmed already graceful: relationship **profiling** (outer try/catch → +`profile_failed` warning) and **composite-candidate** detection +(`detectCompositeRelationships` → recoverable warning, returns `[]`). Historic-SQL +**probes** flow through `runHistoricSqlReadinessProbe`, which catches *any* error +into `{ ok: false }`. 
**Added** handling to relationship **validation**: a +`KtxQueryError` on the per-candidate coverage probe now sends that one candidate to +`review` (`validation_query_failed`, logged via `ctx.logger.warn`) instead of +aborting the whole validation pass. `ingest-query-executor.ts` is a generic +executor port whose callers own recoverability — left unchanged. + +### MCP surfacing/logging + +No new MCP classification or logging code. The deadline `KtxQueryError` flows +through the existing `local-project-ports` mapping → `reportException` (skips +`$exception` for `KtxExpectedError`; existing test `telemetry/exception.test.ts` +covers the skip for `KtxQueryError`) → `instrumentMcpServer`, which logs a matched +`tool.start` → `tool.end(error, level 50)` pair carrying `err.message = "query +exceeded {N}s"`. A test in `test/context/mcp/server.test.ts` asserts the matched +pair, closing spec 15's "`tool.start` with no `tool.end` = hang" gap for this case. + +### Pre-existing branch issues encountered (not part of this feature) + +- `test/mcp-server-factory.test.ts` had a type error (an `as` cast to a shape with + a fake `context_tool` key, introduced by branch commit `2677b3ef`) that broke + `tsc -p tsconfig.test.json`. Fixed with a clean single cast to keep the + type-check gate green; behavior unchanged. +- `test/skills/analytics-skill-content.test.ts` fails (2 cases: missing + `**Window functions**` heading and `Expose identity, not just the label` prose + in `src/skills/analytics/SKILL.md`). This is unrelated analytics-skill (spec + 13/14) content drift committed earlier on the branch; **left untouched** — no + skill files were modified by this feature. 
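The shared deadline contract summarized under "What was built, and where" can be sketched roughly as follows. This is an illustration under stated assumptions, not the shipped `query-deadline.ts`: the local `KtxQueryError` stand-in, the `ConnectionLike` type, the plain `Error` thrown for invalid config, and the `mapPostgresErrorSketch` helper are hypothetical. Only the constant, the two function names, the message shape, and the Postgres `57014` code come from the notes above.

```typescript
// Stand-in for the real KtxQueryError in packages/cli/src/errors.ts (assumed shape).
class KtxQueryError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "KtxQueryError";
  }
}

const DEFAULT_QUERY_TIMEOUT_MS = 30_000;

// Hypothetical minimal view of a connection config entry.
type ConnectionLike = { query_timeout_ms?: unknown };

// Returns the validated override, else the default; rejects zero, negative,
// and non-integer values. The error type used here is an assumption.
function resolveQueryDeadlineMs(connection: ConnectionLike): number {
  const override = connection.query_timeout_ms;
  if (override === undefined) return DEFAULT_QUERY_TIMEOUT_MS;
  if (typeof override !== "number" || !Number.isInteger(override) || override <= 0) {
    throw new Error(`query_timeout_ms must be a positive integer (ms), got ${String(override)}`);
  }
  return override;
}

// Canonical message: "query exceeded {N}s", with the driver error attached as `cause`.
function queryDeadlineExceededError(deadlineMs: number, options?: { cause?: unknown }) {
  const error = new KtxQueryError(`query exceeded ${Math.round(deadlineMs / 1000)}s`);
  return Object.assign(error, { cause: options?.cause });
}

// Hypothetical mapper: Postgres reports a statement_timeout cancellation as SQLSTATE 57014.
function mapPostgresErrorSketch(error: unknown, deadlineMs: number): unknown {
  const code = (error as { code?: unknown } | null)?.code;
  return code === "57014" ? queryDeadlineExceededError(deadlineMs, { cause: error }) : error;
}
```

Resolving the deadline once at construction keeps each connector's timeout setting and its error message in agreement, since both derive from the same number.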
diff --git a/spider2-specs/specs/18-bigquery-cross-project-datasets.md b/spider2-specs/specs/18-bigquery-cross-project-datasets.md new file mode 100644 index 00000000..4dd65e2d --- /dev/null +++ b/spider2-specs/specs/18-bigquery-cross-project-datasets.md @@ -0,0 +1,418 @@ +# BigQuery cross-project dataset introspection (foreign-hosted datasets, billed in own project) + +> Refined spec. Intake draft: `todo/18-bigquery-cross-project-datasets.md`. +> +> **Scope: let the BigQuery connector introspect a dataset hosted in a *different* +> project than the one it bills jobs to.** A `dataset_ids` entry may be written +> fully-qualified as `project.dataset`; the connector introspects each entry in +> *its own* project while every job still runs in `credentials.project_id`. A +> bare `dataset` keeps today's single-project behavior unchanged. +> +> Out of scope (confirmed during refinement): the interactive `ktx setup` wizard +> is **not** expected to *discover* foreign datasets — you cannot enumerate +> datasets in a project you don't own, and the wizard doesn't know which foreign +> projects to probe. Users hand-write `project.dataset` entries (in `ktx.yaml` or +> at the dataset prompt); the connector must accept and introspect them. See +> *Non-goals*. + +## Problem + +**ktx**'s BigQuery connector derives a single `projectId` from +`credentials.project_id` and uses it for **both** job billing **and** schema +introspection. There is no way to introspect a dataset that lives in another +project, even though *querying* such a dataset already works (a cross-project +read in a `FROM` clause bills to the caller's project — that path is proven). + +Confirmed in the current connector (`packages/cli/src/connectors/bigquery/connector.ts`): + +- **`:294`** — `projectId` is read only from `credentials.project_id`. There is + no separate billing-vs-dataset project. `bigQueryConnectionConfigFromConfig` + (`:278`–`:301`) returns `datasetIds: string[]` — raw, unparsed. 
+- **`datasetIds()` (`:163`)** — returns `dataset_ids` / `dataset_id` verbatim; + it never parses a `project.` prefix. +- **`introspectDataset` (`:544`)** — calls `this.getClient().dataset(datasetId)`, + which resolves the dataset in the **client's (billing) project**, and labels + every table `catalog: this.resolved.projectId` (`:566`, `:574`) — including the + introspection-failure warning metadata (`:566`). +- **`primaryKeys` (`:591`)** — builds `INFORMATION_SCHEMA` SQL as + `` `..INFORMATION_SCHEMA.TABLE_CONSTRAINTS` `` using the + **billing** project. +- **`listTables` (`:453`)** — queries + `` ``.`region-`.INFORMATION_SCHEMA.TABLES `` against the + **billing** project and labels each row `catalog: this.resolved.projectId`. +- **`testConnection` (`:344`)** — calls `client.dataset(datasetId).get()` in the + billing project. + +### Empirical confirmation (from the intake draft) + +With a service account in project `ktx-spider2-lite`: + +- ktx's call pattern `client.dataset("austin_311")` → **`404 NotFound`** (it looks + in `projects/ktx-spider2-lite/datasets/austin_311`). +- The cross-project form `dataset("austin_311", { projectId: "bigquery-public-data" })` + → **succeeds** (public metadata is readable by any authenticated principal). +- There is **no config knob** to separate the introspection project from billing. + +### Why the table `catalog` label is load-bearing, not cosmetic + +The BigQuery dialect generates **three-part `catalog.db.name`** SQL +(`connectors/bigquery/dialect.ts:38` → `formatDialectTableName(..., 'three-part')`; +`context/connections/dialect-helpers.ts:27`–`32` emits `catalog.db.name`). The +`catalog` stored on each scanned table is therefore the project that *every* +later query targets — `sampleTable`, `sampleColumn`, `getColumnDistinctValues`, +and ref-based `executeReadOnly` all format the ref through the dialect. 
If a +foreign dataset's tables are labeled with the billing project, every one of those +queries becomes `` `billing-project`.`austin_311`.`table` `` → `404`. So labeling +the table `catalog` with the dataset's own project is a **correctness +requirement**, and it is the single lever that makes sampling, dictionary value +extraction, and `discover_data` all resolve once the snapshot is right. + +### One introspection path, no divergence + +`connectors/bigquery/live-database-introspection.ts` wraps +`KtxBigQueryScanConnector.introspect` directly, so the ingest and live-database +paths share **one** introspection implementation. The SDK already supports the +fix: `client.dataset(id, { projectId })` — `@google-cloud/bigquery@8.3.1`'s +`DatasetOptions` exposes `projectId?: string`. + +## Generic use case (independent of any benchmark) + +Analysts routinely introspect datasets they can **read but do not own and do not +bill to**: Google's `bigquery-public-data`, a partner's shared project, an +organization's central data project that a smaller team queries from its own +billing project. To make those connectable in **ktx** — so `discover_data`, the +semantic layer, dictionary sampling, and `sql_dialect_notes` all work — the +connector must introspect a foreign-hosted dataset while billing jobs in the +credentials' own project. This is a standard BigQuery deployment shape and is +wholly independent of any benchmark. + +The class to design for is "the dataset's project ≠ the billing project," and it +must generalize beyond one example: a single connection may reference datasets in +**several** foreign projects at once (e.g. one slice mixing `bigquery-public-data` +and `isb-cgc-bq`), and two different projects may host datasets with the **same +name**. The design must keep those distinct. + +## Design decisions (resolved during refinement) + +These resolve ambiguities the intake draft left open. They constrain the +implementer; the exact code is theirs. 
+ +### Carry the project inline on each dataset entry — no separate knob + +The introspection project is expressed **per dataset**, inline, as the optional +`project.` prefix on a `dataset_ids` / `dataset_id` entry. There is no new config +field. + +> Rejected alternative: a separate connection-level `dataset_project` (or +> `introspection_project`) field. It is a speculative runtime knob (against the +> repo's opinionated-defaults rule) and, more decisively, it **cannot express the +> requirement**: one connection must span *multiple* foreign projects, which a +> single global field cannot represent. The inline form also derives scope from +> the user's own declared input rather than adding a parallel setting. + +### Parse to canonical `{ project, dataset }` pairs at the config boundary + +Each entry is parsed **once**, in `bigQueryConnectionConfigFromConfig` / +`datasetIds()`, into a canonical pair: the project (when no prefix is present, +default it to `credentials.project_id`) and the bare dataset id. Every +introspection-side call site reads the resolved pair; nothing downstream re-parses +a `project.dataset` string. + +> Rejected alternative: keep `datasetIds: string[]` raw and split the prefix +> lazily at each use site (`introspectDataset`, `primaryKeys`, `listTables`, +> `testConnection`). That re-implements one rule in four places and is exactly the +> drift trap the repo's single-source-of-truth rule warns about — a later fix +> lands on one path and not another. Normalize at the boundary; carry the +> canonical form downstream. + +The internal resolved-config type (`KtxBigQueryResolvedConnectionConfig.datasetIds`) +changes shape from `string[]` to a structured pair list. That is an internal type; +the connector internals and the connector test fixtures are the only consumers. + +### Parsing rule (at the boundary) + +- An entry contains **at most one `.`**. 
+- With a dot: the segment **before** the dot is the project, validated by the + existing `normalizeBigQueryProjectId` charset + (`context/connections/bigquery-identifiers.ts`); the segment **after** is the + dataset id (validated as a normal identifier). +- Without a dot: a bare dataset; the project defaults to `credentials.project_id` + (today's behavior). +- **More than one `.`** (e.g. a stray `proj.ds.table`) is a clear config error + raised at resolution time, naming the connection — not a silent + mis-introspection. +- Legacy domain-scoped project ids that contain `:` (e.g. `example.com:proj`) stay + **out of scope**, consistent with `normalizeBigQueryProjectId`'s current charset + (which already rejects `.` and `:` in a project id). + +### Billing is never the dataset's project + +The BigQuery client is still constructed with `projectId = credentials.project_id` +(`getClient()`, `:487`–`:495`), and `createQueryJob` always bills there. Only the +*introspection* surfaces switch to the per-dataset project. Cross-project reads in +a `FROM` clause already bill to the caller — unchanged and already proven. + +### Dataset identity downstream is `(catalog, db)` + +Scanned tables are keyed by `(catalog, db, name)` throughout +(`context/scan/table-ref.ts`; `context/scan/warehouse-catalog.ts:107`). Because +the table `catalog` now holds the dataset's own project, two foreign projects that +each host an `austin_311` dataset remain distinct with no extra work — provided the +snapshot's `scope` / `metadata` also preserve the project (Requirement 6). + +### Setup-wizard scope: accept, don't discover + +The connector's region-scoped `listTables` (`:453`) is consumed **only** by the +`ktx setup` wizard's table-selection step (`setup-databases.ts`); the +ingest / `discover_data` path reads persisted snapshot JSON via +`WarehouseCatalogService.listTables`, not the connector method. 
The wizard is not +expected to enumerate foreign datasets (you can't list a project you don't own). +A `project.dataset` value hand-entered at the dataset prompt, or written into +`ktx.yaml`, must be accepted, validated, and introspected. See *Non-goals* for the +region caveat that follows from this. + +## Requirements + +### R1 — Accept and parse `project.dataset` at the config boundary + +`datasetIds()` / `bigQueryConnectionConfigFromConfig` resolve each +`dataset_ids` and `dataset_id` entry into a canonical `{ project, dataset }` pair +per the parsing rule above, defaulting `project` to `credentials.project_id` when +unprefixed. A malformed entry (more than one `.`, an empty project or dataset +segment, or a project/dataset that fails identifier validation) raises a clear +error at resolution time that names the connection id. + +### R2 — Introspect each dataset in its own project + +`introspectDataset` resolves the dataset via the **dataset's** project — +`client.dataset(datasetId, { projectId })` — for `getTables()` and each +`tableRef.get()`. This requires extending the `KtxBigQueryClient.dataset` port to +accept the project (e.g. `dataset(id, projectId)` / `dataset(id, { projectId })`) +and forwarding it from `DefaultBigQueryClientFactory`. + +### R3 — Label table `catalog` with the dataset's project + +Every table produced by `introspectDataset` is labeled `catalog: ` (not the billing project), and the introspection-failure warning +metadata (`object` / `catalog`) likewise reflects the dataset's project. This is +what makes downstream sample/distinct-value/read queries resolve. + +### R4 — Primary-key discovery targets the dataset's project + +The `primaryKeys` `INFORMATION_SCHEMA.TABLE_CONSTRAINTS` / +`KEY_COLUMN_USAGE` SQL is built against +`` `..INFORMATION_SCHEMA…` ``. (This INFORMATION_SCHEMA +view is dataset-qualified and therefore region-independent.) Its existing +soft-fail-on-denied behavior (`tryConstraintQuery`, scan warning) is preserved. 
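Taken together, R2 and R3 amount to threading the dataset's project through the client port and stamping it on every table. A minimal sketch, assuming a heavily simplified port: `DatasetRefSketch`, `BigQueryClientPortSketch`, `ScannedTableSketch`, `introspectDatasetSketch`, and `threePartNameSketch` are illustrative names, not the connector's real types, and the port is kept synchronous for brevity even though the real SDK calls return promises.

```typescript
type DatasetRefSketch = { project: string; dataset: string };
type ScannedTableSketch = { catalog: string; db: string; name: string };

// Hypothetical, simplified port. The real one wraps @google-cloud/bigquery,
// whose dataset(id, { projectId }) accepts a per-dataset project.
interface BigQueryClientPortSketch {
  dataset(datasetId: string, projectId: string): { tableNames(): string[] };
}

// R2: resolve the dataset in its own project.
// R3: label every table's catalog with that same project, not the billing project.
function introspectDatasetSketch(
  client: BigQueryClientPortSketch,
  ref: DatasetRefSketch,
): ScannedTableSketch[] {
  const names = client.dataset(ref.dataset, ref.project).tableNames();
  return names.map((name) => ({ catalog: ref.project, db: ref.dataset, name }));
}

// Three-part naming is why the catalog label is load-bearing: every later
// sample or distinct-value query targets whatever project is stored here.
function threePartNameSketch(table: ScannedTableSketch): string {
  return [table.catalog, table.db, table.name].map((part) => "`" + part + "`").join(".");
}
```

With a ref of `{ project: 'bigquery-public-data', dataset: 'austin_311' }`, the generated name points at the foreign project even though the client itself is still constructed with the billing project.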
+ +### R5 — `listTables` lists each dataset in its own project + +`listTables` returns rows labeled `catalog: ` and queries +each referenced project's region `INFORMATION_SCHEMA.TABLES`. Because a connection +can now span projects, it queries per distinct project rather than assuming one. +(This is the setup-wizard surface — see the cross-region caveat in *Non-goals*.) + +### R6 — Snapshot scope and metadata reflect multiple projects + +`introspect`'s returned snapshot keeps `metadata.project_id` = the **billing** +project, but `scope.catalogs` becomes the **distinct set of dataset projects** +actually introspected. `scope.datasets` / `metadata.datasets` must stay +unambiguous when two projects share a dataset name (e.g. carry the qualified +`project.dataset`, or otherwise preserve the project). The scoped table-name +lookup that today passes `catalog: this.resolved.projectId` (`:359`) must pass +each dataset's own project so `tableScope` / `enabled_tables` filtering still +matches. + +### R7 — `testConnection` resolves foreign datasets + +`testConnection` validates each configured dataset via its own project +(`client.dataset(datasetId, { projectId }).get()`), so a connection pointing only +at foreign datasets reports success rather than a spurious `404`. + +### R8 — Billing unchanged; bare dataset is a strict no-op + +`createQueryJob` continues to bill in `credentials.project_id`. A connection whose +`dataset_ids` are all bare (no `project.` prefix) behaves **exactly** as before: +same resolved project, same `catalog` labels, same INFORMATION_SCHEMA targets, no +behavioral change. + +### R9 — `getTableRowCount` honors the parsed entry + +`getTableRowCount`'s default-dataset handling (`:431`, today +`this.resolved.datasetIds[0]`) resolves through the canonical pair so a foreign +default dataset is introspected in its own project. 
+ +### R10 — Docs reflect the qualified form + +Document that a BigQuery `dataset_ids` / `dataset_id` entry may be written +`project.dataset` to introspect a dataset hosted in another project (billing stays +in `credentials.project_id`). Update the BigQuery rows/examples in +`docs-site/content/docs/configuration/ktx-yaml.mdx` and +`docs-site/content/docs/integrations/primary-sources.mdx` (and the dataset-scope +note in `docs-site/content/docs/cli-reference/ktx-setup.mdx`). Keep examples +copy-pasteable and follow the `fumadocs-mdx-structure` skill. + +## Acceptance criteria + +1. **Foreign single-project introspection.** With credentials in project + `ktx-spider2-lite` and `dataset_ids: ['bigquery-public-data.austin_311']`, + `ktx ingest ` introspects the tables, enriches, and samples values; + `discover_data` / `dictionary_search` return them. Tables are labeled + `catalog: 'bigquery-public-data'`. +2. **Multi-project connection.** `dataset_ids: ['bigquery-public-data.x', + 'other-project.y']` introspects **both**, each under its own project; the + snapshot's `scope.catalogs` contains both projects. +3. **Cross-project query still bills locally.** `sql_execution` of a + fully-qualified `project.dataset.table` query runs and bills in + `credentials.project_id`. +4. **Same dataset name, two projects.** `['proj-a.shared', 'proj-b.shared']` + yields two distinct dataset groups; tables do not collide. +5. **No regression.** `dataset_ids: ['my_dataset']` (or singular `dataset_id`) + behaves exactly as before — resolved under `credentials.project_id`, same + `catalog` labels and INFORMATION_SCHEMA targets. +6. **Malformed entry fails clearly.** `dataset_ids: ['proj.ds.table']` (or an + empty segment) raises a config error naming the connection, not a `404` at + scan time. +7. 
**Test coverage** (extend `packages/cli/test/connectors/bigquery/connector.test.ts`, + using the existing fake `clientFactory` harness): + - the fake `dataset()` is called with the dataset's project for a prefixed + entry, and with the billing project for a bare entry; + - a prefixed entry yields tables with `catalog: ''`; + - a mixed two-project `dataset_ids` introspects both; + - `bigQueryConnectionConfigFromConfig` rejects a multi-dot / empty-segment + entry; + - the existing single-project tests still pass unchanged. + +## Non-goals + +- **Foreign-dataset discovery in the setup wizard.** The wizard does not + enumerate datasets in projects the credentials don't own; users supply + `project.dataset` explicitly (scope decision A). +- **Cross-region `listTables`.** `listTables`' region-scoped + `region-.INFORMATION_SCHEMA.TABLES` query uses the connection-level + `location`; a foreign dataset in a *different* region than the connection's + `location` will not be listed by that wizard-facing query. This does **not** + affect ingest/`discover_data`, whose introspection path + (`introspectDataset` REST metadata + dataset-qualified PK INFORMATION_SCHEMA) is + region-independent. A per-dataset region knob is a separate spec if ever needed. +- **Domain-scoped legacy project ids** containing `:` (e.g. `example.com:proj`), + already unsupported by `normalizeBigQueryProjectId`. +- **A separate billing/introspection config field** — explicitly rejected above. + +## Implementation orientation + +Pointers from exploration; line numbers may have drifted, and the implementer owns +the design. + +- `packages/cli/src/connectors/bigquery/connector.ts` + - `datasetIds()` (`:163`) and `bigQueryConnectionConfigFromConfig` (`:278`) — + parse + canonicalize (R1); change `KtxBigQueryResolvedConnectionConfig.datasetIds` + shape. + - `KtxBigQueryClient.dataset` port (`:100`–`:110`) and + `DefaultBigQueryClientFactory.dataset` (`:130`–`:135`) — thread `projectId` + (R2). 
`getClient()` (`:487`) keeps the billing project (R8). + - `introspectDataset` (`:544`) — `dataset(id, { projectId })`, table `catalog` + + warning metadata (R2, R3). + - `primaryKeys` (`:591`) — dataset-qualified INFORMATION_SCHEMA (R4). + - `listTables` (`:453`) — per-project region INFORMATION_SCHEMA + row catalog + (R5). + - `introspect` (`:352`) — `scope.catalogs`, `scope.datasets`, scoped-name lookup + (`:359`) (R6). + - `testConnection` (`:339`) (R7); `getTableRowCount` (`:431`) (R9). +- `packages/cli/src/connectors/bigquery/live-database-introspection.ts` — wraps + `introspect`; no separate change needed (it inherits the fix). +- `packages/cli/src/context/connections/bigquery-identifiers.ts` — + `normalizeBigQueryProjectId` is the project-segment validator. +- `packages/cli/src/context/connections/dialect-helpers.ts` / + `connectors/bigquery/dialect.ts` — three-part naming; no change, but this is + *why* R3 matters. +- After implementing, rebuild and re-link so the playground picks it up: + `pnpm run build && pnpm run link:dev`. Run + `pnpm --filter @kaelio/ktx run type-check` and the connector test suite. + +## Benchmark context (motivation, not a requirement — do not encode benchmark specifics) + +Spider 2.0-Lite's **BigQuery slice (~205 questions)** is otherwise unservable +faithfully: every one of its ~74 logical databases groups datasets hosted in +foreign public projects (`bigquery-public-data`, `isb-cgc-bq`, +`data-to-insights`, …), never in a project we own. Query execution already works +cross-project; ktx-only *discovery* is the sole blocker, and it is blocked exactly +because the connector can't introspect a foreign-hosted dataset. Of 74 BQ +databases only **one** spans more than one source project, so "let `dataset_ids` +carry `project.dataset` and introspect each in its own project" covers the +benchmark and the general case alike. None of these project names belong in the +code — they are derived from the user's own `dataset_ids` input. 
+ +## Implementation notes + +Implemented on branch `write-feature-spec-wiki`. The whole change is contained in +the BigQuery connector, its identifier helpers, the connector test suite, and three +docs pages. + +**Config boundary (R1).** Added `normalizeBigQueryDatasetId` +(`packages/cli/src/context/connections/bigquery-identifiers.ts`, charset +`[A-Za-z0-9_]`) next to the existing project/region validators. In +`connectors/bigquery/connector.ts`, a single `parseBigQueryDatasetEntry(entry, +defaultProject, connectionId)` parses one entry by splitting on `.`: zero dots → +bare dataset in `defaultProject`; one dot → `project.dataset` (each segment +validated; empty segment throws); two or more dots → throws. `resolveDatasetRefs` +resolves `env:`/`file:` references first, trims/filters empties, then parses each. +`bigQueryConnectionConfigFromConfig` calls it with the billing `project_id` as the +default, so the canonical pair list is produced once at the boundary. +`KtxBigQueryResolvedConnectionConfig.datasetIds` changed from `string[]` to the new +`BigQueryDatasetRef[]` (`{ project, dataset }`). All errors name +`connections..dataset_ids entry ""`. + +**Client port (R2).** `KtxBigQueryClient.dataset` now takes +`(datasetId, projectId)`; `DefaultBigQueryClientFactory` forwards +`client.dataset(datasetId, { projectId })` (`@google-cloud/bigquery` `DatasetOptions.projectId`). +`getClient()` still constructs the client with the **billing** `project_id`, so +`createQueryJob` bills locally regardless of the dataset's project (R8, acceptance 3). 
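The config-boundary parsing rule (R1) described above can be sketched as a single pure function. This is a hedged illustration rather than the shipped `parseBigQueryDatasetEntry`: the function name carries a `Sketch` suffix, the project charset is an assumption standing in for `normalizeBigQueryProjectId`, and the error wording is invented for the example. The dataset charset `[A-Za-z0-9_]` and the zero/one/two-or-more-dots behavior are taken from the notes.

```typescript
type BigQueryDatasetRef = { project: string; dataset: string };

// Dataset charset as stated in the notes.
const DATASET_ID_PATTERN = /^[A-Za-z0-9_]+$/;
// ASSUMPTION: placeholder for normalizeBigQueryProjectId's real charset.
const ASSUMED_PROJECT_ID_PATTERN = /^[A-Za-z0-9_-]+$/;

function parseDatasetEntrySketch(
  entry: string,
  defaultProject: string,
  connectionId: string,
): BigQueryDatasetRef {
  // Illustrative message shape; every failure names the connection and the entry.
  const invalid = (reason: string) =>
    new Error(`connection "${connectionId}": invalid dataset_ids entry "${entry}" (${reason})`);

  const segments = entry.split(".");
  if (segments.length > 2) throw invalid("more than one '.'");

  // Zero dots: bare dataset in the default (billing) project.
  // One dot: explicit project.dataset.
  const project = segments.length === 2 ? segments[0] : defaultProject;
  const dataset = segments.length === 2 ? segments[1] : segments[0];

  if (!project || !dataset) throw invalid("empty project or dataset segment");
  if (!ASSUMED_PROJECT_ID_PATTERN.test(project)) throw invalid("invalid project id");
  if (!DATASET_ID_PATTERN.test(dataset)) throw invalid("invalid dataset id");
  return { project, dataset };
}
```

For example, `parseDatasetEntrySketch('bigquery-public-data.austin_311', 'my-billing', 'wh')` yields the foreign project, while a bare `'analytics'` falls back to `'my-billing'`. Because parsing happens once, nothing downstream needs to split a `project.dataset` string again.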
+ +**Per-dataset introspection (R3–R7, R9).** Every introspection site reads the +resolved pair: `introspectDataset(ref, …)` resolves `dataset(ref.dataset, ref.project)` +and labels tables (and the introspection-failure warning, via `tryIntrospectObject`'s +`catalog.db.object`) with `ref.project`; `primaryKeys(ref)` builds dataset-qualified +`` `..INFORMATION_SCHEMA…` `` SQL; `testConnection` validates each +dataset under its own project; `getTableRowCount`'s default resolves through the first +pair. `introspect` sets `scope.catalogs` to the distinct set of dataset projects and +keeps `metadata.project_id` = billing. `scope.datasets` / `metadata.datasets` use a +`qualifiedDatasetLabel` helper — bare in the billing project (so the single-project +snapshot is byte-for-byte unchanged), `project.dataset` otherwise (so two projects with +the same dataset name stay distinct, R6/acceptance 4). + +**`listTables` (R5).** Split into `listTables` (parse override entries, group by +project) and `listTablesInProject(project, region, datasets?)`. With no override it +lists the billing project's region (unchanged); with an override it runs one +region-`INFORMATION_SCHEMA.TABLES` query per distinct project, filtered to that +project's bare datasets, and labels rows with that project. The existing single-region +test is unchanged (bare entries collapse to one billing-project query). + +**Docs (R10).** Added a "Cross-project datasets" subsection to +`integrations/primary-sources.mdx` (qualified-entry example + the setup/region caveats), +plus pointers from `configuration/ktx-yaml.mdx` and `cli-reference/ktx-setup.mdx`. 
+ +**Tests.** Extended `test/connectors/bigquery/connector.test.ts`: parse-to-pairs and +malformed-entry rejection (`proj.ds.table`, `proj.`, `.ds`); a foreign-only connection +calls `dataset('austin_311', 'bigquery-public-data')`, labels tables +`catalog: 'bigquery-public-data'`, builds the client with the billing project, and keeps +`metadata.project_id` local; a mixed `['bigquery-public-data.austin_311', 'analytics']` +connection introspects both under their own projects; and `['proj_a.shared', +'proj_b.shared']` stays distinct. The internal `datasetIds`-shape assertion was updated +to the pair list; all pre-existing behavioral tests pass unchanged. + +**Verification.** `pnpm --filter @kaelio/ktx run type-check`, the connector suite +(18 tests), `test/setup-databases.test.ts` + `bigquery-identifiers.test.ts`, +`pnpm run build`, `pnpm run dead-code` (Biome + Knip default + production), +`pnpm run link:dev` (`ktx-dev` → 0.12.0), and `pre-commit` on the changed files all +pass. Acceptance criteria 1–4 are exercised by unit tests with the fake client factory; +criteria 5–6 by unit tests; criterion 3 (cross-project query bills locally) is +structurally guaranteed (single billing client) and asserted via the `createClient` +project. End-to-end ingest against live `bigquery-public-data` was not run here (no live +credentials in this worktree); the `link:dev` binary is ready for the playground agent to +validate. + +**No deviations from the spec design.** The only judgment call: `scope.datasets` +renders bare-in-billing / qualified-otherwise rather than always-qualified, chosen to +satisfy both the no-regression requirement (R8/acceptance 5) and the disambiguation +requirement (R6/acceptance 4) with one unambiguous, dot-delimited form. 
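The bare-in-billing / qualified-otherwise labeling, together with the distinct `scope.catalogs` set, can be sketched as below. The helper names carry a `Sketch` suffix because they are illustrative: the notes name `qualifiedDatasetLabel` but do not show its signature, so the parameters and the returned scope shape here are assumptions.

```typescript
type BigQueryDatasetRef = { project: string; dataset: string };

// Bare label inside the billing project keeps single-project snapshots unchanged;
// a qualified label elsewhere keeps same-named datasets in different projects apart.
function qualifiedDatasetLabelSketch(ref: BigQueryDatasetRef, billingProject: string): string {
  return ref.project === billingProject ? ref.dataset : `${ref.project}.${ref.dataset}`;
}

// Hypothetical scope shape: distinct dataset projects plus one label per dataset.
function snapshotScopeSketch(refs: BigQueryDatasetRef[], billingProject: string) {
  return {
    catalogs: Array.from(new Set(refs.map((ref) => ref.project))),
    datasets: refs.map((ref) => qualifiedDatasetLabelSketch(ref, billingProject)),
  };
}
```

An always-qualified label would also disambiguate, but it would change the persisted output of every existing single-project connection, which is what the no-regression requirement rules out.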
diff --git a/spider2-specs/specs/19-durable-bounded-relationship-detection.md b/spider2-specs/specs/19-durable-bounded-relationship-detection.md new file mode 100644 index 00000000..3aecf45b --- /dev/null +++ b/spider2-specs/specs/19-durable-bounded-relationship-detection.md @@ -0,0 +1,471 @@ +# Durable, resumable, bounded relationship detection during ingest enrichment + +> Refined spec. Intake draft: `todo/19-durable-bounded-relationship-detection.md`. +> +> **Scope: make the expensive part of ingest enrichment survive an interrupted +> relationship stage.** Today the paid LLM descriptions + embeddings only become +> durable and queryable after the slowest, most-killable, least-valuable stage +> (relationship detection) also finishes. This spec moves the persistence boundary +> to the cost boundary, makes stage resume work across runs, and bounds + observes +> the one open-ended stage — the durability companion to spec 16 (bounded query +> execution), which this spec composes with rather than replaces. 
+ +## Problem + +Three compounding failure modes, all confirmed in the current code, share one root +cause: **the three enrichment stages are treated as a single atomic unit for +persistence, identity, and bounding, even though they differ radically in cost, +durability value, runtime, and likelihood of being killed.** + +`runLocalScanEnrichment` (`context/scan/local-enrichment.ts:472`) runs three stages +in a fixed order through `runEnrichmentStage` (`:413`): + +| stage | order | cost | durability value | runtime on a large schema | likely to be killed | +|-------|-------|------|------------------|---------------------------|---------------------| +| `descriptions` (`:524`) | 1st | high — one paid LLM call per table | high | minutes | low | +| `embeddings` (`:553`) | 2nd | medium | high | seconds–minutes | low | +| `relationships` (`:587`) | 3rd | low — best-effort joins | low | **minutes, silent** | **high** | + +The slowest, most-killable, least-valuable stage runs **last**, and it gates the +durability of the two expensive stages held in memory before it. + +### 1. Enrichment is lost if relationship detection is interrupted + +The queryable artifact agents search and execute against is the `_schema` manifest +YAML (`semantic-layer//_schema/*.yaml`). It is written **twice**: + +- bare (native column comments only) early, at `local-scan.ts:473` + (`writeLocalScanManifestShards`), before enrichment runs; and +- rewritten **with AI descriptions + accepted joins** by + `writeLocalScanEnrichmentArtifacts` (`local-enrichment-artifacts.ts:310`), called + from `local-scan.ts:510` **after** `runLocalScanEnrichment` returns — i.e. after + all three stages. + +So the descriptions and embeddings reach the queryable layer only via that single +terminal write. 
If the process is killed/crashes/times out **during** the +`relationships` stage, `runLocalScanEnrichment` never returns, the terminal write +never runs, and the in-memory descriptions + embeddings are discarded — the +`_schema` retains only the bare native comments from the `:473` write. + +Empirically (intake draft): ingesting a 95-table BigQuery dataset produced full +descriptions + embeddings (progress reached "Building embeddings 17/17"), then the +relationship stage ran silently past a supervising deadline and was killed; the +persisted `_schema` had **0** AI descriptions. The most expensive work is the most +likely to be thrown away. + +> A stage-state store (below) does save each completed stage's output to an +> internal SQLite cache as the stage finishes — so the descriptions are not lost to +> the *resume cache*. They are simply never **promoted** to the queryable `_schema` +> until the terminal write. The data survives somewhere the agent cannot query, and +> (per failure mode 2) cannot be reused on the next run either. + +### 2. Re-running does not resume — it re-spends + +`runEnrichmentStage` resolves a completed stage with +`findCompletedStage({ runId, stage, inputHash })` (`local-enrichment.ts:427`), and +the store keys on **`runId`**: `SqliteLocalScanEnrichmentStateStore` declares +`PRIMARY KEY (run_id, stage)` and filters lookups by `run_id` +(`sqlite-local-enrichment-state-store.ts:83,91–115`). `runId` is minted fresh per +ingest invocation (`record.runId`). The cache therefore only resolves *within* one +run; re-running an interrupted ingest gets a new `runId`, misses every cached +stage, and **recomputes descriptions + embeddings from scratch** — re-paying for +LLM work that already succeeded. + +The store already computes and persists `inputHash` next to `runId` — +a stable `sha256` of `{ snapshot, mode, detectRelationships, providerIdentity, +relationshipSettings }` (`enrichment-state.ts:78`). 
The correct content key is +already on the row; the lookup just uses the volatile column. This is a keying +defect, not a missing capability. + +### 3. Relationship detection is unobservable and unbounded + +`discoverKtxRelationships` (`context/scan/relationship-discovery.ts:218`) profiles a +row sample of **every enabled table** (`profileKtxRelationshipSchema`, +`relationship-profiling.ts:320` — one sampled query per table at +`profileConcurrency`, default 4), validates candidate joins +(`relationship-validation.ts:237` — one coverage query per candidate), and detects +composite keys (`relationship-composite-candidates.ts:515` — per-table plus +cross-table queries). None of the controls the rest of the scan pipeline relies on +were ever wired into this stack: + +- **No progress.** `discoverKtxRelationships` does not accept a progress port; the + caller can only emit start/end around it (`local-enrichment.ts:600,611` — + `update(0, 'Detecting relationships')` … `update(1, 'found N')`). Minutes of + silence between. +- **No honored cancellation.** `KtxScanContext.signal` exists on the contract + (`types.ts`) but **no sub-stage reads it**. +- **No time budget.** Validation has a *count* budget (`validationBudget`, default + `min(2 × tableCount, 1000)`); profiling and composite detection have none. On a + schema with hundreds–thousands of tables, profiling is O(tables) silent queries + with no internal stop condition. + +A supervisor watching for liveness cannot tell a slow-but-working profile from a +true hang, and nothing inside the stage will voluntarily stop — so on a very large +schema it runs far past any reasonable deadline and is killed (which, via failure +mode 1, takes the descriptions with it). 
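To make the keying defect from failure mode 2 concrete, here is a minimal sketch that contrasts a run-keyed lookup with a content-keyed one. Everything in it is an illustrative assumption: `StageRow`, `findByRun`, and `findByContent` are hypothetical names over an in-memory array, not the real store API, and `inputHash` is treated as an opaque string.

```typescript
type Stage = 'descriptions' | 'embeddings' | 'relationships';

// Hypothetical in-memory shape of one cached stage row (not the real table).
interface StageRow {
  runId: string;
  connectionId: string;
  stage: Stage;
  inputHash: string;
  updatedAt: number;
}

// Run-keyed lookup: a re-run mints a fresh runId, so this always misses.
function findByRun(
  rows: StageRow[],
  runId: string,
  stage: Stage,
  inputHash: string,
): StageRow | undefined {
  return rows.find(
    (r) => r.runId === runId && r.stage === stage && r.inputHash === inputHash,
  );
}

// Content-keyed lookup: runId is ignored; the most recent matching row wins.
function findByContent(
  rows: StageRow[],
  connectionId: string,
  stage: Stage,
  inputHash: string,
): StageRow | undefined {
  return rows
    .filter(
      (r) =>
        r.connectionId === connectionId &&
        r.stage === stage &&
        r.inputHash === inputHash,
    )
    .reduce<StageRow | undefined>(
      (latest, row) =>
        latest === undefined || row.updatedAt > latest.updatedAt ? row : latest,
      undefined,
    );
}
```

With rows saved by an interrupted run, a second run holding a new `runId` misses under `findByRun` but hits under `findByContent` as long as the schema and config (and therefore the hash) are unchanged.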
+ +## Generic use case (independent of any benchmark) + +Any context layer that enriches a real warehouse with paid LLM work must make that +work durable the instant it is produced, resume it across process restarts without +re-paying, and bound the open-ended profiling stage so a large catalog cannot hang +ingest indefinitely. A data team ingesting a 500-table production warehouse over a +flaky connection, a rate-limited LLM budget, or a CI step with a wall-clock limit +hits all three failure modes regardless of any benchmark. This is general +durability and cost hygiene for the ingest pipeline; the benchmark only made it +acute at scale. + +## Design decisions (resolved during refinement) + +These resolve ambiguities the intake draft left open. They constrain the +implementer; the exact code is theirs (requirement-level, per the specs README). + +### D1 — Checkpoint queryable artifacts at the cost boundary, before relationships + +As soon as the last non-relationship stage completes — `embeddings` when an +embedding provider is configured, otherwise `descriptions` — persist the +descriptions + embeddings into the **queryable** `_schema` manifest (and the raw +`descriptions.json` / `embeddings.json` enrichment artifacts), **before** the +`relationships` stage runs. The relationship stage then writes its joins on top: the +manifest builder already re-reads and preserves existing descriptions and +manual/inferred joins on rewrite (`loadExistingManifestState`, +`local-enrichment-artifacts.ts:196`), so the second write is additive, not +destructive. + +Net invariant: **the descriptions + embeddings are always durable and queryable the +moment they are computed**, even if relationship detection then fails, is +interrupted, is budget-truncated, or is skipped. A failed/partial/skipped +relationship stage degrades to "no joins" or "partial joins" — **never** to "no +descriptions." This is the inverse guarantee the current terminal-write ordering +violates. 
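The ordering that D1 asks for can be sketched as follows. This is a simplified illustration under stated assumptions, not the real orchestration: `runEnrichmentWithCheckpoint`, `EnrichmentSteps`, and the result shape are hypothetical, embeddings are omitted for brevity, and a thrown error stands in for a relationship-stage failure. A hard kill cannot be caught at all, which is exactly why the checkpoint has to run before the stage starts.

```typescript
type Descriptions = Record<string, string>;

interface EnrichmentResult {
  descriptions: Descriptions;
  joins: string[];
  relationshipsFailed: boolean;
}

// Hypothetical seams: each step is injected so the ordering is easy to see.
interface EnrichmentSteps {
  describe: () => Promise<Descriptions>;
  onCheckpoint: (descriptions: Descriptions) => Promise<void>;
  detectRelationships: () => Promise<string[]>;
}

async function runEnrichmentWithCheckpoint(
  steps: EnrichmentSteps,
): Promise<EnrichmentResult> {
  const descriptions = await steps.describe();

  // Promote the paid work to the queryable layer before the risky stage runs.
  await steps.onCheckpoint(descriptions);

  try {
    const joins = await steps.detectRelationships();
    return { descriptions, joins, relationshipsFailed: false };
  } catch {
    // Degrade to "no joins", never to "no descriptions".
    return { descriptions, joins: [], relationshipsFailed: true };
  }
}
```

Because the checkpoint callback is awaited first, whatever it wrote is already durable by the time relationship detection can fail.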
+ +The bare `:473` manifest write stays — it is the queryable schema for the +no-providers / enrichment-disabled path. The checkpoint is an additional write that +runs only when enrichment produced descriptions. + +> Orientation (the implementer owns the seam): the lowest-coupling shape is a +> checkpoint hook — `runLocalScanEnrichment` invokes a caller-supplied callback once +> the last non-relationship stage completes, and `local-scan.ts` supplies a callback +> that calls the existing `writeLocalScanEnrichmentArtifacts` for the +> descriptions + embeddings + manifest only (no generated joins yet). The final +> write after the relationship stage proceeds as today. Relationship-specific +> artifacts (`relationships.json`, `relationship-profile.json`, +> `relationship-diagnostics.json`) are written by the final/relationship write, not +> the checkpoint, so the checkpoint never emits misleading empty relationship +> diagnostics. +> +> Rejected alternative: move all artifact writing inside `runLocalScanEnrichment` +> (inject the file store / project). That couples the enrichment module to +> persistence for no gain — the writer already lives in `local-scan.ts` and the +> checkpoint needs only a one-line hook, not a relocation. + +### D2 — Resume by content identity, not by `runId` + +Re-key completed-stage resolution on **`(connectionId, stage, inputHash)`**, +independent of `runId`, so a re-run with an unchanged schema and config resumes the +finished `descriptions` / `embeddings` stages from cache and re-runs only what +actually failed. `inputHash` is already the content fingerprint; `connectionId` +scopes it to the right source. When several rows share a content identity (one per +prior run), the most recent `updatedAt` wins. + +`runId` stays on the stored row for diagnostics and for `listRunStages`, but leaves +the uniqueness/lookup key. + +The state store is a **disposable local resume cache** (`.ktx` local state, +regenerable from a fresh ingest). 
Re-key it with **no migration bridge** — recreate +the table if its on-disk shape differs from the new `(connection_id, stage, +input_hash)` key, consistent with ktx's no-backward-compatibility policy. Losing the +old cache only means one ingest cannot resume; it never corrupts a queryable +artifact. + +> Rejected alternative: include `syncId` or `mode` in the key. `mode` and the rest +> are already folded into `inputHash`; adding them again would only narrow the key +> and re-break cross-run resume when an incidental field differs. + +### D3 — Make the relationship stage observable and bounded + +Thread three things the rest of the pipeline already supports through +`discoverKtxRelationships` into profiling, validation, and composite detection: + +- **Progress** through the existing progress port (the relationship phase is + already `progress?.startPhase(0.25)` at `local-enrichment.ts:586`): emit per-unit + liveness — "Profiling table K/N", "Validating candidate K/M", and the equivalent + for composite probing — so a supervisor can distinguish slow-but-working from + hung. +- **A flat wall-clock budget** for the whole relationship stage: a new + `scan.relationships.detectionBudgetMs`, a positive integer of milliseconds, + project-level, validated like the other `scan.relationships` fields, **default + 600_000 (10 min), enforced by default.** Checked at unit boundaries (before each + table profile, each candidate validation, each composite probe). It sits **above** + spec 16's per-query deadline (default 30s): each individual query is already + bounded; this bounds the *sum* of them. +- **Honored cancellation:** where `KtxScanContext.signal` is available, the same + unit-boundary check honors it, so external cancellation stops the stage too. 
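The unit-boundary check described in the list above can be sketched like this. It is a minimal illustration, not the shipped design: `createDetectionBudget`, `profileTables`, and the `StopReason` values are hypothetical names, the loop is sequential rather than concurrent, and the clock is injectable only so the behavior can be exercised deterministically.

```typescript
type StopReason = 'budget_exhausted' | 'aborted';

interface DetectionBudget {
  // Returns null while work may continue, otherwise the (sticky) stop reason.
  check(): StopReason | null;
}

function createDetectionBudget(opts: {
  budgetMs: number;
  signal?: AbortSignal;
  now?: () => number;
}): DetectionBudget {
  if (!Number.isInteger(opts.budgetMs) || opts.budgetMs <= 0) {
    throw new Error('detectionBudgetMs must be a positive integer');
  }
  const now = opts.now ?? Date.now;
  const deadline = now() + opts.budgetMs;
  let stopped: StopReason | null = null;
  return {
    check() {
      if (stopped !== null) return stopped; // once stopped, stay stopped
      if (opts.signal?.aborted) stopped = 'aborted';
      else if (now() >= deadline) stopped = 'budget_exhausted';
      return stopped;
    },
  };
}

// Profiles tables one at a time, checking the budget before each unit of work.
function profileTables<T>(
  tables: string[],
  budget: DetectionBudget,
  profile: (table: string) => T,
  onProgress?: (message: string) => void,
): { results: T[]; partial: { reason: StopReason } | null } {
  const results: T[] = [];
  for (let i = 0; i < tables.length; i++) {
    const reason = budget.check();
    if (reason !== null) return { results, partial: { reason } };
    onProgress?.(`Profiling table ${i + 1}/${tables.length}`);
    results.push(profile(tables[i]));
  }
  // The budget is only consulted while work remains, so a deadline that
  // elapses after the last unit does not mark a complete result as partial.
  return { results, partial: null };
}
```

Checking only at unit boundaries keeps the budget cheap and leaves in-flight work to the per-query deadline, at the cost of overshooting the budget by at most one unit's duration per worker.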
+ +On budget exhaustion or abort: stop scheduling new work, let in-flight queries +finish (each already bounded by spec 16), finalize with the relationships found so +far, and return a **partial** result — never an unbounded hang and never an +exception that would lose the checkpointed descriptions. + +> Rejected alternative — per-table-scaled budget (N seconds × table count). It is a +> second formula to reason about and "more tables → more budget" partly re-opens the +> unbounded door this requirement closes. One flat, generous, project-level number +> matches how the other `scan.relationships` knobs are shaped and is enough for a +> best-effort stage whose partial output is durable and improvable (D4). +> +> Rejected alternative — a global `KTX_RELATIONSHIP_BUDGET_MS` env knob or a +> per-call override. One opinionated project-level default with a config override is +> the canonical ktx shape; no second runtime path. + +### D4 — A budget-truncated partial is a successful, cached, completed stage + +A graceful budget stop is **not** a failure. The relationship stage saves its +partial result like any completed stage (so a plain re-run resumes it for free, no +re-querying) and marks it `partial` with a reason in the relationship diagnostics +plus a recoverable scan warning. Because `detectionBudgetMs` lives in +`relationshipSettings ⊂ inputHash`, **raising the budget changes the content +identity and triggers a fresh, fuller run** — that is the only "try harder" +mechanism, with no extra flag or runtime path. + +Distinguish the two stop kinds: + +- **Process killed mid-stage** (crash / SIGKILL / supervisor): nothing is saved as + completed, so the next run recomputes the relationship stage (after resuming + descriptions/embeddings from cache via D2). This is the primary durability path. +- **Graceful budget/abort stop**: a partial *is* saved as completed-partial and + resumed cheaply on re-run, unless the budget is raised. + +## Requirements + +### 1. 
Checkpoint descriptions + embeddings before relationship detection + +The descriptions and embeddings MUST be persisted into the durable, queryable +`_schema` manifest (and the raw enrichment artifacts) as soon as the last +non-relationship stage completes, before the `relationships` stage runs. +Relationship detection appends/merges its joins on completion. The expensive LLM + +embedding enrichment MUST be queryable even if the relationship stage subsequently +fails, is interrupted, is budget-truncated, or is skipped. A failed/partial/skipped +relationship stage MUST degrade to "no/partial joins," never to "no descriptions." + +### 2. Stage resume resolves by content identity across runs + +Completed-stage resolution MUST key on `(connectionId, stage, inputHash)`, +independent of `runId`, so re-running an interrupted ingest resumes the finished +`descriptions` / `embeddings` stages from cache and re-runs only what failed. +Re-running after an interruption MUST NOT re-issue LLM description or embedding +calls for stages that already completed. The resume cache MAY be recreated without a +migration bridge if its schema changes (it is disposable local state). + +### 3. Relationship detection emits progress and honors a wall-clock budget + +The relationship stage MUST emit per-unit progress through the existing progress +port (at minimum per-table during profiling and per-candidate during validation) so +liveness is observable. It MUST enforce a flat wall-clock budget +(`scan.relationships.detectionBudgetMs`, default 600_000 ms, project-level, +overridable, validated as a positive integer) checked at unit boundaries and layered +above spec 16's per-query deadline, and MUST honor `KtxScanContext.signal` where +available. On budget exhaustion or abort it MUST stop scheduling new work, finalize +with the relationships found so far, and return a partial result rather than running +unboundedly or throwing. + +### 4. 
A budget-truncated relationship result is durable and marked partial + +A graceful budget/abort stop MUST persist the partial relationship result as a +completed stage (so a plain re-run resumes it without re-querying) and MUST mark it +`partial` — in the relationship diagnostics artifact and as a recoverable scan +warning — so downstream consumers can see the joins are incomplete. Raising +`detectionBudgetMs` (which changes `inputHash`) MUST cause a fresh, fuller +relationship run; no separate flag is introduced for "redo." A process killed +mid-stage MUST NOT leave a completed record (so it recomputes on re-run). + +### 5. No regression for small or uninterrupted ingests + +A small or single-run ingest that is never interrupted MUST produce the same +artifacts and the same relationship output as today. The checkpoint write MUST be +idempotent with the final write (descriptions survive the join rewrite); the budget +default MUST be generous enough that normal and large-but-tractable schemas complete +relationship detection fully, hitting the budget only on pathological scale. + +## Acceptance criteria + +- **Durability across interruption:** interrupting an ingest **during** relationship + detection still leaves a queryable semantic layer carrying the table/column + descriptions + embeddings that were generated (verified: re-open the connection; + AI descriptions are present in `_schema`, not just native comments). +- **Resume does not re-spend:** re-running an interrupted ingest does **not** + regenerate descriptions/embeddings whose stage already completed (verified: no LLM + description calls and no embedding calls for the cached tables; only the failed + stage re-runs). Resolution is by `(connectionId, stage, inputHash)`, so the resume + survives a fresh `runId`. 
+- **Observable + bounded relationships:** a connection with hundreds of tables emits + relationship-stage progress (per-table profiling, per-candidate validation) and + completes within `detectionBudgetMs`; when the budget is hit, the stage stops + gracefully and persists the partial relationships found so far — without + discarding enrichment — marked `partial` in diagnostics and via a recoverable + warning. +- **Partial is cached and improvable:** re-running with an unchanged budget resumes + the partial relationship result from cache (no re-querying); raising + `detectionBudgetMs` triggers a fresh, fuller relationship run. +- **Budget validation:** `detectionBudgetMs` defaults to 600_000, honors a project + override, and rejects an invalid value (zero / negative / non-integer) as a clear + `ktx.yaml` config error. +- **No regression:** small/single-run ingests behave exactly as before — identical + artifacts and relationship output when nothing is interrupted; the checkpoint + + final writes leave descriptions intact alongside the generated joins. + +## Non-goals + +- **Bounding the descriptions stage's per-table LLM call.** Whether an individual + enrichment LLM call can wedge is a separate concern (already being addressed in the + working tree via a per-table enrichment timeout). This spec ensures whatever + descriptions *did* complete are durable; it does not own the per-call timeout. +- **Changing relationship-detection quality, thresholds, or the candidate/validation + algorithm.** The accept/review thresholds, scoring, and the existing + `validationBudget` count cap are unchanged; this spec adds durability, + cross-run resume, progress, and a time budget around them. +- **A per-connection or per-call relationship budget, or a global env override.** + One flat project-level `detectionBudgetMs`; no second runtime path (D3). 
+- **A new per-query timeout.** Spec 16 already bounds individual queries; this spec + composes above it and does not re-implement query-level deadlines. +- **Replacing the per-query deadline with the stage budget, or vice versa.** They + are independent and layered: a single query is bounded by spec 16; the stage's sum + is bounded by `detectionBudgetMs`. +- **A general checkpoint framework for every ingest stage.** The checkpoint is + specifically the descriptions+embeddings → queryable-manifest promotion before + relationships; it is not a generic per-stage artifact-flush abstraction. + +## Implementation orientation + +Line numbers drift; treat these as anchors, not addresses. The implementer owns the +design. + +- **Enrichment orchestration** — `context/scan/local-enrichment.ts`: + `runLocalScanEnrichment` (`:472`), the three `runEnrichmentStage` calls + (`descriptions` `:524`, `embeddings` `:553`, `relationships` `:587`), + `runEnrichmentStage` (`:413`) and its `findCompletedStage` lookup (`:427`). Add the + checkpoint hook after the last non-relationship stage; thread the progress port, + signal, and budget into the relationship stage. +- **Scan driver / write ordering** — `context/scan/local-scan.ts`: bare manifest + write (`:473`), enrichment call (`:492`, currently passing only + `{ runId, progress }` as `context` — wire `signal` through here too), terminal + `writeLocalScanEnrichmentArtifacts` (`:510`), and the enrichment-failure catch + (`:530`, which after D1 no longer loses descriptions). Supply the checkpoint + callback here. +- **Artifact writer** — `context/scan/local-enrichment-artifacts.ts`: + `writeLocalScanEnrichmentArtifacts` (`:310`), `writeLocalScanManifestShards` + (`:270`), and the description-preserving merge in `loadExistingManifestState` + (`:196`) — the basis for the additive checkpoint/final write. 
+- **Resume cache** — `context/scan/sqlite-local-enrichment-state-store.ts`: + `PRIMARY KEY (run_id, stage)` (`:83`), `findCompletedStage` (`:91`), + `saveCompletedStage` (`:117`). Re-key on `(connection_id, stage, input_hash)`, + pick latest `updated_at`, recreate the table if shape differs (disposable cache). + Lookup interface `KtxScanEnrichmentStageLookup` and `findCompletedStage` + in `context/scan/enrichment-state.ts` (`:10,46`); `computeKtxScanEnrichmentInputHash` + (`:78`). +- **Relationship stack (progress + budget + signal)** — + `context/scan/relationship-discovery.ts` (`discoverKtxRelationships` `:218`, accept + a progress port and budget/deadline + signal), + `context/scan/relationship-profiling.ts` (`profileKtxRelationshipSchema` `:320` — + per-table progress + budget check), + `context/scan/relationship-validation.ts` (`validateKtxRelationshipDiscoveryCandidates` + `:237` — per-candidate progress + budget check, alongside the existing + `validationBudget`), + `context/scan/relationship-composite-candidates.ts` + (`discoverKtxCompositeRelationships` `:515` — budget check). +- **Config** — `context/project/config.ts` `scan.relationships` + (`KtxScanRelationshipConfig`, `:171–213`): add `detectionBudgetMs` (positive + integer ms, default 600_000) to the zod schema and the default config builder. +- **Partial marker** — `context/scan/relationship-diagnostics.ts` + (`buildKtxRelationshipDiagnostics`, the profile/diagnostics artifact shape) carries + a `partial` flag + reason; add a recoverable warning code to the + `KtxScanWarningCode` union in `context/scan/types.ts` (e.g. + `relationship_detection_partial`). +- **Tests** — durability: a fixture ingest interrupted during the relationship stage + leaves AI descriptions in the queryable `_schema`. Resume: a second run with a + fresh `runId` and unchanged `inputHash` resolves the cached descriptions/embeddings + (assert no LLM/embedding calls) and re-runs only relationships. 
Budget: a schema + large enough (or a tiny `detectionBudgetMs` as the test seam) hits the budget, + emits per-unit progress, returns partial, persists it marked `partial`, and a + re-run resumes the partial; raising the budget re-runs. Resolver/config unit tests + for `detectionBudgetMs` (default / override / invalid). Regression: small + uninterrupted ingest yields identical artifacts and relationship output. +- After implementing, rebuild and re-link so the playground picks it up: + `pnpm run build && pnpm run link:dev`. + +## Benchmark context (motivation, not a requirement) + +The Spider 2.0-Lite BigQuery slice has datasets with hundreds–thousands of tables +(`ebi_chembl` 785, `fec` 486, `ga360` 366, …). Enriching them with claude-code +costs real, rate-limited LLM budget; losing that enrichment to a relationship-stage +interruption — and re-spending it on every retry — makes large-schema ingest +impractical, and an unbounded profiling stage runs past any supervising deadline and +is killed. This is a general durability/cost property of the ingest pipeline, +independent of the benchmark; the benchmark only made it acute at scale. Do not +encode any benchmark specifics in the implementation. + +## Implementation notes + +Implemented on branch `write-feature-spec-wiki` (ktx worktree `tallinn-v2`). All +four design decisions shipped; no deviations from the resolved design. + +**D2 — resume by content identity** (`sqlite-local-enrichment-state-store.ts`, +`enrichment-state.ts`, `local-enrichment.ts`): the stage table is re-keyed to +`PRIMARY KEY (connection_id, stage, input_hash)`; `findCompletedStage` looks up by +`(connectionId, stage, inputHash)` ordered by `updated_at DESC` (most recent +content identity wins). `KtxScanEnrichmentStageLookup.runId` became `connectionId`; +`runId` stays on the row for diagnostics/`listRunStages`. 
The store drops and +recreates the table when the on-disk primary key differs (disposable cache, no +migration bridge), detected via `PRAGMA table_info`. + +**D3 — observable + bounded relationship stage** (new +`relationship-detection-budget.ts`): a sticky `KtxRelationshipDetectionBudget` +(`check()`/`stopReason()`) built from `detectionBudgetMs` + `ctx.signal` + an +injectable `now`, plus `mapWithBudget` (a budget-aware concurrent map that +generalizes and replaces the old `mapWithConcurrency`). Threaded through +`discoverKtxRelationships` → profiling (per-table progress + budget stop), +validation (per-candidate progress + budget stop; budget-skipped candidates +degrade to the existing `validation_unattempted` review), and composite +detection (budget stops at PK-detection and coverage-probe boundaries). +`discoverKtxRelationships` now accepts `progress` and `now` and returns +`partial: { reason } | null`. The clock check fires only when work remains, so a +deadline elapsing after the last unit never marks a fully-processed stage partial. + +**D1 — checkpoint before relationships** (`local-enrichment.ts`, +`local-enrichment-artifacts.ts`, `local-scan.ts`): `runLocalScanEnrichment` fires a +caller-supplied `onCheckpoint` once descriptions/embeddings complete and before +the relationship stage runs, gated on `shouldDetectRelationships` so the +no-relationship path keeps a single write. `local-scan.ts` supplies a callback +calling the new `writeLocalScanEnrichmentCheckpoint` (descriptions.json + +embeddings.json + manifest with descriptions and no generated joins — no +relationship artifacts, so no misleading empty diagnostics). The shared +description/embedding JSON writer was factored out so checkpoint and final writes +stay one implementation. `ctx.signal` is now threaded from `RunLocalScanOptions` +into the enrichment context (completing the existing `KtxScanContext.signal` +contract already read by the budget and the in-flight description timeout). 
+ +**D4 — partial is durable + marked** (`relationship-diagnostics.ts`, +`local-enrichment.ts`, `local-enrichment-artifacts.ts`): the diagnostics artifact +carries `partial` + `partialReason`; `runLocalScanEnrichment` pushes a recoverable +`relationship_detection_partial` warning (new `KtxScanWarningCode`) when truncated. +A graceful budget/abort stop returns normally, so the relationship stage saves as a +completed-partial record and resumes cheaply; a process killed mid-stage saves +nothing and recomputes. Raising `detectionBudgetMs` changes `inputHash` +(it lives in `relationshipSettings`), forcing a fresh, fuller run — the only +"try harder" mechanism, no extra flag. + +**Config** (`config.ts`): `scan.relationships.detectionBudgetMs`, positive integer +ms, default `600_000`, validated like the other relationship fields. Documented in +`docs-site/content/docs/configuration/ktx-yaml.mdx`. + +**Tests** (all green): budget unit tests (`relationship-detection-budget.test.ts`); +cross-run resume + table-recreate (`enrichment-state.test.ts`, +`local-enrichment.test.ts`); progress/budget/abort partial +(`relationship-discovery.test.ts`); partial persisted/resumed/re-run-on-raise + +checkpoint ordering + no-checkpoint-when-skipped (`local-enrichment.test.ts`); +end-to-end durability — a relationship-stage failure still leaves AI descriptions +in the queryable `_schema` (`local-scan.test.ts`); diagnostics partial flag +(`relationship-diagnostics.test.ts`); config default/override/invalid +(`config.test.ts`). `pnpm --filter @kaelio/ktx type-check`, `pnpm run dead-code`, +and `pnpm run build && pnpm run link:dev` all pass. (Pre-existing and unrelated: +three `analytics-skill-content.test.ts` markdown-structure assertions fail on this +branch from earlier analytics-skill commits — untouched here.) 
diff --git a/spider2-specs/specs/20-resilient-enrichment-under-slow-llm.md b/spider2-specs/specs/20-resilient-enrichment-under-slow-llm.md new file mode 100644 index 00000000..1f4ad022 --- /dev/null +++ b/spider2-specs/specs/20-resilient-enrichment-under-slow-llm.md @@ -0,0 +1,533 @@ +# Resilient enrichment under a slow/hung LLM backend + +> Refined spec. Intake draft: `todo/20-resilient-enrichment-under-slow-llm.md`. +> +> **Scope: make the descriptions enrichment stage survive a hung LLM backend and +> an interrupted run.** Two compounding gaps live *inside* the per-table +> description-enrichment path: (1) the per-table LLM timeout fires in JS but does +> not terminate a wedged subprocess backend, so a hung table wedges the whole +> stage indefinitely; (2) descriptions are persisted only at full-stage +> completion, so any interruption discards every already-enriched table. This is +> the enrichment-stage analog of spec 16 (enforced query cancellation — a deadline +> that *stops the work*, not just abandons the promise) and spec 19 (move the +> durability boundary to the cost boundary so expensive LLM work is not lost). It +> composes with both rather than replacing them. + +## Problem + +Two compounding failure modes on the per-table description-enrichment path, both +confirmed in the current code and observed end-to-end together. Their union turned +a single hung table into an indefinite wedge *plus* total loss of an entire +stage's LLM work. + +### 1. The per-table LLM timeout does not terminate the work + +`KtxDescriptionGenerator.generateBatchedTableDescriptions` +(`context/scan/description-generation.ts`, the bounded call ~760–866) wraps the +per-table `this.llmRuntime.generateObject(...)` call in `retryAsync` with a fresh +`AbortSignal.timeout(KTX_ENRICH_LLM_TIMEOUT_MS)` per attempt (commit `01f63380`). +A fired timeout is surfaced as `KtxAbortedError` so it is **not** retried (one +wedge stays one timeout, not 3×). 
That is the correct policy — but the abort never +actually stops a subprocess backend, so the timeout is cosmetic. + +The runtime is selected by the `backend` config field +(`context/llm/local-config.ts`, `KTX_LLM_BACKENDS = +['none','anthropic','vertex','gateway','claude-code','codex']`). Two backends spawn +a **child process the SDK owns** and to which ktx hands only an `AbortSignal`: + +- **`codex`** (`@openai/codex-sdk`, via `context/llm/codex-runtime.ts` → + `codex-sdk-runner.ts`): the SDK runs `spawn(executable, args, { signal })`. Node's + `spawn` signal-option sends the child **SIGTERM** (not SIGKILL) on abort, and the + SDK consumes the child's stdout with `for await (const line of rl)`, re-throwing + the abort error **only after that loop ends**. A child wedged on a hung provider + socket survives SIGTERM → its stdout never closes → the readline loop never ends + → the SDK never throws → ktx's `await generateObject` **never settles**, past the + per-attempt timeout, indefinitely. The child leaks (open provider connections, + ~0% CPU). +- **`claude-code`** (`@anthropic-ai/claude-agent-sdk`, via + `context/llm/claude-code-runtime.ts`, `collectResult` ~275–322): on abort it calls + best-effort `queryResult.interrupt?.()` (errors swallowed) and only checks + `throwIfAborted` **between** streamed messages. A wedged child emits no message, so + the `for await (const message of queryResult)` loop blocks and the graceful + `interrupt()` may never land — the same hang class. + +By contrast, **HTTP backends** (`anthropic`/`vertex`/`gateway`/`openai`, via +`context/llm/ai-sdk-runtime.ts`) pass `abortSignal` straight to the AI SDK's +`generateObject`, which cancels the underlying `fetch` natively — the await settles +promptly and there is no child to leak. + +So ktx holds **no kill handle** on the subprocess backends, and SIGTERM is too +gentle for a wedged child. 
Spec 16's mechanism (ktx *itself* forks +`read-query-child` and `SIGKILL`s it) works precisely because ktx owns the fork — +which it does not here. + +Observed (BigQuery ingest, codex backend, 2026-06-23): with +`KTX_ENRICH_LLM_TIMEOUT_MS=1800000` (30 min, an operator override), two of +`covid19_usa`'s 252-column tables hung; the stage sat at **268/285 for 41+ +minutes** — well past the 30-min per-attempt timeout — with exactly two codex +children, each holding 3 ESTABLISHED connections at ~0% CPU, until killed by hand. + +### 2. Descriptions are persisted only at full-stage completion + +`generateDescriptions` (`context/scan/local-enrichment.ts` ~279–352) fans out +per-table work through `pLimit(DESCRIPTION_TABLE_CONCURRENCY)` (default 4) and +**accumulates every table's result in an in-memory `updates` array**, returned only +when the whole stage finishes. `runEnrichmentStage` (~413, ~421–474) then calls +`saveCompletedStage` (writing the whole-stage row to `local_scan_enrichment_stages`) +**after** `compute()` returns, and the spec-19 checkpoint write +(`writeLocalScanEnrichmentCheckpoint`, `local-enrichment-artifacts.ts` ~351–379, +fired by the `onCheckpoint` hook in `local-scan.ts`) also runs **only once the +descriptions stage completes**. There is no within-stage persistence: while the +stage runs, every enriched table's description lives only in memory. + +So if the stage cannot complete — 2 of 285 tables hang (gap #1), or the process is +killed, or a supervising watchdog fires — **all** already-enriched tables are lost, +even though their (expensive, paid) LLM descriptions were finished. On the next run, +`findCompletedStage` finds no row, so the descriptions stage **recomputes from +scratch**. + +Observed (same run): `covid19_usa` had **283/285** tables enriched in memory but +**0** rows in `local_scan_enrichment_stages` and **0** `ai:` descriptions on disk; +killing the wedged ingest discarded all 283, forcing a from-scratch re-ingest. 
The +cost of 2 pathological tables was 283 tables' worth of redone LLM calls. + +Sharper still (re-ingest with a short, *enforced* timeout): even when the stage +**runs to the end** — the 2 hung tables hit their timeout and were skipped, so +**283/285** descriptions were generated and the ingest reported success (`Scan +completed` / `Ingest finished`, embeddings built, exit 0) — the descriptions were +**still persisted as 0** (0 `ai:` on disk, 0 stage rows). So the loss is **not** +only "discarded on kill": a stage that completes with *any* skipped/aborted table +threw away **every** successfully-generated description. The skip must be +**graceful** — a skipped table costs one missing description, not the entire stage's +output — which is the strongest argument for per-table incremental persistence: the +283 good descriptions should have been durable the moment each was produced. + +The on-disk artifacts already carry everything needed to fix this *additively*: the +`_schema` manifest encodes per-table completion (a table with `descriptions.ai` is +AI-enriched), and rewrites preserve existing descriptions +(`mergeDescriptionsPreservingExternal`, `manifest.ts` ~96–115; +`loadExistingManifestState`, `local-enrichment-artifacts.ts` ~196–253 — the basis +spec 19 relies on). The durable record and the resume-skip set can be **derived from +the system's own on-disk state**, with no new cache schema. + +## Generic use case (independent of any benchmark) + +Anyone ingesting a large or wide schema with an LLM enrichment backend — +especially a **subprocess** backend, the common local/desktop setup — will +eventually hit a table whose description call hangs: a provider stall, a rate-limit +black-hole, a pathologically large prompt. Without an *enforced* timeout, one such +table wedges the entire ingest indefinitely and leaks the spawned child; without +*incremental* persistence, any interruption throws away all the per-table LLM work +already done — the dominant ingest cost. 
Both fixes make large-schema enrichment +**resilient and resumable**: a few bad tables degrade to a few skipped +descriptions, not a hung process and a from-scratch redo. This is core robustness +for a general-purpose ingestion product, wholly independent of any benchmark. + +## Design decisions (resolved during refinement) + +These resolve ambiguities the intake draft left open. They constrain the +implementer; the exact code is theirs (requirement-level, per the specs README). + +### D1 — One bounded-call guarantee; enforcement follows the backend's nature + +The canonical contract is a single guarantee for the per-table enrichment call: +**the in-flight work terminates and ktx's await settles within the per-table +deadline plus a small grace, on every backend.** How that guarantee is met follows +from a structural property of the configured backend — *does it own a subprocess?* +— not from a hand-maintained list of provider names: + +- **Subprocess-backed (`codex`, `claude-code`):** the SDK's own abort is + insufficient (SIGTERM-only, and ktx has no kill handle), so ktx runs the call + behind a **boundary it can hard-kill** — a short-lived ktx-owned child process, + made a **process-group leader** (`detached`). The SDK's grandchild (the + `codex`/`claude` binary) inherits that group. On deadline (or `ctx.signal`), ktx + **tree-kills the whole group with SIGKILL** — reaping the wrapper *and* the + grandchild — and rejects promptly. This mirrors spec 16's child-process + + SIGKILL mechanism, extended by the critical step that **killing the immediate + child is not enough**: the grandchild would otherwise orphan to init and keep its + provider connections. Killing the group is the real fix. +- **HTTP-backed (`anthropic`/`vertex`/`gateway`/`openai`):** unchanged. The existing + in-process `abortSignal` → `fetch` cancellation already satisfies the contract — + the await settles promptly and there is no subprocess to leak. 
Routing these + through a subprocess would pay fork + IPC + credential-passing cost for no benefit. + +> The branch on "subprocess-backed?" is behavior following from an input the backend +> declares about itself, not vendor enumeration — the same guarantee is reached two +> ways because the backends differ structurally. This matches the intake's own split +> ("subprocess SIGKILL for process-backed; request abort for HTTP-backed"). +> +> Rejected alternative — a *settle-only race* (reject ktx's promise on the deadline +> regardless of the SDK, but leave the SDK's child running). It unwedges the stage +> but leaves the orphaned child holding provider connections — the exact leak the +> incident showed — so it fails the intake's "actually cancelled" requirement and +> compounds over a long ingest that hits several hung tables. +> +> Rejected alternative — a *persistent ktx subprocess pool* hosting the runtime, +> killed and respawned on timeout. Terminate-on-deadline destroys the worker, so a +> pool needs respawn + in-flight job-tracking for no benefit: the enrichment call is +> low-frequency relative to its own latency and already concurrency-bounded (4), so +> one short-lived child per call (spec 16's resolved choice) is simpler and as fast. + +**Portability.** ktx supports Windows, where POSIX process groups and +`process.kill(-pgid, …)` do not exist. The tree-kill MUST be portable: a detached +process group + `kill(-pgid, 'SIGKILL')` on POSIX, and a tree-terminating +equivalent on Windows (e.g. `taskkill /pid /T /F` or a job object) so the +grandchild is reaped on every platform the subprocess backends run on. + +### D2 — Default stays moderate and the retry/skip policy is unchanged + +The per-table timeout default stays **120s** (`KTX_ENRICH_LLM_TIMEOUT_MS`), with the +existing per-attempt retry (`KTX_ENRICH_LLM_ATTEMPTS`, default 3) and the +no-retry-on-timeout policy. 
A hung table costs **at most one timeout**, then the +table is skipped with the existing `enrichment_timeout` warning and the stage +proceeds. The 30-min value in the incident was an operator stopgap chosen *because* +the timeout was cosmetic; once D1 makes the timeout actually terminate the work, a +long timeout is strictly worse for a hang (a hang costs the full timeout), so the +moderate default is the correct operating point. The retry loop stays in +`description-generation.ts`: each attempt runs through the bounded boundary (D1), so +a transient backend error retries while a timeout surfaces as `KtxAbortedError` and +does not. + +> Not introducing a new `ktx.yaml` config field for the timeout. The existing env +> override is the tuning seam; adding a per-connection/per-call/global knob would +> multiply the runtime surface for no stated need (one opinionated default + the +> existing env override is the canonical ktx shape). + +### D3 — Persist descriptions incrementally; derive the resume-skip set from on-disk state + +During the descriptions fan-out, flush completed tables **per batch** (every N +tables / on a timer, at a cadence that bounds the at-risk window) to the durable +on-disk artifacts, reusing spec 19's additive write: + +- the raw descriptions artifact (`descriptions.json`) is the **resume-skip source**; +- the `_schema` manifest is updated additively (`mergeDescriptionsPreservingExternal` + preserves prior `ai:`/`db:`/external keys) so finished descriptions are also + **queryable** the moment they are computed — the spec-19 invariant, one level + deeper. The implementer MAY bound manifest-rewrite cost on huge schemas by + rewriting only changed shards. + +On resume, `generateDescriptions` reads the existing record, **skips any table +already enriched**, computes only the remainder, and returns the merged full set so +the embeddings stage, the checkpoint write, and the stage-store row all see a +complete result exactly as today. 
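The resume step above (read the existing record, skip already-enriched tables, compute the remainder, return the merged set) could be sketched roughly as follows. This is a minimal illustration, not the ktx implementation: `DurableDescriptionRecord`, `planDescriptionResume`, and `mergeDescriptions` are hypothetical names, and the sketch assumes the durable record carries the `inputHash` it was produced under so that a record from a different content identity is ignored for skipping.

```typescript
// Hypothetical shapes for illustration; the real ktx record and function names may differ.
interface DurableDescriptionRecord {
  inputHash: string;
  descriptions: Record<string, string>; // table name -> finished description
}

interface ResumePlan {
  recovered: Map<string, string>; // tables whose prior description is reused
  remaining: string[]; // tables that still need an LLM call
}

// Split the table list into "already enriched" and "still to enrich".
// The prior record is only trusted when its inputHash matches the current one.
function planDescriptionResume(
  tables: readonly string[],
  currentInputHash: string,
  prior: DurableDescriptionRecord | null,
): ResumePlan {
  const priorEntries =
    prior !== null && prior.inputHash === currentInputHash
      ? Object.entries(prior.descriptions)
      : [];
  const priorDescriptions = new Map<string, string>(priorEntries);

  const recovered = new Map<string, string>();
  const remaining: string[] = [];
  for (const table of tables) {
    const done = priorDescriptions.get(table);
    if (done !== undefined) {
      recovered.set(table, done);
    } else {
      remaining.push(table);
    }
  }
  return { recovered, remaining };
}

// Build the full result downstream stages expect: recovered descriptions,
// fresh ones, and null for tables that were skipped or failed this run.
function mergeDescriptions(
  tables: readonly string[],
  recovered: ReadonlyMap<string, string>,
  fresh: ReadonlyMap<string, string | null>,
): Map<string, string | null> {
  const merged = new Map<string, string | null>();
  for (const table of tables) {
    merged.set(table, recovered.get(table) ?? fresh.get(table) ?? null);
  }
  return merged;
}
```

Keeping the plan and the merge as pure functions leaves the LLM fan-out and the per-batch flush outside the sketch; a table that ends up `null` is simply absent from the next run's recovered set, so it lands in `remaining` again.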
+ +**The skip is `inputHash`-gated**, preserving spec 19's recompute semantics. The +durable record is tagged with the descriptions stage's `inputHash` +(`computeKtxScanEnrichmentInputHash`). Resume reuses it to skip tables **only when +the current `inputHash` matches** — a genuine resume-after-interruption of the same +content identity. A changed `inputHash` (schema or enrichment settings changed) +ignores the prior record for skipping and recomputes the stage as today; the +manifest write stays additive regardless. The artifact's on-disk shape may gain the +`inputHash` tag with **no migration bridge** (ktx owns the artifact; a stale-shaped +record simply forces one non-incremental run), consistent with ktx's +no-backward-compatibility policy. + +> The skip set is **derived from the artifacts ktx already writes**, not from a new +> per-table cache table. The manifest's `ai:` field already encodes "this table is +> enriched"; a parallel per-table SQLite record would be a second source of truth for +> the same fact and would drift. The whole-stage `local_scan_enrichment_stages` row is +> still written at stage completion (it remains the stage-level resume gate — a clean +> re-run skips the descriptions stage as today); the incremental record only matters +> when the stage did **not** complete — exactly the case where no row exists and +> `compute()` re-runs. + +### D4 — A killed-mid-stage run is durable; resume is cheap + +A process killed mid-stage (gap #1 wedge, SIGKILL, crash, supervisor) leaves the +per-batch-flushed tables durable on disk. The next run resumes the descriptions +stage (no completed `local_scan_enrichment_stages` row → `compute()` runs again), +but `generateDescriptions` now **re-issues LLM calls only for the unfinished +tables**. A failed/skipped table (timeout or exhausted retries) is left for the +remainder set and is retried on the next resume — never silently treated as done. + +## Requirements + +### 1. 
The per-table enrichment timeout is enforced for subprocess backends + +When the per-table deadline fires (or `ctx.signal` aborts) on a subprocess-backed +backend (`codex`, `claude-code`), the in-flight LLM work — the spawned child **and +its descendants** — MUST be terminated (SIGKILL of the process group / tree), and +ktx's `generateObject` await MUST settle within the deadline plus a small bounded +grace. A hung table MUST cost at most ~one timeout of wall-clock, never unbounded. +The termination MUST be portable across the platforms the subprocess backends run on +(POSIX process-group kill and a Windows tree-kill equivalent). HTTP-backed backends +keep their existing native `abortSignal` → `fetch` cancellation; the guarantee is one +contract met two ways, branching on the backend's structural "owns a subprocess" +property, not on a list of provider names. + +### 2. The timeout default and retry/skip policy are unchanged + +The default per-table timeout stays moderate (current 120s, `KTX_ENRICH_LLM_TIMEOUT_MS`), +with the existing per-attempt retry (default 3, `KTX_ENRICH_LLM_ATTEMPTS`) and the +no-retry-on-timeout policy. On timeout, the table is skipped with the existing +`enrichment_timeout` recoverable warning and the stage proceeds. No new +per-connection / per-call / global timeout knob is added. + +### 3. Descriptions are persisted incrementally during the stage + +Enriched descriptions MUST be flushed to the durable on-disk artifacts **per batch** +(per-table or per-N-tables / on a timer) during the descriptions stage, at a cadence +that bounds the at-risk window to a small number of tables. The flush MUST be +idempotent and additive (never clobber a prior `ai:` description; preserve `db:` and +external keys via the existing merge). Finished tables MUST remain durable even if the +stage never completes — is wedged, killed, or interrupted. 
A failed/skipped +relationship/embedding stage or a killed descriptions stage MUST NOT lose the +descriptions already flushed. + +### 4. Resume re-enriches only the unfinished tables + +On a resumed ingest with an unchanged `inputHash`, the descriptions stage MUST +re-issue LLM description calls **only for tables not already enriched**, deriving the +already-enriched set from the on-disk artifacts (the `inputHash`-tagged durable +record / the manifest's `ai:` descriptions), and MUST return the merged full result +so downstream stages behave as on a fresh run. A changed `inputHash` (schema or +enrichment settings changed) MUST recompute the stage as today (spec 19's +inputHash-gated semantics preserved). The durable record MAY be recreated without a +migration bridge if its on-disk shape changes (it is regenerable local/artifact +state). + +### 5. No regression for small or uninterrupted ingests + +A small or single-run ingest that is never interrupted MUST produce the same +artifacts (descriptions, manifest, embeddings) as today. The incremental flush MUST +be idempotent with the spec-19 checkpoint and the terminal write (descriptions +survive the embeddings/relationship rewrites). The bounded-call boundary MUST NOT +change a normal successful enrichment's output, only how a wedged call is terminated. + +### 6. A skipped table costs one description, never the stage's output + +A descriptions stage that **completes** with one or more skipped/aborted tables MUST +persist every successfully-generated description (the durable record and the `ai:` +manifest entries) and MUST mark the stage completed (a `local_scan_enrichment_stages` +row, embeddings + downstream proceeding) — it MUST NOT discard the whole stage's +output because some tables were skipped. No single table's failure may reject the +per-table fan-out: a per-table failure degrades to one missing description (left for +the resume remainder), not a failed stage. 
A genuine `ctx.signal` cancellation is the +only thing that fails the stage (so it resumes), and even then the already-flushed +descriptions remain durable. + +## Acceptance criteria + +- **Enforced timeout (subprocess backend):** a subprocess-backed enrichment call + that hangs past the deadline is terminated within the deadline plus a small grace; + ktx's await settles, the spawned child **and a grandchild it spawned** both exit + (verified via the child's `exit`, not left spinning), and the table is skipped with + an `enrichment_timeout` warning. The stage advances rather than wedging. A + `ctx.signal` abort terminates the same way. +- **HTTP backend unaffected:** an HTTP-backed enrichment call still cancels promptly + on abort via the existing native path, with no subprocess involved. +- **Default + policy:** the default timeout is 120s and a timeout is not retried (one + wedge = one timeout); a transient error is still retried up to the attempt limit. +- **Graceful skip persists the rest:** a stage that completes with one table failing + (timeout, exhausted retries, or an unexpected throw) still writes the other N−1 + descriptions to the durable record + `ai:` `_schema` and marks the stage completed + (a `local_scan_enrichment_stages` row exists); the failed table is a single `null` + description left for the resume remainder, not a discarded stage. +- **Incremental durability:** interrupting the descriptions stage after K of N tables + leaves those K durable on disk (raw artifact + `ai:` descriptions in `_schema`), + with no completed `local_scan_enrichment_stages` row. +- **Resume does not re-spend:** re-running the interrupted ingest (unchanged + `inputHash`, fresh `runId`) issues **no** LLM description calls for the K already- + enriched tables and enriches only the remaining N−K; the returned result is the + full merged set. A changed `inputHash` recomputes the stage. 
+- **No regression:** a small uninterrupted ingest yields identical artifacts and the + same descriptions/embeddings output as today; the incremental flush is idempotent + with the checkpoint and terminal writes. + +## Non-goals + +- **Incremental persistence of embeddings.** Embeddings are fast and already covered + by spec 19's stage-level cross-run resume; the dominant loss is descriptions. This + spec scopes incremental persistence to the `descriptions` stage. +- **Changing the timeout default, retry counts, or adding a timeout config knob.** + D2 keeps the moderate default and the single env tuning seam. +- **Routing HTTP backends through the subprocess boundary.** Their native abort + already meets the contract; a subprocess would add cost and a credential-passing + surface for no benefit. +- **A persistent subprocess pool.** One short-lived ktx child per subprocess-backed + call; no pool, no respawn/job-tracking (D1). +- **Re-implementing spec 16 (per-query deadline) or spec 19 (relationship-stage + budget, cost-boundary checkpoint, cross-run stage resume).** This spec composes + above them: spec 16 bounds individual queries, spec 19 makes whole stages durable + and resumable, and this spec hardens the per-table enrichment call's termination + and adds within-stage description durability. +- **A general per-stage incremental-flush framework.** The incremental flush is + specifically the descriptions stage; it is not a generic abstraction over every + enrichment stage. + +## Implementation orientation + +Line numbers drift; treat these as anchors, not addresses. The implementer owns the +design. + +- **Bounded per-table call (gap #1)** — `context/scan/description-generation.ts`, + `KtxDescriptionGenerator.generateBatchedTableDescriptions` (the bounded+retry block + ~760–866; `enrichTimeoutMs` ~769, `enrichAttempts` ~770, `KtxAbortedError` on + timeout ~811, `enrichment_timeout`/`enrichment_failed` warnings ~858). 
The retry + loop stays here; each attempt runs through the kill boundary for subprocess + backends. +- **LLM runtime + backend selection** — `context/llm/runtime-port.ts` + (`KtxLlmRuntimePort.generateObject`, `abortSignal` on the input), + `context/llm/local-config.ts` (~127–163, selects `CodexKtxLlmRuntime` / + `ClaudeCodeKtxLlmRuntime` / `AiSdkKtxLlmRuntime`), `context/project/config.ts` + (`KTX_LLM_BACKENDS`). The "owns a subprocess" property should be declared by the + backend/runtime (e.g. on the runtime interface), not inferred from a name list. +- **Subprocess backends** — `context/llm/codex-runtime.ts` + + `context/llm/codex-sdk-runner.ts` (`CodexSdkCliRunner.runStreamed`, the SDK's + `spawn(executable, args, { signal })` is in `@openai/codex-sdk`), + `context/llm/claude-code-runtime.ts` (`collectResult` ~275–322, the `interrupt()` + abort path). These are what the kill boundary must wrap and tree-kill. +- **Reuse spec 16's mechanism (extended to group/tree kill)** — + `connectors/sqlite/read-query-child.ts` (the forked child shape) and + `connectors/sqlite/connector.ts` `runReadQueryOffProcess` (~292–350: `fork`, + deadline timer, `child.kill('SIGKILL')`, `settle()`, the `.js`-if-exists-else-`.ts` + child-URL resolver ~25–27, knip dynamic entry). Gap #1 differs by making the child a + process-group leader and killing the **group/tree** (the SDK grandchild), portably. + Abort helpers: `context/core/abort.ts` (`createAbortError`, `throwIfAborted`, + `linkAbortSignal`). Note the new child hosts an LLM runtime, so the implementer owns + passing the backend config/credentials to it (env/IPC) and serializing the + structured result back. +- **Incremental persistence (gap #2)** — + `context/scan/local-enrichment.ts` (`generateDescriptions` ~279–352: the per-table + `pLimit` fan-out and the in-memory `updates` accumulation; `runEnrichmentStage` + ~413/~421–474 with `findCompletedStage` ~427 and `saveCompletedStage`; the + `onCheckpoint` hook ~598–612). 
Make `generateDescriptions` resume-aware: read the + existing record, skip already-enriched tables, flush per batch, return the merged + full set. +- **Artifact writer + additive merge** — `context/scan/local-enrichment-artifacts.ts` + (`writeLocalScanEnrichmentCheckpoint` ~351–379, `writeEnrichmentDescriptionArtifacts` + with `descriptions.json` ~316, `writeLocalScanManifestShards` ~270–308, + `loadExistingManifestState` ~196–253, `tableDescription`/`columnDescription` + ~75–105); `context/scan/manifest.ts` (`mergeDescriptionsPreservingExternal` ~96–115, + `SCAN_MANAGED_DESCRIPTION_KEYS`). Factor a per-batch flush that reuses the additive + description/manifest write; tag the durable record with `inputHash`. +- **Stage store + input hash** — + `context/scan/sqlite-local-enrichment-state-store.ts` (`STAGES_TABLE = + 'local_scan_enrichment_stages'`, PK `(connection_id, stage, input_hash)`, + `findCompletedStage`, `saveCompletedStage`), + `context/scan/enrichment-state.ts` (`computeKtxScanEnrichmentInputHash` ~78). The + whole-stage row stays; the `inputHash` is the gate for the resume-skip set. +- **Scan driver** — `context/scan/local-scan.ts` (the `onCheckpoint` wiring and the + terminal `writeLocalScanEnrichmentArtifacts`), and `KtxScanContext.signal` + (`context/scan/types.ts`) which the kill boundary must honor. +- **Tests** — gap #1: a fake subprocess-backed runtime whose child hangs (ignores + SIGTERM) is killed at a tiny test-seam deadline; assert the await settles within + deadline+grace, the child and a spawned grandchild both exit, and the table is + skipped with `enrichment_timeout`; assert an HTTP-backed abort still settles via the + native path. 
gap #2: interrupt the descriptions stage after K/N tables (a flush + seam), assert the K are durable (raw artifact + `ai:` in `_schema`) with no completed + stage row; a resume with matching `inputHash` issues no LLM calls for the K and + enriches only N−K; a changed `inputHash` recomputes; regression: a small + uninterrupted ingest yields identical artifacts. +- After implementing, rebuild and re-link so the playground picks it up: + `pnpm run build && pnpm run link:dev`. + +## Benchmark context (motivation, not a requirement) + +Surfaced during the Spider 2.0-Lite **BigQuery** ingest (2026-06-23, codex enrichment +backend). Re-enriching the giant public datasets, `covid19_usa` wedged at 268/285 for +41+ minutes on 2 hung 252-column tables; the 30-min per-table `AbortSignal` timeout +never killed the hung codex children, and because descriptions checkpoint only at +stage completion, the 283 already-enriched tables were unrecoverable — the operator +had to kill, cache-bust, and re-ingest the database from scratch (with a short timeout +as a stopgap). The benchmark merely exercised a large/wide multi-dataset ingest at +scale; the gaps and the fixes are generic production hygiene for any agent that +enriches a real warehouse with a subprocess LLM backend. Do not encode any benchmark +specifics in the implementation. + +## Implementation notes + +Implemented on branch `write-feature-spec-wiki`. Both gaps shipped; all acceptance +criteria are covered by tests. The full ktx test surface for the touched code is +green (the only failures in the whole suite are 3 pre-existing assertions in +`test/skills/analytics-skill-content.test.ts` about the analytics SKILL.md markdown +— an unrelated subsystem this change does not touch). + +### Gap #1 — enforced timeout for subprocess backends + +- **Structural property on the runtime, not a name list.** Added + `subprocessForkSpec(): SubprocessRuntimeForkSpec | null` to `KtxLlmRuntimePort` + (`context/llm/runtime-port.ts`). 
`CodexKtxLlmRuntime` / `ClaudeCodeKtxLlmRuntime` + return a serializable `{ backend, projectDir, modelSlots }`; `AiSdkKtxLlmRuntime` + (and the deterministic stub) return `null`. The per-table call branches on this, + never on a vendor list (D1). +- **Shared structured core.** Both subprocess runtimes gained + `generateStructuredJson(jsonSchema)` (returns the raw object; the caller + Zod-validates). Their existing `generateObject` was refactored to delegate to the + same streaming core, so structured generation has one implementation. +- **Kill boundary.** New `context/llm/subprocess-generate-object.ts` + (`runGenerateObjectInSubprocess`, `KtxSubprocessDeadlineError`) forks a ktx-owned + child (`subprocess-generate-object-child.ts`) **detached** (process-group leader); + the SDK's model binary inherits the group. On the deadline or `ctx.signal`, ktx + tree-kills the group with `SIGKILL` (`process.kill(-pid, …)` on POSIX, + `taskkill /pid /T /F` on Windows) and rejects promptly; on success the raw + output is Zod-validated. Credentials reach the child via inherited `process.env` + (the runtimes re-derive their allowlisted env), never over IPC. +- **Wiring.** `KtxDescriptionGenerator.generateBatchedTableDescriptions` + (`context/scan/description-generation.ts`) routes each retry attempt through the + boundary for subprocess backends and keeps the native `AbortSignal` → `fetch` + path for HTTP backends. A fired deadline maps to the existing + `KtxAbortedError`/`enrichment_timeout` no-retry policy (one wedge = one timeout); + default stays 120s (D2). +- **Tests.** `test/context/llm/subprocess-generate-object.test.ts` forks a real + fixture child that spawns a grandchild and ignores SIGTERM, and asserts the + deadline/abort tree-kills both (the grandchild PID is reaped) and the await + settles within deadline+grace; plus success / schema-failure / child-error paths. 
+ `test/context/scan/description-generation.test.ts` adds the generator-level + timeout-skip and the "HTTP backend spawns no child" cases. + +### Gap #2 — incremental descriptions persistence + resume + +- **Durable record + resume store.** `createKtxScanDescriptionResumeStore` + (`context/scan/local-enrichment-artifacts.ts`) writes the descriptions-so-far to + a durable record (inputHash-tagged) and **only the manifest shards that gained a + table this batch** (new `onlyChangedTableNames` filter on + `writeLocalScanManifestShards`, additive merge preserved). `load(inputHash)` + returns the prior enriched set only on a matching inputHash (D3). +- **Resume-aware fan-out.** `generateDescriptions` (`context/scan/local-enrichment.ts`) + loads the prior record, skips already-enriched tables, enriches only the + remainder, flushes every `DESCRIPTION_FLUSH_EVERY` (10) completed tables (a single + in-flight flush; the final force-flush drains the tail), and returns the full + merged set (recovered + fresh + `null` for still-failed, so failures are retried, + D4). Wired through `local-scan.ts` (store constructed when not `--dry-run`). +- **Graceful-skip backstop (requirement 6).** The per-table worker wraps the call in + a try/catch: any non-cancellation failure degrades to one `null` description + an + `enrichment_failed` warning and the fan-out continues, so no single table can + reject `Promise.all` / abort the stage. This makes the "one skipped table costs one + description, not the stage's output" guarantee live at the stage boundary + (`generateBatchedTableDescriptions` already degrades its own failures; this is the + explicit backstop). A `ctx.signal` cancellation still propagates (the stage fails + and resumes), and the already-flushed descriptions stay durable. This closes the + field bug where a completed-with-skips stage persisted 0 descriptions / 0 stage rows. 
+- **Deviation from the spec's literal path (necessary correction).** The durable + record lives at a **stable, non-`syncId`** path + (`raw-sources//live-database/enrichment-progress/descriptions.json`), + not the `syncId`-scoped `…//enrichment/descriptions.json` the spec named. + Reason: a from-scratch interruption (the incident's exact case — no prior + *completed* run) gets a **fresh `syncId`** on the next run + (`buildSyncId` in `context/ingest/local-stage-ingest.ts`), so a `syncId`-scoped + record would be unreachable on resume. The manifest is already at the stable + per-connection scope (`semantic-layer//_schema/`), so this keeps the + resume source at the same stable scope. The `syncId`-scoped `enrichment/descriptions.json` + debug artifact written by the terminal/checkpoint writers is unchanged. +- **Tests.** `test/context/scan/description-resume.test.ts` drives + `runLocalScanEnrichment` against a real git-backed project: a fresh run flushes a + durable record + `ai:` manifest descriptions; a matching-`inputHash` resume issues + zero LLM calls and returns the full merged set; a partial record re-enriches only + the missing tables; a changed `inputHash` recomputes; the changed-shard filter + rewrites only the affected shard; and (requirement 6) a run where one table fails + still persists the other tables (durable record + `ai:`) and **completes the stage** + (a completed `local_scan_enrichment_stages` row), with the failed table left `null` + for resume. + +### Incidental + +- Fixed a stale assertion in `description-generation.test.ts` ("does not run + per-column fallback…" expected 1 call) to `3`, matching the retry policy added in + commit `01f63380` (D2 / acceptance: a transient error retries up to the attempt + limit). The HTTP path is unchanged; the assertion simply predated the retry. +- No new `ktx.yaml` config field or runtime knob was added (D2). 
The rate-limit + governor is not wired into the scan-enrichment path, so the kill-boundary child + loses no pacing. +- Rebuilt and re-linked (`pnpm run build && pnpm run link:dev`); the child compiles + to `dist/context/llm/subprocess-generate-object-child.js`. diff --git a/spider2-specs/specs/21-selective-enrichment-stages.md b/spider2-specs/specs/21-selective-enrichment-stages.md new file mode 100644 index 00000000..130647b1 --- /dev/null +++ b/spider2-specs/specs/21-selective-enrichment-stages.md @@ -0,0 +1,567 @@ +# Selective enrichment stages (`--stages`) + per-stage cache keys + +> Refined spec. Intake draft: `todo/21-selective-enrichment-stages.md`. +> +> **Scope: make the three enrichment stages independently invalidatable and +> independently re-runnable.** Today one coarse cache key gates all three stages, +> so changing any one stage's inputs re-pays for every stage — most painfully the +> expensive per-table `descriptions`. And there is no CLI surface to re-run a +> chosen subset. This spec splits the key per stage (so a change invalidates only +> the stage it touched) and adds a `--stages` flag that force-re-runs a chosen +> subset while preserving the others. It is the operability follow-on to spec 19 +> (durable, cross-run stage resume) and spec 20 (resilient, per-table-resumable +> descriptions); it composes with both rather than replacing them. + +## Problem + +Enrichment has three stages — **`descriptions`** (one paid LLM call per table), +**`embeddings`** (sentence-transformer vectors over the schema + descriptions), +**`relationships`** (FK/join detection, optionally LLM-proposed). After specs 19 +and 20 these stages are durable and resumable, but they are still **coupled for +cache invalidation and unreachable for selective re-run**. Three facts make a +targeted re-run impossible without a full, expensive re-enrich. + +### 1. 
One coarse cache key gates all three stages + +`runLocalScanEnrichment` (`context/scan/local-enrichment.ts:611`) computes a single +`inputHash` from `{ snapshot, mode, detectRelationships, providerIdentity, +relationshipSettings }` and every stage reuses it — `descriptions` (~`:642`), +`embeddings` (~`:673`), `relationships` (~`:729`). `providerIdentity` itself +(`localScanProviderIdentity`, `local-scan.ts:241–255`) is one blob conflating the +description LLM identity, the embedding model/dimensions/batch size, **and** the +whole relationship config — and it redundantly re-encodes `mode` and +`relationships`, which the coarse hash already mixes in. + +The consequence: flipping `scan.relationships.llmProposals`, switching the LLM +backend, or upgrading the embeddings model changes the **one** hash and so +invalidates **all three** stages. ktx then re-runs the expensive per-table +`descriptions` even though they did not conceptually change. The headline cost of +the system — paid LLM description calls — is thrown away on any unrelated +enrichment-config edit. + +### 2. No CLI surface to select stages + +The enrichment internals already support a relationships-only path +(`KtxScanMode` `'relationships'`, `types.ts:12` — `descriptions`/`embeddings` are +gated on `mode === 'enriched'` at `local-enrichment.ts:632`, while +`shouldDetectRelationships` admits `mode === 'relationships'` at `:624–626`). But +`ktx ingest` hardcodes `mode: 'enriched'` (`public-ingest.ts:973`) and exposes no +flag to select a subset (`ingest-commands.ts:26–49` — only `--no-query-history` +and friends). The relationships-only capability is built but unreachable, and there +is no way at all to ask for "descriptions only" or "embeddings only." + +### 3. 
The foundation for "touch one stage, keep the rest" already exists + +The per-stage store `local_scan_enrichment_stages` is keyed +`(connection_id, stage, input_hash)` (spec 19) and the descriptions write is +additive — `mergeDescriptionsPreservingExternal` (`manifest.ts`) and +`loadExistingManifestState` (`local-enrichment-artifacts.ts`) preserve prior `ai:`, +`db:`, and external description keys on rewrite; spec 20's per-table resume record +(`createKtxScanDescriptionResumeStore`, `local-enrichment-artifacts.ts:286`) already +re-issues LLM calls only for the still-failed tables. So "recompute one stage, leave +the others byte-for-byte" needs only two missing pieces: **per-stage key +granularity** and a **CLI surface** to select stages. + +**Requirement:** let an operator re-run a chosen subset of enrichment stages on an +already-ingested connection, recomputing only those stages, preserving the others' +artifacts untouched, and **re-paying only for what genuinely changed** — never +re-running the costly `descriptions` because an unrelated stage's inputs moved. + +## Generic use case (independent of any benchmark) + +Any team running ktx in production maintains its semantic layer over time: they +improve the description prompt or switch the description LLM, upgrade the embeddings +model, or turn on LLM-proposed joins. Today each of those forces a **full re-enrich +of every connection** — re-running the expensive per-table descriptions even when +only embeddings or relationships changed. Two routine operations should be cheap and +targeted: + +- **"Re-embed everything on the new model."** Swapping the embeddings model should + recompute only embeddings, leaving descriptions and joins on disk. +- **"Backfill joins now that `llmProposals` is on."** Enabling LLM-proposed + relationships should recompute only relationships. 
+ +And one operation needs an explicit trigger because no input changed: + +- **"These descriptions came out thin — re-run them with a longer timeout."** A + connection whose description coverage is poor because tables timed out (same + snapshot, same LLM, so the hash is unchanged) should be re-runnable on demand, + cheaply retrying only the tables that failed. + +This is core operability for a long-lived ingestion product and is wholly +independent of any benchmark. + +## Design decisions (resolved during refinement) + +These resolve ambiguities the intake draft left open. They constrain the +implementer; the exact code is theirs (requirement-level, per the specs README). + +### D1 — Split the coarse hash into three per-stage input hashes + +Replace the single `computeKtxScanEnrichmentInputHash` call with **per-stage** hash +computation, each keyed on only that stage's own inputs. Decompose the +`localScanProviderIdentity` blob into the slices each stage actually depends on: + +- **`descriptions`** → `{ snapshot, llmIdentity }`, where `llmIdentity` is the + description-LLM identity (`llm.models.default`, `baseUrlConfigured`). **Not** the + embedding model/dimensions/batch size, **not** relationship settings. +- **`embeddings`** → `{ snapshot, embeddingIdentity, descriptionDigest }`, where + `embeddingIdentity` is `{ model, dimensions, batchSize }` and `descriptionDigest` + is a stable digest of the resolved description text the embeddings consume (the + same text `buildEmbeddings` → `buildKtxColumnEmbeddingText` feeds the model, + `local-enrichment.ts:466–486`, `embedding-text.ts:17–44`). This content-addresses + embeddings on their real upstream (D4). +- **`relationships`** → `{ snapshot, relationshipSettings (incl. `llmProposals` and + `detectionBudgetMs`), llmIdentity }`. **Not** the description content (decision X, + D5), **not** the embedding identity. 
+ +`mode` and `detectRelationships` drop out of the per-stage inputs: each stage +produces output under exactly one mode, so the stage name already scopes that, and +re-mixing `mode` only re-couples the keys. After the split, flipping `llmProposals` +invalidates only `relationships`; swapping the embeddings model invalidates only +`embeddings`; switching the description LLM invalidates only `descriptions`. + +The per-stage hash becomes the key everywhere a single hash is used today: the +`local_scan_enrichment_stages` lookup/save in `runEnrichmentStage`, and the spec-20 +descriptions resume record (`createKtxScanDescriptionResumeStore`), which is now +keyed on the **descriptions** stage's hash — so changing the embedding model no +longer busts the descriptions resume record, a strict improvement. + +> **No migration bridge.** The stage store and the descriptions resume record are +> disposable local `.ktx` state (regenerable from a fresh ingest). The new per-stage +> keys simply miss the old coarse-keyed rows, forcing one full re-enrich on the next +> run after upgrade. Recreate/ignore stale-shaped records with no compatibility +> shim, consistent with specs 19/20 and ktx's no-backward-compatibility policy. + +### D2 — `--stages ` selects a subset; one gate, no new mode + +Add `ktx ingest [connectionId] --stages `, a non-empty subset of +`descriptions,embeddings,relationships`. Plural because it takes a **set**: +`--stages relationships` and `--stages descriptions,embeddings` both read naturally, +and the plural signals "list expected." Flag absent = all three (today's behavior). + +A Commander custom parser validates each name against the canonical stage registry +and parses into an ordered, de-duplicated set. **An unknown or empty stage name is a +hard `InvalidArgumentError`** — never silently ignored. The set threads CLI → +`runKtxPublicIngest` (`KtxScanArgs`) → `runLocalScan` → `runLocalScanEnrichment`. 
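The strict parsing rule in D2 can be sketched as follows. This is an illustration under stated assumptions: `STAGE_REGISTRY_SKETCH` stands in for the canonical stage registry, `parseStagesSketch` is a hypothetical name, and a plain `Error` is thrown where the CLI would throw Commander's `InvalidArgumentError`, so the block stays dependency-free. Returning the set in registry order is also an assumption; the spec only says "ordered".

```typescript
// Stand-in for the canonical stage registry (assumed order).
const STAGE_REGISTRY_SKETCH = ["descriptions", "embeddings", "relationships"] as const;
type StageSketch = (typeof STAGE_REGISTRY_SKETCH)[number];

// Hypothetical parser for the --stages value: rejects empty and unknown names,
// de-duplicates, and returns the selection in registry order.
function parseStagesSketch(raw: string): StageSketch[] {
  const selected: StageSketch[] = [];
  const names = raw.split(",").map((name) => name.trim());
  for (const name of names) {
    if (name === "") {
      throw new Error("--stages: empty stage name");
    }
    if ((STAGE_REGISTRY_SKETCH as readonly string[]).indexOf(name) === -1) {
      throw new Error(
        '--stages: unknown stage "' + name + '" (expected one of ' +
          STAGE_REGISTRY_SKETCH.join(", ") + ")",
      );
    }
    const stage = name as StageSketch; // safe: validated against the registry above
    if (selected.indexOf(stage) === -1) {
      selected.push(stage);
    }
  }
  // Registry order makes "embeddings,descriptions" and the reverse equivalent.
  return STAGE_REGISTRY_SKETCH.filter((stage) => selected.indexOf(stage) !== -1);
}
```

Because an empty string splits into a single empty name, a bare `--stages` value and a trailing comma both hit the same hard error as an unknown name.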
+ +Inside enrichment the run set is **`(mode/provider-eligible stages) ∩ (selected +stages)`** — a single gate. Each existing stage block additionally checks +membership in the selected set (`descriptions`/`embeddings` already gate on +`mode === 'enriched'` + providers; `relationships` on `shouldDetectRelationships`). +This adds **no** new `KtxScanMode` variant and **no** second parallel selection +path; `mode` keeps meaning "the connection's enrichment level," and `--stages` means +"which of those stages to (re)compute this run." A named stage that cannot run +because a prerequisite is absent (e.g. `--stages embeddings` with no embedding +provider configured) MUST fail or warn clearly, never silently no-op. + +> Rejected alternative — repurpose `mode` (`--stages relationships` → +> `mode: 'relationships'`). It only expresses single-stage cases, leaves +> `descriptions,embeddings` with no mode, and creates two ways to say "relationships +> only." The explicit stage set is the one canonical selector. + +### D3 — A named stage force-re-runs; per-table resume still avoids re-paying + +Naming a stage in `--stages` carries the intent "recompute this," so a named stage +**re-enters its `compute()`, bypassing the spec-19 completed-row short-circuit** in +`runEnrichmentStage` (`local-enrichment.ts:538–547`). The spec-20 machinery still +applies **inside** `compute()`: + +- `--stages descriptions` re-enters `generateDescriptions`, which loads the + per-table resume record and re-issues LLM calls **only for the still-null/failed + tables** (when the descriptions hash is unchanged) — the "fill thin coverage with + a longer `KTX_ENRICH_LLM_TIMEOUT_MS`" case, paying only for the gaps. +- A genuine input change (e.g. switching the LLM → a new descriptions hash) + invalidates the resume record and rebuilds the stage fully, as today. 
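To make D3 concrete, here is a rough sketch of the two behaviors it combines: a forced stage skips the completed-row short-circuit, and the descriptions compute still retries only the tables that are null in the resume record. All names (`runStageSketch`, `fillMissingDescriptions`, the plain-object store) are hypothetical stand-ins, and the real stage runner is presumably asynchronous.

```typescript
// Hypothetical stage runner over an in-memory store keyed by "stage:inputHash".
function runStageSketch(
  store: Record<string, string>,
  stage: string,
  inputHash: string,
  forceRecompute: boolean,
  compute: () => string,
): { result: string; recomputed: boolean } {
  const key = stage + ":" + inputHash;
  const hasCompletedRow = Object.prototype.hasOwnProperty.call(store, key);
  // No flag: respect the completed-row short-circuit (cross-run resume).
  if (!forceRecompute && hasCompletedRow) {
    return { result: store[key], recomputed: false };
  }
  // Named in --stages, or a cache miss: re-enter compute() and record the row.
  const result = compute();
  store[key] = result;
  return { result, recomputed: true };
}

// Per-table resume inside compute(): only tables still null are described again.
function fillMissingDescriptions(
  record: Record<string, string | null>,
  describe: (table: string) => string | null,
): { merged: Record<string, string | null>; calls: number } {
  const merged: Record<string, string | null> = {};
  let calls = 0;
  for (const table of Object.keys(record)) {
    const existing = record[table];
    if (existing !== null) {
      merged[table] = existing; // already described: keep it, no LLM call
      continue;
    }
    calls += 1;
    merged[table] = describe(table); // may stay null if the call fails again
  }
  return { merged, calls };
}
```

In this sketch the cost of a forced `descriptions` run on an unchanged hash is the number of null entries in the record, not the number of tables.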
+ +Stages **not** named are skipped entirely — not run, not resumed — and their +on-disk artifacts are left exactly as they are (additive write; preserve-others is +already the behavior). The **no-flag default is unchanged**: all eligible stages +run, the completed-row short-circuit is respected (spec-19 cross-run resume). + +Behavior follows from the input (did you explicitly name the stage?), not the call +path. A consequence to state plainly: `--stages descriptions,embeddings,relationships` +is **not** identical to passing no flag — naming all three is the explicit "force a +full enrichment recompute," whereas no flag is "ingest, resuming whatever is done." + +### D4 — Downstream staleness: one real edge, content-addressed, surfaced not silent + +The only hard dependency between stages is **`descriptions → embeddings`** +(embeddings embed the description text; `relationships` is decoupled, D5). Two +mechanisms keep it correct without a hardcoded dependency table: + +- **Self-healing via content-addressing.** Because the embeddings hash includes + `descriptionDigest` (D1), re-running `descriptions` changes that digest, so a + later embeddings run (or a full ingest) sees a hash miss and recomputes — stale + embeddings can never silently persist across a future embeddings run. (Without + this, the embeddings hash would be unchanged after a description edit and a later + run would wrongly short-circuit on stale vectors.) +- **Surfaced immediately.** After a selective run, for each **unselected** stage that + has artifacts on disk, recompute its *current* per-stage hash from on-disk state + and compare it to the stored completed-row hash; if they differ, emit a + **recoverable `enrichment_stage_stale` warning** naming the stale stage and the + cascade command (e.g. `--stages descriptions,embeddings`). This is derived from the + system's own state — it also catches "you changed the embedding model in `ktx.yaml` + but only ran `--stages descriptions`." 
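The derived staleness check in D4 could look roughly like this. It is a sketch under assumptions: `findStaleStagesSketch` and the warning shape are hypothetical, the stored and current hashes are passed in as plain objects rather than read from the state store, and the suggested command names only the stale stage.

```typescript
type StageNameSketch = "descriptions" | "embeddings" | "relationships";

interface StaleWarningSketch {
  code: "enrichment_stage_stale";
  stage: StageNameSketch;
  message: string;
}

// Compare each unselected stage's stored completed-row hash with its freshly
// recomputed hash; a mismatch yields a warning, never an automatic re-run.
function findStaleStagesSketch(
  selected: StageNameSketch[],
  storedHashes: Partial<Record<StageNameSketch, string>>,
  currentHashes: Record<StageNameSketch, string>,
): StaleWarningSketch[] {
  const allStages: StageNameSketch[] = ["descriptions", "embeddings", "relationships"];
  const warnings: StaleWarningSketch[] = [];
  for (const stage of allStages) {
    if (selected.indexOf(stage) !== -1) continue; // selected stages just ran
    const stored = storedHashes[stage];
    if (stored === undefined) continue; // no completed row, so nothing can be stale
    if (stored !== currentHashes[stage]) {
      warnings.push({
        code: "enrichment_stage_stale",
        stage,
        message: "stage " + stage + " is stale; re-run with --stages " + stage,
      });
    }
  }
  return warnings;
}
```

Because the relationships hash excludes description content, a descriptions-only run leaves its current hash equal to the stored one, so only `embeddings` is flagged in that case.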
+ +The run **never silently leaves a stale-but-unflagged downstream**, and **never +silently auto-cascades** extra work — the operator is told and decides. Re-running +`descriptions` does **not** flag `relationships` stale (D5). + +### D5 — Relationships are decoupled from description content, but still get it as context + +`relationships` keys on `{ snapshot, relationshipSettings, llmIdentity }` and is +**not** invalidated or stale-flagged by a description change (decision X). Rationale: +relationships are the low-value, best-effort, expensive-to-probe stage (spec 19's +own framing); coupling them to description content would make every routine +description re-run also invalidate joins — re-opening the exact over-invalidation +this spec exists to close. + +Independently, a `relationships`-only run (descriptions stage not running this +invocation) MUST **hydrate its working schema from the persisted on-disk enriched +`_schema`** (AI descriptions + embeddings) so `llmProposals` runs with full +description context, not raw column names. Today the relationship stage builds its +schema from the bare snapshot (db comments only — `local-enrichment.ts:621,688,740` +never merge the AI descriptions), so this also closes a latent gap: both the +full-run and the relationships-only paths MUST feed `llmProposals` the +best-available descriptions (fresh-this-run if `descriptions` ran, else on-disk) — +behavior from inputs, not path. + +### D6 — Scope: enrichment stages only, composable with existing flags + +`--stages` controls only the three enrichment stages. It is **orthogonal to and +composable with** the existing `--no-query-history` flag — a pure joins backfill +across everything is `ktx ingest --all --stages relationships --no-query-history`. +Schema introspection still runs (it is the hash substrate and the enrichment base, +and it is cheap — no LLM). 
The stage-name namespace is built as a **registry** so it +can later extend to the broader scan phases (schema / query-history / source / +memory) and subsume the inconsistent negative `--no-query-history` flag — but that +unification is **out of scope** here. + +## Requirements + +### 1. Per-stage input hashes + +Each enrichment stage MUST key its cache lookup/save and (for `descriptions`) its +resume record on a hash of only that stage's own inputs, per D1 +(`descriptions` ← snapshot + LLM identity; `embeddings` ← snapshot + embedding +identity + a digest of the embedded description text; `relationships` ← snapshot + +relationship settings + LLM identity). Changing one stage's inputs MUST invalidate +**only** that stage. The single coarse `computeKtxScanEnrichmentInputHash` over +`{ snapshot, mode, detectRelationships, providerIdentity, relationshipSettings }` +MUST be removed in favor of per-stage computation. The stage store and the +descriptions resume record MAY be recreated without a migration bridge (disposable +local state). + +### 2. `--stages` flag with strict validation + +`ktx ingest` MUST accept `--stages `, a non-empty subset of +`descriptions,embeddings,relationships`, defaulting (when absent) to all three. An +unknown or empty stage name MUST be a hard parse error (`InvalidArgumentError`), +never silently ignored. The selected set MUST thread through to enrichment and gate +which stage blocks run as `(mode/provider-eligible) ∩ (selected)` — one gate, no new +`KtxScanMode` variant, no second selection path. A selected stage whose prerequisite +is missing MUST fail or warn clearly, not silently no-op. + +### 3. 
Selecting a stage force-re-runs it; unselected stages are preserved + +A stage named in `--stages` MUST re-enter its `compute()`, bypassing the +completed-stage short-circuit, while still using the spec-20 per-table resume record +so `descriptions` re-issues LLM calls only for still-failed tables (unchanged hash) +and rebuilds fully on a changed hash. A stage **not** named MUST NOT run and MUST +leave its on-disk artifacts untouched. The no-flag default MUST preserve spec-19 +cross-run resume (all eligible stages, completed-row short-circuit respected). + +### 4. Downstream staleness is surfaced, never silent + +After a selective run, the run MUST emit a recoverable `enrichment_stage_stale` +warning for every **unselected** stage whose current per-stage hash no longer +matches its stored completed-row hash (derived from on-disk state, naming the stage +and the cascade command). The embeddings hash MUST include a digest of the embedded +description text so a later embeddings run self-heals after a description change. The +run MUST NOT silently leave a stale-but-unflagged downstream and MUST NOT silently +auto-cascade. A description change MUST NOT stale-flag `relationships`. + +### 5. Relationships run with description context + +When the `relationships` stage runs without `descriptions` having run in the same +invocation, it MUST hydrate its working schema from the persisted on-disk enriched +`_schema` (AI descriptions + embeddings) so `llmProposals` has the same description +context as a full enriched run, not bare column names. The full-run and +relationships-only paths MUST feed `llmProposals` descriptions consistently. + +### 6. No regression for normal ingests + +A normal `ktx ingest` with no `--stages` flag MUST produce the same artifacts as +today (descriptions, embeddings, manifest, relationships) and MUST preserve spec-19 +cross-run resume and spec-20 per-table description resume. 
The per-stage hash split +MUST NOT change a normal run's output, only which stages a *changed* input +invalidates. + +## Acceptance criteria + +- **Per-stage invalidation isolation:** flipping `scan.relationships.llmProposals` + re-runs only `relationships` (descriptions + embeddings resolve from cache, no LLM + description calls, no re-embedding); swapping the embeddings model re-runs only + `embeddings`; switching the description LLM re-runs only `descriptions`. Verified by + asserting no LLM description calls / no embed calls for the unaffected stages. +- **Flag parse + validation:** `--stages relationships` and + `--stages descriptions,embeddings` parse to the right set; `--stages foo`, + `--stages` (empty), and `--stages descriptions,foo` each fail with a clear + `InvalidArgumentError`. +- **Resume-aware force-rerun:** on a connection whose `descriptions` stage completed + with K failed/null tables (unchanged hash), `--stages descriptions` re-issues LLM + calls for exactly those K tables and leaves the already-good descriptions + untouched; the run completes and the K are now enriched. A changed descriptions + hash instead rebuilds all tables. +- **Preserve others:** after `--stages descriptions`, the on-disk `embeddings` and + `relationships` artifacts are byte-stable (unselected stages did not run). +- **Derived staleness warning:** after `--stages descriptions` changes the + descriptions, the run emits `enrichment_stage_stale` for `embeddings` (its + recomputed hash diverged) and does **not** emit it for `relationships` (decision + X); a subsequent `--stages embeddings` clears it. +- **Relationships context:** a `--stages relationships` run on an already-described + connection feeds the on-disk AI descriptions into `llmProposals` (verified: the + proposal prompt carries descriptions, not just column names). 
+- **No regression:** a normal uninterrupted `ktx ingest` (no flag) yields identical + artifacts and the same descriptions/embeddings/relationship output as today, with + spec-19/20 resume intact. + +## Non-goals + +- **Unifying `--stages` with the broader scan phases or `--no-query-history`.** The + namespace is built to extend later; this spec ships only the three enrichment + stages, composable with the existing query-history flag (D6). +- **A new `KtxScanMode` variant or a second stage-selection path.** One gate, + `(eligible) ∩ (selected)` (D2). +- **Coupling `relationships` to description content** (decision X, D5). Improving + descriptions does not invalidate or stale-flag joins. +- **Auto-cascading downstream re-runs.** Staleness is surfaced as a warning; the + operator chooses to cascade (D4). +- **Capturing prompt/code-level description-prompt changes in the hash.** The + descriptions hash keys on snapshot + LLM identity (config/model), not the prompt + text; a pure prompt improvement that does not change a hash input will not + force-rebuild already-good descriptions. Forcing that is out of scope — the + operator changes a real input or selects the stage with a changed config. +- **Re-implementing spec 19 (cross-run stage resume, completed-row store) or spec 20 + (per-table description resume, enforced timeout).** This spec composes above them: + it splits the key those stages resume on and adds the CLI surface to select and + force-re-run stages. +- **A general per-phase incremental-flush framework.** The selection mechanism is the + three enrichment stages; it is not a generic abstraction over every ingest phase. + +## Implementation orientation + +Line numbers drift; treat these as anchors, not addresses. The implementer owns the +design. 
+ +- **Coarse hash → per-stage hashes** — `context/scan/enrichment-state.ts` + (`computeKtxScanEnrichmentInputHash` `:78`, `ComputeKtxScanEnrichmentInputHashInput` + `:57`): replace with per-stage hash functions (or one function taking a per-stage + input slice). `context/scan/local-enrichment.ts` (`:611` single hash; the three + `runEnrichmentStage` calls at `descriptions` ~`:635`, `embeddings` ~`:666`, + `relationships` ~`:722`; `runEnrichmentStage` `:524` and its short-circuit + `:538–547`). The `descriptions` hash also feeds `generateDescriptions`' + `resumeStore.load(inputHash)` (`:345`). +- **Provider-identity decomposition** — `context/scan/local-scan.ts` + (`localScanProviderIdentity` `:241–255`, the enrichment call site `:498–537`): + split into `llmIdentity` / `embeddingIdentity`, drop the redundant `mode` / + `relationships` re-encoding, and pass each stage only its slice. +- **`descriptionDigest`** — `context/scan/local-enrichment.ts` (`buildEmbeddings` + `:457–486`) and `context/scan/embedding-text.ts` (`buildKtxColumnEmbeddingText` + `:17–44`): digest the resolved per-column/table description text that the embeddings + consume, and fold that digest into the embeddings hash. +- **CLI flag** — `commands/ingest-commands.ts` (`:26–49` option declarations, + `:51–104` action handler): add `--stages` with a custom parser that validates + against the canonical stage registry (`KTX_SCAN_ENRICHMENT_STAGES` in + `enrichment-state.ts:4`) and rejects unknown/empty names with `InvalidArgumentError`. + Thread through `public-ingest.ts` (`KtxScanArgs` build `:969–978`, `mode: 'enriched'` + `:973`) → `scan.ts` (`runKtxScan`) → `local-scan.ts` (`runLocalScan`) → + `runLocalScanEnrichment`. 
+- **Stage gating + force-rerun** — `context/scan/local-enrichment.ts`: gate each stage + block on membership in the selected set (`descriptions` `:632`, `embeddings` + `:663–665`, `relationships` `:720`); make a named stage bypass the completed-row + short-circuit in `runEnrichmentStage` while the inner `compute()` keeps the spec-20 + per-table resume. `KtxLocalScanEnrichmentInput` (`:60–85`) gains the selected-stage + set. +- **Staleness detection + warning** — `context/scan/local-enrichment.ts` (after the + stage blocks): recompute each unselected stage's current hash from on-disk state, + compare to the stored completed-row hash, push a recoverable warning on mismatch. + Add `enrichment_stage_stale` to the `KtxScanWarningCode` union in + `context/scan/types.ts` (alongside `relationship_detection_partial`). +- **Relationships description context** — `context/scan/local-enrichment.ts` + (`schema` built at `:621`/`:688`, passed to `discoverKtxRelationships` `:736–746`): + hydrate `schema` with the best-available descriptions (fresh-this-run or loaded from + the on-disk `_schema` via `loadExistingManifestState`, + `local-enrichment-artifacts.ts`) before relationship detection. +- **Stage store + resume record** — + `context/scan/sqlite-local-enrichment-state-store.ts` + (`local_scan_enrichment_stages`, PK `(connection_id, stage, input_hash)`, + `findCompletedStage`, `saveCompletedStage`); `createKtxScanDescriptionResumeStore` + (`local-enrichment-artifacts.ts:286–332`, path `:265–267`, inputHash gate + `:305–307`) — both now keyed on the relevant per-stage hash. No migration bridge. +- **Config inputs** — `context/project/config.ts` (`scanRelationshipsSchema` + `:171–218` incl. `llmProposals` `:174` and `detectionBudgetMs`; + `scan.enrichment.embeddings` model/dimensions/batchSize; `llm.models.default`, + `llm.provider.gateway.base_url`): the sources of each per-stage identity slice. 
+- **Tests** — per-stage invalidation isolation (flip one input, assert only the + matching stage recomputes); `--stages` parse/validate (good subsets + unknown/empty + rejected); resume-aware force-rerun (`--stages descriptions` retries only the null + tables, leaves good ones, completes); preserve-others (unselected artifacts + byte-stable); derived staleness (`enrichment_stage_stale` fires for embeddings after + a descriptions change, not for relationships; cleared by a later `--stages + embeddings`); relationships-only run feeds on-disk descriptions to `llmProposals`; + regression — a normal no-flag ingest yields identical artifacts with spec-19/20 + resume intact. +- After implementing, rebuild and re-link so the playground picks it up: + `pnpm run build && pnpm run link:dev`. +- **Docs:** add `--stages` to the `ktx ingest` CLI reference + (`docs-site/content/docs/cli-reference/`) and note the per-stage cache behavior + where enrichment/ingest is described. + +## Benchmark context (motivation, not a requirement) + +Surfaced during the Spider 2.0-Lite multi-backend ingestion (2026-06-24). A +level-aware audit found (a) a tail of BigQuery datasets with poor *column*-description +coverage (`google_dei` ~1%, `gnomAD`, `usfs_fia`, …) that want a **`descriptions`-only** +re-run with a longer timeout, and (b) a desire to **backfill joins** across all +already-ingested datasets after enabling `llmProposals` — without re-paying for +descriptions. Both were blocked by the coarse single `inputHash` (flipping +`llmProposals` or re-describing invalidated the whole enrichment) and the absence of a +stage-selective CLI flag. The benchmark merely exercised multi-backend +ingestion at scale; the gap and the fix are generic production operability. Do not +encode any benchmark specifics in the implementation. + +## Implementation notes + +Shipped on branch `write-feature-spec-wiki`. All six requirements implemented; +all acceptance criteria covered by tests.
+ +**What was built / where:** + +- **Per-stage hashes (D1, Req 1).** `context/scan/enrichment-state.ts`: removed the + coarse `computeKtxScanEnrichmentInputHash` and added + `computeKtxDescriptionsStageHash` (snapshot + `llmIdentity`), + `computeKtxEmbeddingsStageHash` (snapshot + `embeddingIdentity` + `descriptionDigest`), + `computeKtxRelationshipsStageHash` (snapshot + `relationshipSettings` + `llmIdentity`), + plus `computeKtxScanDescriptionDigest` and the `KtxScanLlmIdentity` / + `KtxScanEmbeddingIdentity` types. `KTX_SCAN_ENRICHMENT_STAGES` is now exported as the + canonical registry. `local-scan.ts` `localScanProviderIdentity` was split into + `localScanLlmIdentity` + `localScanEmbeddingIdentity` (dropping the redundant + `mode`/`relationships` re-encoding). `mode`/`detectRelationships` dropped out of the + keys. No migration bridge — the stage store + descriptions resume record just miss the + old coarse-keyed rows. +- **`descriptionDigest` (D1/D4).** `local-enrichment.ts`: extracted + `buildKtxColumnEmbeddingTexts(snapshot, descriptions)`, shared by the embeddings stage + and the digest, so the embeddings hash content-addresses the exact text the model sees. +- **`--stages` flag (D2/D6, Req 2).** `commands/ingest-commands.ts`: + `parseEnrichmentStagesOption` (Commander parser) validates against the registry, + rejects unknown/empty with `InvalidArgumentError`, returns an ordered de-duplicated + set; threaded through `KtxPublicIngestArgs` → `context-build-view` → `KtxScanArgs` → + `RunLocalScanOptions` → `KtxLocalScanEnrichmentInput`. One gate + (`(eligible) ∩ (selected)`); no new `KtxScanMode`. A selected-but-ineligible stage + emits a new `enrichment_stage_skipped` warning (never a silent no-op). 
+- **Force-rerun (D3, Req 3).** `runEnrichmentStage` gained `forceRecompute`; a named + stage bypasses the spec-19 completed-row short-circuit while `generateDescriptions` + still consults the spec-20 per-table resume record (retries only failed tables on an + unchanged hash). +- **Descriptions hydration + `llmProposals` context (D5, Req 5).** `runLocalScanEnrichment` + resolves best-available descriptions (fresh-this-run, else on-disk via a lazy + `loadPriorDescriptions` thunk wired from `local-scan.ts` → + `loadOnDiskDescriptionUpdates` in `local-enrichment-artifacts.ts`). `snapshotToKtxEnrichedSchema` + now merges `ai` descriptions, and `relationship-llm-proposal.ts` `buildEvidencePacket` + now carries the resolved description text — closing the latent gap on **both** the + full-run and relationships-only paths. +- **Derived staleness (D4, Req 4).** `enrichment_stage_stale` warning code + + `findLatestCompletedStage` on the state store (interface + sqlite + test store). After a + selective run, each unselected stage with a completed row is compared against its + freshly recomputed hash; a mismatch warns and names the cascade command. Relationships + are never flagged by a description change (decoupled per D5). +- **Docs.** `docs-site/content/docs/cli-reference/ktx-ingest.mdx`: `--stages` flag row, a + "Selecting enrichment stages" section (per-stage cache, force-rerun, staleness), and + examples. + +**Deviation from the spec — embeddings hydration is descriptions-only.** D5 states a +relationships-only run should hydrate "AI descriptions **and** embeddings" from the +on-disk `_schema`. Investigation found the `_schema` manifest shards store only +descriptions; embedding vectors are written to a **syncId-scoped** `enrichment/embeddings.json` +that no code reads back, and each run mints a fresh syncId — so there is no durable +per-connection embeddings artifact to hydrate from. 
A relationships-only run therefore +hydrates **descriptions** (required for, and verified against, the `llmProposals` +acceptance criterion) but **not** embeddings. Consequence: a `--stages relationships` +backfill gets deterministic + name-based + LLM-proposed candidates (the point of +`llmProposals`), but not the embedding-similarity candidates a full run would add. +Durable embeddings hydration (persist vectors at a stable per-connection path, or read +them from the vector index) is a clean follow-on and was left out of scope. + +**Tests:** `enrichment-state.test.ts` (per-stage hash stability + isolation), +`commands/ingest-commands.test.ts` (parser good/bad subsets, threading, text-capture +guard), `local-enrichment.test.ts` (force-rerun bypasses short-circuit + preserves +others, naming all three forces a full recompute, per-stage invalidation isolation, +prerequisite warning, on-disk descriptions reach `llmProposals`, resume-aware forced +descriptions rerun, derived `enrichment_stage_stale` fires for embeddings/not +relationships and clears after re-embed). Full `pnpm --filter @kaelio/ktx run test`, +`type-check`, `dead-code`, and `build` pass. (One pre-existing unrelated failure in +`test/skills/analytics-skill-content.test.ts` — the analytics `SKILL.md` lacks a +`**Window functions**` heading the test expects — was present before this work and left +untouched.) + +--- + +## ⚠️ Defect found in post-implementation validation (2026-06-24) + +**`--stages` subset excluding `descriptions` WIPES existing on-disk descriptions.** Violates Req +"preserve-others / a selective run never deletes another stage's artifacts." + +**Reproduction (deterministic):** +- `northwind` before: 110 `ai:` column/table descriptions, 0 join edges. +- `ktx-dev ingest northwind --stages relationships` → completes in ~35s, adds **22 join edges** ✅ + but the rewritten `public.yaml` has **0 descriptions** (no `ai:`, no `db:`, columns bare). 
❌ +- A full `ktx-dev ingest northwind` (all stages) restores 110 descriptions + keeps the 22 joins. + +**Likely root cause:** the relationships-only path rewrites the schema from the raw snapshot + only the +freshly-run stage. The implementation notes claim `snapshotToKtxEnrichedSchema` merges `ai` descriptions +and that descriptions are hydrated "fresh-this-run, else on-disk via `loadPriorDescriptions`" — but on the +**write path** of a subset run the prior descriptions are NOT merged into the emitted schema (they reach +the `llmProposals` evidence packet only). So the on-disk `_schema` loses them. + +**Impact:** blocks the intended joins-everywhere backfill (`--stages relationships` across all dbs) and the +`--stages descriptions`-only re-runs — either would destroy the unselected stage's artifacts across every +db. Caught on a 1-db validation before any rollout. + +**Acceptance fix:** after any `--stages` subset, the on-disk `_schema` must **retain all prior `ai:`/`db:` +descriptions** (and prior joins when descriptions-only) for stages not named — only the named stages' +artifacts change. Add a regression test that ingests a fully-enriched fixture, runs `--stages relationships`, +and asserts description count is unchanged while joins increase. + +### ✅ Fixed (2026-06-24) + +**Real root cause (deeper than the first diagnosis):** the wipe happened in **two** places, and the first +fix attempt only addressed one. `runLocalScan` (`context/scan/local-scan.ts`) writes the **structural** +manifest shard from the bare snapshot *before* enrichment runs; that write merges with the on-disk shard, +but the merge (`mergeDescriptionsPreservingExternal`, `live-database/manifest.ts`) treats `ai`/`db` as +**scan-managed** and overwrites them with whatever the run emits — and the structural write emits none. 
So a +subset run deleted the descriptions on the structural pre-write, *then* `runLocalScanEnrichment` read the +already-wiped shard via `loadPriorDescriptions` and had nothing to restore. (A unit-level enrichment test +passed because it never exercised the structural pre-write — a divergent-harness miss; the regression test +was rewritten to go through the full `runLocalScan` path.) + +**What changed:** +- `runLocalScanEnrichment` (`local-enrichment.ts`) now returns the **best-available** descriptions + (`resolveDownstreamDescriptions()` — fresh-this-run if `descriptions` ran, else the on-disk ones) as + `descriptionUpdates`, instead of `[]` when the stage is skipped — so the enrichment write re-applies them. +- `runLocalScan` (`local-scan.ts`) now, on a subset run, **captures the prior on-disk descriptions before + the structural manifest write** and feeds them to both the structural write and enrichment — so the + structural pre-write preserves them too (robust even if relationship detection later fails). +- Joins were already preserved for `--stages descriptions` via the existing manual/inferred + `preservedJoins` path; verified by a symmetric test. + +**Tests:** `local-scan.test.ts` — a full `runLocalScan` `--stages relationships` run preserves on-disk `ai` +descriptions while adding a join (RED without the fix, GREEN with it). `local-enrichment.test.ts` — the +enrichment-layer contract (`--stages relationships` preserves descriptions / `--stages descriptions` +preserves joins). + +**Live validation (northwind, 15 tables):** `--stages relationships` BEFORE `ai:110 joins:22` → AFTER +`ai:110 joins:22` (descriptions intact; previously wiped to 0). `--stages descriptions` restored the +descriptions from the spec-20 resume record (`ai:0 → ai:110`) with **no** LLM calls while keeping `joins:22`. +Full `pnpm --filter @kaelio/ktx run test` (3089 passed), `type-check`, `dead-code`, and `build` pass. 
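The preserve-others rule behind this fix can be illustrated with a small sketch. It is not the actual `runLocalScan` or `runLocalScanEnrichment` code: `Stage`, `Description`, and `resolveDescriptionsForWrite` are hypothetical names, and the only behavior assumed is the one stated above (fresh descriptions if the `descriptions` stage ran, otherwise the ones captured from disk before the structural write).

```typescript
// Hypothetical sketch of the preserve-others rule for `--stages` subset runs.
// `Stage`, `Description`, and `resolveDescriptionsForWrite` are illustrative
// names, not identifiers from the ktx codebase.
type Stage = "descriptions" | "embeddings" | "relationships";

interface Description {
  table: string;
  column?: string; // omitted for a table-level description
  text: string;
}

// Choose what the schema write should carry: fresh descriptions when the
// `descriptions` stage ran, otherwise the ones captured from disk before
// any structural write could overwrite them.
function resolveDescriptionsForWrite(
  stages: readonly Stage[],
  freshThisRun: Description[],
  priorOnDisk: Description[],
): Description[] {
  return stages.indexOf("descriptions") !== -1 ? freshThisRun : priorOnDisk;
}

const priorOnDisk: Description[] = [
  { table: "orders", text: "Customer orders" },
  { table: "orders", column: "ship_date", text: "Date the order shipped" },
];

// A relationships-only run re-applies both prior descriptions unchanged.
const written = resolveDescriptionsForWrite(["relationships"], [], priorOnDisk);
console.log(written.length); // 2
```

The important detail is ordering: the prior descriptions have to be read before any write that treats `ai`/`db` as scan-managed, otherwise there is nothing left to pass in as `priorOnDisk`.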
diff --git a/spider2-specs/todo/03-multi-connection-routing-in-analytics-skill.md b/spider2-specs/todo/03-multi-connection-routing-in-analytics-skill.md new file mode 100644 index 00000000..ad70e83d --- /dev/null +++ b/spider2-specs/todo/03-multi-connection-routing-in-analytics-skill.md @@ -0,0 +1,66 @@ +# Multi-connection routing guidance in the ktx-analytics skill + +## Problem + +The agent-facing `ktx-analytics` skill (installed into agent environments via +the ktx skills/install mechanism, see `.ktx/agents/install-manifest.json` in +projects) describes the query workflow — wiki_search → sl_read_source → +sl_query / sql_execution — but assumes the connection is obvious. In a +multi-connection project nothing tells the agent to *first decide which +connection the question is about*, and several tools silently require it: + +- `sql_execution`, `sl_read_source`, `entity_details`: `connectionId` + **required**; +- `sl_query`, `discover_data`, `dictionary_search`: optional, but + auto-inference only works with exactly one connection + (`local-query.ts` `resolveLocalConnectionId` ~29-38 — throws with zero or + multiple connections). + +An agent that skips routing either errors out or, worse, queries the wrong +database when names overlap. + +## Generic use case + +Any ktx project with more than one connection — the common shape for a data +org (warehouse + product DB + events DB). Routing is the first step of every +question, and the skill should encode it so individual agents don't have to +rediscover it. + +## Requirements + +1. **Add an explicit routing step (step 0) to the skill's workflow:** + - Call `connection_list` to see what exists. + - Match the question's domain to a connection using connection ids/names, + `discover_data` hits, and wiki context — not guesswork. + - If genuinely ambiguous after discovery, ask the user rather than pick. +2. 
**Thread the resolved `connectionId` everywhere:** all subsequent + `sl_query`, `sql_execution`, `sl_read_source`, `entity_details`, + `dictionary_search`, `discover_data` calls, and `wiki_search` once spec 01 + lands (search scoped to the resolved connection plus unscoped pages). +3. **Single-connection projects stay frictionless:** the skill should say + routing is trivial when `connection_list` returns one entry — don't add a + mandatory ceremony step for the common simple case. +4. **Capture routing knowledge:** when the agent learns a non-obvious + question-domain → connection mapping, the skill should encourage + `memory_ingest` so the mapping becomes wiki knowledge for next time. + +This is a docs/prompt change in the skill content (plus any skill-install +plumbing if the skill is versioned); no engine changes required. + +## Acceptance criteria + +- In a fixture project with ≥2 connections, an agent following the skill + resolves the correct connection before its first data query, and no tool + call fails with "connectionId is required". +- In a single-connection project the skill-driven flow is unchanged (no + extra mandatory steps). +- Skill text nowhere assumes a default/implicit connection. + +## Benchmark context (motivation only) + +Spider 2.0-Lite local subset = 30 SQLite connections in one project; every +one of the 135 questions targets exactly one of them. Connection ids are set +to the benchmark's database names, so with this skill guidance routing is +mechanical (`connection_list` + name match) and needs no benchmark-specific +instructions — which is the point: the harness gives the agent only the +question text. 
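The routing step in the requirements above could be sketched as follows. This is only an illustration of the decision rule, not skill text or a ktx API: `ProjectConnection`, `Routing`, and `routeConnection` are hypothetical names, and `hits` stands in for whatever connection ids discovery (for example `discover_data` results or wiki context) surfaces for the question.

```typescript
// Hypothetical sketch of the "step 0" routing decision. None of these names
// are ktx APIs; they only illustrate the rule described in the requirements.
interface ProjectConnection {
  id: string;
  label: string;
}

type Routing =
  | { kind: "resolved"; connectionId: string }
  | { kind: "ask_user"; candidates: string[] };

function routeConnection(connections: ProjectConnection[], hits: string[]): Routing {
  if (connections.length === 1) {
    // Single-connection project: routing is trivial, no extra ceremony.
    return { kind: "resolved", connectionId: connections[0].id };
  }
  const known = connections.map((c) => c.id);
  // Keep only hits that name a configured connection, without duplicates.
  const matched = hits.filter(
    (id, i) => known.indexOf(id) !== -1 && hits.indexOf(id) === i,
  );
  if (matched.length === 1) {
    return { kind: "resolved", connectionId: matched[0] };
  }
  // Zero or several plausible connections: ask instead of guessing.
  return { kind: "ask_user", candidates: matched.length > 0 ? matched : known };
}

const project: ProjectConnection[] = [
  { id: "warehouse", label: "Warehouse" },
  { id: "product_db", label: "Product DB" },
];
console.log(routeConnection(project, ["product_db"]).kind);
```

Once a `resolved` result is in hand, the same `connectionId` would be passed to every later tool call for that question.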
diff --git a/spider2-specs/todo/04-offline-schema-docs-adapter.md b/spider2-specs/todo/04-offline-schema-docs-adapter.md new file mode 100644 index 00000000..d37fd97f --- /dev/null +++ b/spider2-specs/todo/04-offline-schema-docs-adapter.md @@ -0,0 +1,51 @@ +# Offline schema-documentation ingest adapter + +> **Priority: LOW / backlog.** Explicitly **not** needed for the Spider +> 2.0-Lite benchmark — we verified the benchmark's offline schema files +> (DDL dumps + sample-row JSONs) are a strict subset of what the live SQLite +> scan already captures (DDL, types, PKs, sample values, cardinality +> profiling). Implement specs 01-03 first; pick this up only if a real +> use case shows up. + +## Problem + +The ingest pipeline's schema knowledge comes from live database scans +(`live-database` adapter) or BI-tool adapters (metabase, looker, dbt…). +There is no adapter for **offline schema documentation**: files describing +tables/columns that exist outside the database — column-description +spreadsheets, data dictionaries, DDL exports with comments, hand-maintained +schema docs. + +## Generic use case + +Teams whose richest schema documentation lives outside `information_schema`: +a wiki export of column meanings, a governance tool's CSV data dictionary, +DDL files with COMMENT clauses the production scan can't see, or +environments where ktx has no live access at all and must build the semantic +layer from documentation alone. + +## Requirements (sketch — refine when picked up) + +1. A new ingest adapter (peer of `metabase`/`dbt` in + `context/ingest/adapters/`) consuming a configured local path of schema + docs per connection. +2. Input formats to start: DDL files (`.sql`/`.csv` of CREATE statements) + and tabular column dictionaries (CSV/JSON: table, column, description, + …). Extensible to other formats. +3. 
Output: **enrichment, not duplication** — merge descriptions/metadata + into the manifest-backed semantic-layer sources and dictionary for the + matching connection. Where a live scan exists, offline docs fill gaps + (descriptions, enum meanings, deprecation notes) and flag drift + (documented column missing from live schema and vice versa) rather than + creating parallel wiki pages that duplicate schema info. +4. Works without live database access (documentation-only bootstrap of a + connection's semantic layer), clearly marked as unverified-against-live. + +## Acceptance criteria (sketch) + +- Given a connection with a live scan plus an offline column dictionary, + semantic-layer sources carry the documented descriptions, and drift + between doc and live schema is reported. +- Given a connection with docs only (no live access), `sl list`/`sl read` + expose manifest sources built from the docs. +- No wiki pages are created that merely restate table/column lists. diff --git a/spider2-specs/todo/05-composite-key-join-detection.md b/spider2-specs/todo/05-composite-key-join-detection.md new file mode 100644 index 00000000..0f3a6c7e --- /dev/null +++ b/spider2-specs/todo/05-composite-key-join-detection.md @@ -0,0 +1,59 @@ +# Composite-key (multi-column) join detection + +> Priority: MEDIUM. Found empirically during the first Spider2-lite sqlite +> smoke test (2026-06-13): relationship detection emitted **zero joins** for a +> database whose fact tables are linked only by composite keys. Agents still +> answered correctly by inferring the join from shared `grain`, so this didn't +> cost benchmark points — but it forces inference that explicit joins would +> remove, and the gap is generic. + +## Problem + +Relationship detection appears to emit only single-column joins. 
For the IPL +sqlite database, every table came back with `joins=0`, even though its fact +tables are connected by a 4-column composite key +(`match_id, over_id, ball_id, innings_no`) shared across `ball_by_ball`, +`batsman_scored`, `extra_runs`, and `wicket_taken`. The semantic layer did +correctly record that shared key as each table's `grain`, which is why agents +could recover the relationship — but no `joins:` entries were produced for the +fact-to-fact links. + +## Generic use case + +Event/fact tables keyed by composite business keys are common: ledger lines +(`account_id, period, line_no`), telemetry (`device_id, ts, metric`), sports +ball-by-ball, EAV/log schemas. Whenever there are no single-column FKs but a +multi-column key recurs across tables, ktx should detect and document the join +so agents (and `sl_query`) don't have to infer it. + +## Requirements + +1. Relationship detection considers **multi-column** join candidates, not just + single-column ones. A strong signal already exists in ktx: when two tables + share an identical (or subset/superset) declared `grain`, that grain is a + prime composite-join candidate. +2. Emitted joins carry the full composite condition, e.g. + `on: a.match_id = b.match_id AND a.over_id = b.over_id AND a.ball_id = b.ball_id AND a.innings_no = b.innings_no`, + with a sensible `relationship` cardinality. +3. The existing validation/threshold machinery + (`scan.relationships.acceptThreshold` etc.) applies to composite candidates + too; profile-based validation should check join selectivity on the full key. +4. No regression for single-column joins; don't explode combinatorially — + bound candidate generation (e.g. only consider shared-grain keys and + declared/!inferred PK overlaps, cap column count). +5. `sl_query` can compile a join across a composite-key relationship. 
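As a rough illustration of requirements 1, 2, and 4, the sketch below derives a composite `on` condition from two tables that declare the same grain. It is a simplification under stated assumptions: `TableGrain`, `compositeJoinOn`, and the `MAX_KEY_COLUMNS` cap are hypothetical, only the identical-grain case is handled (not subset/superset), and no profile-based validation is attempted.

```typescript
// Hypothetical sketch: propose a composite join when two tables declare the
// same multi-column grain. Names and the column cap are illustrative only.
interface TableGrain {
  table: string;
  grain: string[]; // declared grain columns
}

const MAX_KEY_COLUMNS = 6; // assumed cap to bound candidate generation

// Returns the full multi-column `on` condition, or null when the grains are
// not identical, have fewer than two columns, or exceed the cap.
function compositeJoinOn(a: TableGrain, b: TableGrain): string | null {
  const shared = a.grain.filter((col) => b.grain.indexOf(col) !== -1);
  const identical =
    shared.length === a.grain.length && shared.length === b.grain.length;
  if (!identical || shared.length < 2 || shared.length > MAX_KEY_COLUMNS) {
    return null;
  }
  return shared.map((col) => `a.${col} = b.${col}`).join(" AND ");
}

const key = ["match_id", "over_id", "ball_id", "innings_no"];
const on = compositeJoinOn(
  { table: "ball_by_ball", grain: key },
  { table: "batsman_scored", grain: key },
);
console.log(on);
```

A candidate produced this way would still need to pass the existing acceptance threshold and selectivity checks before being emitted as a join.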
+ +## Acceptance criteria + +- For a fixture with two tables sharing a 3- or 4-column grain and no + single-column FK, ingest emits a composite join between them with the full + multi-column `on` condition. +- `sl read ` shows the composite join; `sl_query` can traverse it. +- Single-column join detection is unchanged on existing fixtures. + +## Benchmark context (motivation only) + +IPL (and similar ball-by-ball/event schemas in the Spider2-lite local set) +have no single-column FKs; their joins are entirely composite. Explicit +composite joins would let the agent rely on documented relationships instead +of inferring them from grain. diff --git a/spider2-specs/todo/13-canonical-authoritative-source-measures.md b/spider2-specs/todo/13-canonical-authoritative-source-measures.md new file mode 100644 index 00000000..f80c4c2d --- /dev/null +++ b/spider2-specs/todo/13-canonical-authoritative-source-measures.md @@ -0,0 +1,89 @@ +# Canonical / authoritative-source measures in the semantic layer + +## Problem + +Many schemas contain an **authoritative table** that already encodes a metric's +business rules — an official standings/leaderboard table, a general-ledger or +period-end balance table, a materialized summary/snapshot — alongside the **raw +transactional** rows the metric *could* be re-derived from. Re-deriving the metric +from the raw rows frequently diverges from the canonical definition, because the +authoritative table bakes in rules the raw data doesn't expose (drop-scores, +penalties, adjustments, reconciliations, as-of snapshots). + +Today ktx's semantic layer doesn't distinguish "authoritative summary" tables from +raw fact tables, so the analytics skill has no signal that one source is canonical +for a metric — and the agent often re-derives from raw rows and gets a defensible- +but-different number. 
+ +## Generic use case (independent of any benchmark) + +- "Championship points per competitor this season" — a sports schema may hold both + raw per-event results AND an official standings table that applies drop-scores + and penalties. The standings table is the canonical source; summing raw results + is wrong. +- "Account balance as of month end" — prefer a ledger/balance-snapshot table over + re-summing every transaction (which may miss adjustments). +- "Monthly recognized revenue" — prefer a finance summary table over re-deriving + from line items. + +In each case a real analyst should be steered to the authoritative source. + +## Requirements + +1. **Detect candidate authoritative tables during ingest.** Heuristics only — + e.g. tables whose name/role suggests a summary (`*standings*`, `*balance*`, + `*summary*`, `*snapshot*`, `*ledger*`), tables that are a coarser-grained + aggregation of another table, or tables documented as authoritative in provided + docs/wiki. Surface them as such in the semantic layer. + +2. **Represent the metric as an SL measure backed by the authoritative table.** + Where a canonical source exists, define the measure over it so a query for that + metric resolves to the authoritative source by default. (The analytics skill + already prefers SL measures over raw SQL — spec 07/skill rule — so this plugs + into existing behavior.) + +3. **Keep raw re-derivation available** as a non-default alternative; the measure + documents which source it uses and why, so the choice is transparent and + overridable. + +## Fairness boundary (HARD — this spec is fairness-sensitive) + +The choice of authoritative source MUST be driven by **schema/structure or provided +documentation** — the table exists, is structured as a summary, or is documented as +authoritative. It must **NEVER** be driven by observing which interpretation matches +a benchmark gold answer. 
Concretely: + +- ✅ Fair: "a table named/structured as official standings exists and aggregates the + raw results → treat it as the canonical points source." +- ❌ Forbidden: "for question X, use table T because that's what reproduces the gold + result." That is per-instance gold-tuning (cheating) and must not appear in ktx, + the ingest heuristics, or any mapping. + +If a metric is genuinely underspecified and only the gold answer disambiguates the +intended source, it is **not fairly fixable** — leave it. Whether this feature helps +any specific benchmark instance is therefore *conditional* on a real schema/doc basis +existing; do not manufacture one. + +## Leak-safety (hard constraint) + +No benchmark table names, queries, gold values, or instance-specific mappings +anywhere in the spec, the heuristics, or tests. Examples must be synthetic/generic. + +## Acceptance criteria + +- Ingest can flag candidate authoritative/summary tables via generic heuristics + (name/role/aggregation/doc signals), with no benchmark-specific rules. +- The semantic layer can express a measure as backed by a designated authoritative + source; the skill resolves the metric to it by default; raw re-derivation remains + available and the choice is documented. +- Tests use synthetic schemas only; no gold-derived mappings exist anywhere. + +## Benchmark context (motivation only) + +Some SQLite-subset metric questions are underspecified between a raw-derivation and +an authoritative-table interpretation (e.g. season points from raw results vs an +official standings table). This is the roadmap's "canonical semantic-layer measures +from schema + provided docs" item. It is fair ONLY where schema/docs support one +source; the gold-only cases are explicitly out of scope (fixing them would require +tuning to gold). Larger than the spec 09–12 skill-content tweaks: this touches +ingest + the semantic-layer model. 
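The name-based part of the detection heuristic in requirement 1 could look roughly like the sketch below. It is an assumption-laden illustration: `findAuthoritativeCandidates` and `AuthoritativeCandidate` are hypothetical names, the hint list is taken from the name patterns listed in the requirements, the table names are synthetic, and the aggregation and documentation signals are not modeled.

```typescript
// Hypothetical sketch of the name-based signal only. It flags candidates for
// review; it does not decide that a table is authoritative.
const SUMMARY_NAME_HINTS = ["standings", "balance", "summary", "snapshot", "ledger"];

interface AuthoritativeCandidate {
  table: string;
  matchedHint: string;
  reason: string; // kept so the choice stays transparent and overridable
}

function findAuthoritativeCandidates(tables: string[]): AuthoritativeCandidate[] {
  const candidates: AuthoritativeCandidate[] = [];
  for (const table of tables) {
    const lower = table.toLowerCase();
    for (const hint of SUMMARY_NAME_HINTS) {
      if (lower.indexOf(hint) !== -1) {
        candidates.push({
          table,
          matchedHint: hint,
          reason: `table name contains "${hint}"`,
        });
        break; // one flag per table is enough
      }
    }
  }
  return candidates;
}

// Synthetic schema: two raw fact tables and two summary-shaped tables.
const flagged = findAuthoritativeCandidates([
  "event_results",
  "season_standings",
  "transactions",
  "month_end_balance",
]);
console.log(flagged.map((c) => c.table));
```

Because the input is only the schema's own table names, a heuristic of this shape stays on the fair side of the boundary described above.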
diff --git a/spider2-specs/todo/17-lifecycle-event-metrics.md b/spider2-specs/todo/17-lifecycle-event-metrics.md new file mode 100644 index 00000000..7b8a6e2b --- /dev/null +++ b/spider2-specs/todo/17-lifecycle-event-metrics.md @@ -0,0 +1,57 @@ +# 17 — Lifecycle-event metrics in the semantic layer + +**Status:** draft (intake). Requirement-level; the implementer refines into `specs/17-*.md`. + +## Problem / requirement + +Many entities carry **several lifecycle timestamps** for the same record — an order has +`placed/purchased`, `approved`, `shipped/carrier-handoff`, `delivered`, and `estimated-delivery` +times; a ticket has `opened`, `assigned`, `resolved`, `closed`; a payment has `initiated`, +`authorized`, `settled`. When an analyst asks for a count/volume/rate of records **in a named +completed state, by period** ("delivered orders by month", "resolved tickets per week", "settled +payments by day"), the correct time anchor is the timestamp of *that named event*, not the +record-creation timestamp. + +Today ktx ingests these timestamps as **peer date dimensions** with good column descriptions, but it +does **not model the lifecycle event itself** — so nothing in the semantic layer tells a solver (or a +human) that "delivered orders over time" should be anchored to the delivery timestamp. The choice is +left to per-query reasoning, which is exactly where it goes wrong. (A companion analytics-skill rule +now nudges the *solver* — ktx commit `226341cf` — but the durable, reusable home for this is the +**model**, so any consumer of the semantic layer gets it for free.) + +**Requirement:** during enrichment/ingestion, when a source has a state/status column plus one or more +lifecycle timestamps whose names/descriptions map to that state's values, infer **lifecycle-event +metrics** — e.g. 
a `delivered_orders` metric defined as `COUNT(*)` filtered to the delivered state with +its **default time dimension** set to the matching event timestamp (`order_delivered_customer_date`), +distinct from the creation-anchored `orders` metric. Keep the inference conservative and +source-traceable (column names + enriched descriptions only); never invent a state/timestamp pairing +that the schema/descriptions don't independently support. + +## Sketch (implementer to refine) + +- Detect (state column, lifecycle-timestamp) pairs from column names + enrichment descriptions + (e.g. status value `delivered` ↔ `*_delivered_*_date`; `resolved` ↔ `resolved_at`). +- Emit a metric per detected completed state: filter = the state predicate, grain = record, + `defaultTimeDimension` = the matching event timestamp. +- Surface these via `discover_data` / `entity_details` so "delivered orders over time" retrieves the + delivery-anchored metric rather than a bare row count over the creation date. +- Gate behind the existing `enrichment.mode: llm` path; respect the conservative-inference bar + (precision over recall; a wrong pairing is worse than none). + +## Generic use case (independent of the benchmark) + +Any operational/transactional schema (e-commerce orders, support tickets, payments, claims, shipments) +has this multi-timestamp lifecycle shape. An analyst asking "how many X were <state> last +month" almost always means *entered that state* last month. Encoding the event→timestamp mapping in the +model makes every downstream question (BI tool, ad-hoc SQL, an LLM agent) pick the right anchor without +re-deriving it, and prevents the silent "grouped by when they started" error.
+ +## Benchmark context (motivation only — not a benchmark-specific rule) + +Surfaced by the `spider2-autofix` loop, round r1: Spider 2.0-Lite `Brazilian_E_Commerce` cases local028 +("delivered orders for each month") and local031 ("highest monthly delivered orders volume") both failed +because the solver bucketed delivered orders by `order_purchase_timestamp` instead of +`order_delivered_customer_date`. The trace showed the solver had both columns and even compared both +date bases for local031 before choosing purchase. A skill-text rule flipped both cases this round; this +spec is the **model-layer** form of the same fix, which would make the right anchor the default for any +solver and any lifecycle schema.
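The detection step from the sketch section of this spec could be prototyped along these lines. This is a minimal illustration under assumptions, not a proposed implementation: `LifecyclePair`, `detectLifecyclePairs`, and the timestamp-suffix list are hypothetical, matching uses column names only (enriched descriptions are not consulted), and a state is paired only when exactly one timestamp-like column names it.

```typescript
// Hypothetical sketch of conservative (state value, timestamp column) pairing
// from column names alone. Names and suffix list are illustrative only.
interface LifecyclePair {
  state: string;
  timestampColumn: string;
}

const TIMESTAMP_SUFFIXES = ["at", "date", "time", "timestamp"];

function nameTokens(column: string): string[] {
  return column.toLowerCase().split("_");
}

function isTimestampLike(column: string): boolean {
  const parts = nameTokens(column);
  return TIMESTAMP_SUFFIXES.indexOf(parts[parts.length - 1]) !== -1;
}

function detectLifecyclePairs(stateValues: string[], columns: string[]): LifecyclePair[] {
  const pairs: LifecyclePair[] = [];
  for (const state of stateValues) {
    const needle = state.toLowerCase();
    const matches = columns.filter(
      (c) => isTimestampLike(c) && nameTokens(c).indexOf(needle) !== -1,
    );
    // Precision over recall: skip the state when zero or several columns match.
    if (matches.length === 1) {
      pairs.push({ state, timestampColumn: matches[0] });
    }
  }
  return pairs;
}

const ticketPairs = detectLifecyclePairs(
  ["open", "resolved", "closed"],
  ["opened_at", "resolved_at", "closed_at", "assignee_id"],
);
console.log(ticketPairs);
```

In this example `open` gets no pairing because no column token equals it exactly; that is the intended conservative behavior, since a wrong anchor is worse than a missing one.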