mirror of
https://github.com/Kaelio/ktx.git
synced 2026-07-01 08:59:39 +02:00
feat: ktx batch — scan resilience, analytics SQL craft, connector hardening (#312)
* docs: add spider2-specs handoff directory for benchmark-driven feature specs
* feat(cli): connection-scoped wiki pages
Add an optional `connections` frontmatter field so database-specific wiki
knowledge can be scoped to a connection without polluting searches about other
databases, while page keys stay a flat, globally-unique namespace.
- connections: single string or list; absent/empty ⇒ unscoped (applies to all)
- wiki_search (MCP) and `ktx wiki --connection` return unscoped ∪ matching
pages, filtered at the disk-load seam so all three search lanes draw their
candidate pool from the already-scoped set (not a post-filter)
- wiki_write accepts connections with REPLACE semantics and rejects a
connection-scoped write whose key collides with a disjoint-connection page
(data-loss guard; hard error, no silent clobber)
- explicit connection-id args (wiki_search, memory_ingest, ktx wiki) are
validated against ktx.yaml via a shared assertConfiguredConnectionId, which
also closes the prior gap where memory_ingest's connectionId was unvalidated;
persisted ids absent from config warn (not fail) in `ktx status`
- prompt guidance in the wiki_capture skill and external-ingest prompt; the
session connectionId is surfaced to the memory agent and ingest work units
Implements spider2-specs/specs/01-connection-scoped-wiki.md; intake draft moved
to spider2-specs/done/.
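A minimal sketch of the scoping rule described above, not the shipped code. `WikiPage`, `normalizeConnections`, and `scopePages` are hypothetical names; only the `connections` field and the "unscoped ∪ matching" semantics come from this message.

```typescript
// Hypothetical page shape; only the `connections` frontmatter field is from the commit text.
type WikiPage = { key: string; connections?: string | string[] };

// Normalize the frontmatter value: a single string or a list; absent/empty means unscoped.
function normalizeConnections(value: string | string[] | undefined): string[] {
  if (value === undefined) return [];
  const list = Array.isArray(value) ? value : [value];
  return list.map((id) => id.trim()).filter((id) => id.length > 0);
}

// Return unscoped pages plus pages scoped to the requested connection.
// Meant to run at load time, so every search lane draws from the same scoped pool.
function scopePages(pages: WikiPage[], connectionId?: string): WikiPage[] {
  if (connectionId === undefined) return pages;
  return pages.filter((page) => {
    const scopes = normalizeConnections(page.connections);
    return scopes.length === 0 || scopes.some((id) => id === connectionId);
  });
}
```

Filtering before ranking (rather than after) means a lane's candidate limit is spent only on pages that can actually be returned.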
* docs(spider2-specs): add specs/ refinement stage and composite-key join spec
Describe the todo/ → specs/ → done/ pipeline in the README (refined specs are
the durable artifact; intake drafts move to done/ on ship) and add a
MEDIUM-priority spec for multi-column composite-key join detection found during
the first sqlite smoke test.
* feat(cli): add --verbatim ingest mode for authoritative documents
Store each --text/--file document body unchanged as a GLOBAL wiki page
instead of routing it through the memory agent, which may rewrite,
condense, or re-title it. The LLM derives only metadata (summary, tags,
sl_refs) and only for frontmatter fields the document does not already
set; the stored body is written by code and never edited.
- Deterministic page key: files derive it from the filename, inline
text from its leading Markdown heading (headless inline text is
rejected — pass it as --file instead).
- Idempotent: re-running the same body is a no-op; a different body at
the same key fails loudly rather than overwriting.
- Works with llm.provider.backend: none, deriving a degraded summary
from the heading or first sentence.
- Existing frontmatter (including unmodeled fields like effective_date)
passes through untouched; --connection-id scopes the page.
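The idempotency and key-derivation rules above can be sketched as follows. This is an illustration under assumptions: the in-memory `Map` stands in for the wiki directory, and `writeVerbatim`, `keyFromLeadingHeading`, and the slug rule are hypothetical, not the real implementation.

```typescript
type WriteOutcome = "created" | "unchanged";

// Store the body unchanged; same body again is a no-op, a different body fails loudly.
function writeVerbatim(store: Map<string, string>, key: string, body: string): WriteOutcome {
  const existing = store.get(key);
  if (existing === undefined) {
    store.set(key, body); // written as-is, never rewritten
    return "created";
  }
  if (existing === body) return "unchanged";
  throw new Error(`page "${key}" already exists with a different body`);
}

// Derive a key from the leading Markdown heading of inline text.
// The slug rule (lowercase, non-alphanumerics to "-") is an assumption for illustration.
function keyFromLeadingHeading(text: string): string {
  const firstLine = text.trimStart().split("\n", 1)[0] ?? "";
  const match = /^#{1,6}\s+(.+?)\s*$/.exec(firstLine);
  if (match === null) {
    throw new Error("inline text has no leading heading; pass it as --file instead");
  }
  const title = match[1] ?? "";
  return title.toLowerCase().replace(/[^a-z0-9]+/g, "-").replace(/^-+|-+$/g, "");
}
```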
* feat(cli): SQL-authoring craft and per-dialect notes tool for the analytics skill
Spec 07: add a dialect-agnostic <sql_craft> block to the ktx-analytics skill (schema discovery, composition, window-function correctness, numeric precision, answer completeness) with one worked window-then-filter example. Workflow steps gain pointers into it; existing guidance is unchanged.
Spec 08: add a read-only sql_dialect_notes MCP tool returning a connection's engine SQL conventions (FQTN form, identifier quoting/case, date/time, top-N idiom, JSON access), resolved through the existing sqlAnalysisDialectForDriver path. Notes are per-dialect markdown files under context/sql-analysis/dialects, served by the tool and copied to dist (package-internal, never installed). Non-SQL connections return a clear KtxExpectedError. The flat skill gains a one-line pointer to the tool.
Both spider2-specs intake drafts move to done/ with implementation notes.
* feat(cli): tolerate objects that fail introspection during scan
Isolate per-object introspection failures so one broken or inaccessible object no longer zeroes out a connection's whole semantic layer: the sqlite and bigquery connectors introspect each object defensively (tryIntrospectObject), the live-database adapter records a scan outcome and fetch report, and enabled_tables accepts catalog.db.name, db.name, or bare names with a clear no-match error. Includes matching ktx-daemon introspection changes, docs, and tests.
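A rough sketch of the two ideas above: per-object failure isolation and flexible `enabled_tables` matching. `ObjectRef`, `introspectAll`, `matchesEnabledEntry`, and `selectEnabled` are hypothetical names (the commit's own helper is `tryIntrospectObject`), and exact, case-sensitive matching is an assumption.

```typescript
type ObjectRef = { catalog: string; db: string; name: string };
type ScanOutcome<T> = { ok: T[]; failed: { ref: ObjectRef; reason: string }[] };

// Introspect each object on its own so one failure is recorded instead of aborting the scan.
function introspectAll<T>(refs: ObjectRef[], introspect: (ref: ObjectRef) => T): ScanOutcome<T> {
  const outcome: ScanOutcome<T> = { ok: [], failed: [] };
  for (const ref of refs) {
    try {
      outcome.ok.push(introspect(ref));
    } catch (error) {
      const reason = error instanceof Error ? error.message : String(error);
      outcome.failed.push({ ref, reason });
    }
  }
  return outcome;
}

// Match an entry written as catalog.db.name, db.name, or a bare name (compared from the right).
function matchesEnabledEntry(entry: string, ref: ObjectRef): boolean {
  const parts = entry.split(".");
  const full = [ref.catalog, ref.db, ref.name];
  if (parts.length > full.length) return false;
  const tail = full.slice(full.length - parts.length);
  return parts.every((part, i) => part === tail[i]);
}

// Keep only enabled objects; an entry that matches nothing is a clear error, not a silent skip.
function selectEnabled(entries: string[], refs: ObjectRef[]): ObjectRef[] {
  for (const entry of entries) {
    if (!refs.some((ref) => matchesEnabledEntry(entry, ref))) {
      throw new Error(`enabled_tables entry "${entry}" matched no scanned object`);
    }
  }
  return refs.filter((ref) => entries.some((entry) => matchesEnabledEntry(entry, ref)));
}
```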
* docs(spider2-specs): add 06-scan-tolerate-broken-objects spec
* feat(cli): generalize analytics fan-out rule to multi-hop join chains
The ktx-analytics skill's fan-out rule only reliably caught single-hop
inflation; agents still silently fanned out on multi-hop chains where the
offending one-to-many join sits several hops below the SUM/COUNT and is easy
to miss.
Rewrite the Composition rule so the danger reads as cumulative across the whole
chain (pre-aggregate per measure-owning table), add an affirmative
grain-verification habit (default: pre-aggregate to grain; escape hatch:
COUNT(DISTINCT key) for pure counts only; SUM/AVG of a fanned-out measure must
pre-aggregate), and add one generic wrong-vs-right worked example. Content-only
and dialect-agnostic; no new tool, flag, or config.
Implements spider2-specs/specs/09 and annotates spec 07's one-example
constraint as superseded.
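To make the fan-out arithmetic concrete, here is a small in-memory illustration (not the skill's own worked example, which is SQL). The `Order`/`Shipment` shapes and function names are made up; the point is only that summing a measure after a one-to-many join counts it once per joined row.

```typescript
type Order = { orderId: number; amount: number };
type Shipment = { orderId: number; shipmentId: number };

// Wrong: join orders to shipments (one-to-many), then SUM(amount) over the joined rows.
function revenueAfterJoin(orders: Order[], shipments: Shipment[]): number {
  let total = 0;
  for (const order of orders) {
    for (const shipment of shipments) {
      if (shipment.orderId === order.orderId) total += order.amount; // one add per joined row
    }
  }
  return total;
}

// Right: pre-aggregate the "many" side to one row per order, then join and sum.
function revenuePreAggregated(orders: Order[], shipments: Shipment[]): number {
  const shipmentCounts = new Map<number, number>();
  for (const shipment of shipments) {
    shipmentCounts.set(shipment.orderId, (shipmentCounts.get(shipment.orderId) ?? 0) + 1);
  }
  let total = 0;
  for (const order of orders) {
    if (shipmentCounts.has(order.orderId)) total += order.amount; // exactly once per order
  }
  return total;
}

const demoOrders: Order[] = [
  { orderId: 1, amount: 100 },
  { orderId: 2, amount: 50 },
];
const demoShipments: Shipment[] = [
  { orderId: 1, shipmentId: 10 },
  { orderId: 1, shipmentId: 11 },
  { orderId: 2, shipmentId: 12 },
];
// revenueAfterJoin(demoOrders, demoShipments) is 250 (order 1 counted twice);
// revenuePreAggregated(demoOrders, demoShipments) is 150.
```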
* feat(cli): add panel-completeness, time-series window, and text-encoded numeric SQL craft
Extend the analytics skill's <sql_craft> with three correctness habits and
route the dialect-specific halves through sql_dialect_notes:
- Panel completeness (spec 10): full-domain spine -> LEFT JOIN -> COALESCE for
"each/every/all/per" questions, defaulted by measure additivity.
- Time-series windows (spec 11): explicit cumulative frames, calendar-range
rolling windows with minimum-periods guards, and period-over-period via LAG.
- Text-encoded numerics (spec 12): sample distinct values, strip/scale/cast in
one early CTE, and confirm coverage with a failure-detecting cast.
Add per-dialect Series, Rolling window, and Safe cast notes to all seven
dialect files so the skill stays dialect-agnostic while the engine-specific
syntax lives in sql_dialect_notes. Tests updated and passing (19).
* docs(spider2-specs): add specs 10-12 for analytics SQL-craft additions
Refined specs and completion records for the panel-completeness spine (10),
time-series window recipes (11), and text-encoded numeric parsing (12)
implemented in the preceding commit.
* docs(spider2-specs): add backlog intake drafts 13-14
- 13: canonical authoritative-source measures
- 14: output-completeness final check
* skill(analytics): spec 14 output-completeness + iter1 (active column planning)
Bundles two changes (entangled in SKILL.md; future spider2 iterations land as
separate commits):
- spec 14 (output-completeness): multi-part "answer every requested output" rule
+ a "Final completeness check" in workflow Step 6 and <sql_craft>; analytics
skill-content test updated; intake draft -> done/, refined spec added.
- iter1 experiment: spec 14's passive end-check did not change behavior on the
benchmark's output-completeness failures, so (a) the Plan step now writes the
exact output-column list UP FRONT as a contract the final SELECT must match,
and (b) "expose identity" -> "project BOTH the entity id and its name" (covers
both omission directions). All generic craft.
Driven by the Spider 2.0-Lite failure analysis (incomplete output was the
largest failure bucket); benchmark only as motivation.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* skill(analytics): iter2 — deterministic order in string/array aggregation
GROUP_CONCAT/string_agg/array_agg element order is undefined without an explicit
ORDER BY. Also note that SQLite's default text sort is binary and case-sensitive
(uppercase before lowercase); COLLATE NOCASE gives case-insensitive order. Generic SQLite craft.
Spider 2.0-Lite motivation: an ordered-ingredient-list question failed only on the
within-string element order (right elements, wrong order); benchmark as motivation only.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(mcp): structured, leveled logging for the MCP server
Add one synchronous pino logger per MCP server process, written through the
io.stderr sink: plain JSON when stderr is not a TTY, colorized pino-pretty
(sync, in-process) when it is. Every tool call logs tool.start with its raw
params BEFORE the handler runs and tool.end after (info / warn past
KTX_MCP_SLOW_TOOL_MS / error), correlated by callId plus sessionId, so a
runaway sql_execution leaves a recoverable start line with its exact SQL and
no matching end. HTTP logs session.open/close and wires the previously-dead
transport.onerror to transport.error; stdio routes its transport error
through the logger. Level via KTX_MCP_LOG_LEVEL (default info). Existing
mcp_request_completed telemetry and registerParsedTool are unchanged; no
worker/async transport and no redaction in v1 (logs are local-only).
Implements spider2-specs/specs/15-mcp-server-structured-logging.md and moves
the intake draft to done/.
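The start-before-handler pattern above can be sketched without pino. Everything here is illustrative: `LogLine`, `runLoggedTool`, the in-memory sink, and the synchronous handler are assumptions; the real server logs through a pino logger and its handlers are presumably asynchronous.

```typescript
type LogLine = {
  level: "info" | "warn" | "error";
  event: "tool.start" | "tool.end";
  callId: string;
  sessionId: string;
  tool: string;
  params?: unknown;
  durationMs?: number;
};

type ToolCall = {
  callId: string;
  sessionId: string;
  tool: string;
  params: unknown;
  slowMs: number; // plays the role of KTX_MCP_SLOW_TOOL_MS
  now: () => number; // injectable clock
};

function runLoggedTool<T>(sink: LogLine[], call: ToolCall, handler: () => T): T {
  const base = { callId: call.callId, sessionId: call.sessionId, tool: call.tool };
  // Log raw params BEFORE the handler runs, so a hung call still leaves a start line.
  sink.push({ ...base, level: "info", event: "tool.start", params: call.params });
  const startedAt = call.now();
  try {
    const result = handler();
    const durationMs = call.now() - startedAt;
    const level = durationMs > call.slowMs ? "warn" : "info";
    sink.push({ ...base, level, event: "tool.end", durationMs });
    return result;
  } catch (error) {
    sink.push({ ...base, level: "error", event: "tool.end", durationMs: call.now() - startedAt });
    throw error;
  }
}
```

A call whose handler never returns produces a `tool.start` with no matching `tool.end` for that `callId`, which is the signature the commit relies on for finding runaway queries.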
* feat(mcp): report uptimeMs in MCP server /health
The /health endpoint now includes uptimeMs (monotonic elapsed time since
the server started), mirroring the Python daemon's uptime_ms telemetry
field.
* feat(cli): bound read-query execution with a per-connection deadline
Enforce one shared query deadline (default 30s, overridable per connection via
query_timeout_ms) on every executeReadOnly path, so an accidentally-expensive
LLM-authored query returns a fast "query exceeded Ns" KtxQueryError instead of
hanging the MCP server.
- New shared contract context/connections/query-deadline.ts
(resolveQueryDeadlineMs, queryDeadlineExceededError); query_timeout_ms added to
the shared warehouse schema; BigQuery's job_timeout_ms removed.
- SQLite runs the read query in a short-lived forked child process and enforces
the deadline with SIGKILL. worker_threads + terminate() was tried first but
cannot interrupt a synchronous better-sqlite3 scan (the native loop never
yields); SIGKILL reclaims the process in ~2ms and keeps the event loop free.
- Remote connectors apply a real server-side statement timeout and re-wrap their
own timeout signal as KtxQueryError: Postgres statement_timeout/57014, MySQL
max_execution_time/3024, Snowflake STATEMENT_TIMEOUT_IN_SECONDS/604, ClickHouse
max_execution_time + aligned request_timeout/159, SQL Server requestTimeout/
ETIMEOUT, BigQuery jobTimeoutMs.
- Relationship validation skips a candidate to review on a deadline timeout
instead of aborting the pass; the deadline surfaces through the existing MCP
pino logger as a matched tool.start/tool.end(error) pair (no new logging code).
Also fixes a pre-existing, unrelated invalid cast in mcp-server-factory.test.ts
that was breaking tsc -p tsconfig.test.json.
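A hedged sketch of the shared deadline contract. The function names `resolveQueryDeadlineMs` and `queryDeadlineExceededError` appear in this message, but their signatures here are guesses; `QueryDeadlineError` stands in for KtxQueryError, and `rewrapTimeout` and the driver keys are hypothetical. The timeout codes are the ones listed above.

```typescript
const DEFAULT_QUERY_DEADLINE_MS = 30000;

// Assumed signature: per-connection query_timeout_ms overrides the 30s default.
function resolveQueryDeadlineMs(connection: { query_timeout_ms?: number }): number {
  const configured = connection.query_timeout_ms;
  if (configured !== undefined && Number.isFinite(configured) && configured > 0) {
    return configured;
  }
  return DEFAULT_QUERY_DEADLINE_MS;
}

// Stand-in for KtxQueryError.
class QueryDeadlineError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "QueryDeadlineError";
  }
}

function queryDeadlineExceededError(deadlineMs: number): QueryDeadlineError {
  return new QueryDeadlineError(`query exceeded ${deadlineMs / 1000}s`);
}

// Driver timeout signals from the commit text; the map keys are illustrative.
const TIMEOUT_CODES: Record<string, string> = {
  postgres: "57014",
  mysql: "3024",
  snowflake: "604",
  clickhouse: "159",
  sqlserver: "ETIMEOUT",
};

// Re-wrap a driver's own timeout signal as the shared error; leave other errors alone.
function rewrapTimeout(
  driver: string,
  error: { code?: string | number },
  deadlineMs: number,
): QueryDeadlineError | undefined {
  const expected = TIMEOUT_CODES[driver];
  if (expected !== undefined && String(error.code) === expected) {
    return queryDeadlineExceededError(deadlineMs);
  }
  return undefined;
}
```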
* docs(spider2-specs): mark spec 16 (bounded query execution) done
Append Implementation notes to the refined spec (what shipped, where, and the
worker-thread -> child-process+SIGKILL deviation with its evidence) and move the
intake draft from todo/ to done/.
* skill(analytics): iter3 — measure-as-amount, inter-event gap, top-per-metric career
Three generic interpretation rules: a named business measure (sales/revenue/spend)
means its amount not a row count; "inter-event duration/gap" is LAG/LEAD time-between
events not a magnitude column; "highest across several achievements" aggregates per
metric over the whole history. All three demonstrably FIRE (verified on local008/003/152
SQL). local008 flips to correct (mechanism-aligned). 003/152 still fail on a different
axis (source-column / grouping). Generic craft; benchmark only as motivation.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* skill(analytics): spine-for-extreme-selection + aggregate-over-selected-set
Two generic answer-completeness refinements:
- Selecting the extreme group (lowest/highest count over a period/category
domain) must rank over the COMPLETE spine, not only groups with fact rows —
an empty period is a genuine 0 and often the true minimum.
- An aggregate scoped to a per-entity selected set ('avg revenue per actor in
those top-3 films') is computed ACROSS that set, distinct from the per-item
value; project both.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* skill(analytics): iter2 — sharpen extreme-selection spine + top-N ranking-measure
- spine-for-extreme: concrete cue that a zero-row period never appears in a
GROUP BY of the facts; generate the full calendar, LEFT JOIN, COALESCE, then rank.
- aggregate-over-selected-set: top-N selection ranks by the named ranking measure
(the item's own revenue), independent of the per-item share that feeds the aggregate.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* skill(analytics): iter3 — comparison-between-two-extremes is one wide row
Distinguishes a cross-item comparison ('the difference between the highest and
lowest month' -> single wide row, both extremes side by side + the comparison
column) from 'report a metric for each group' (-> stays long). Generic, question-
derived; targets the wide-vs-long shape gap without affecting per-group long output.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* skill(analytics): iter4 — anchor a period bucket to the named lifecycle event
When a record carries multiple lifecycle timestamps (created/placed, approved,
shipped, delivered, completed, settled) and the question counts/measures records
in a named *completed state* by period ("delivered orders by month", "shipped
items per week"), bucket the period by that named event's own timestamp, not the
record-creation timestamp; the state value is the qualifying filter, the matching
timestamp is the time anchor. Wording priority is explicit — purchased/placed/
created/submitted/ordered keep the start-event timestamp — and a non-temporal
state filter (counts by customer/city/seller with no period) introduces no anchor.
Generic analytics craft: counting completed-state records by their creation date
silently answers "records that later reached that state, grouped by when they
started" instead of the question asked. Surfaced via the spider2-autofix loop;
FAIR_PRODUCT (adversary-screened, restatable from question wording + schema/
semantic-layer lifecycle descriptions, no gold dependency).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* skill(analytics): iter5 — canonicalize observed URL-path variants before page-level analysis
When a question groups/filters/sequences web pages by a path/url column, sample
its distinct values; if the data itself shows /route and /route/ variants for the
same page context, canonicalize in an early CTE (preserve / as root, strip trailing
slashes from non-root paths, map an observed empty path to / only when the column is
a URL path with blank root-page events) and use the canonical path everywhere above.
Explicitly forbids inventing aliases the data doesn't show: no merging different
route names, no stripping query/fragment/host/scheme, no lowercasing, and no
canonicalization when the question asks for raw URL/path or slash-vs-no-slash diffs.
Generic web-analytics craft: raw request logs routinely store the same user-visible
page with and without a trailing slash, so grouping raw labels silently splits one
page into several. Surfaced via the spider2-autofix loop (Codex runner, round r2);
FAIR_PRODUCT (adversary-screened, restatable from URL-path semantics + page-grain
question wording + solver-observed distinct values, no gold dependency). The rule
fired mechanism-aligned on both targets: it flipped local330 (landing/exit page counts), while the
local331 residual is a separate sequence-semantics axis beyond canonicalization.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
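The canonicalization rule from the iter5 commit above, restated as a plain function for illustration. In the skill it is expressed as an early SQL CTE; `canonicalizePath` and its `treatEmptyAsRoot` flag are hypothetical, and collapsing an all-slash path to `/` is a choice made here, not something the message specifies.

```typescript
// treatEmptyAsRoot should be true only when the column is a URL path
// and the data shows blank root-page events.
function canonicalizePath(path: string, treatEmptyAsRoot = false): string {
  if (path === "") return treatEmptyAsRoot ? "/" : path;
  if (path === "/") return path; // preserve root
  const stripped = path.replace(/\/+$/, ""); // strip trailing slashes from non-root paths
  return stripped === "" ? "/" : stripped; // an all-slash path collapses to root
}
```

Case, query strings, fragments, hosts, and schemes are deliberately left alone, matching the "no invented aliases" guard.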
* skill(analytics): iter6 — coverage over a selected group is a set-membership aggregate
When a question first selects a group of entities ("the top 5 actors", "these
products") and then asks what count/share/percentage of a DIFFERENT subject domain
relates to *these* selected entities ("what % of customers rented films featuring
these actors"), the subject set is the UNION across the whole group: count DISTINCT
subject ids once across the selected entities and return one collective value at the
subject-domain grain — not one row per selected entity (which double-counts subjects
related to more than one entity and answers a different question). Narrowly guarded:
emit one row per entity only when the wording says "for each / per / by / list" or
asks for each entity's own metric ("top 5 players and their batting averages").
The collective-coverage cousin of the existing per-entity selected-set rule. Generic
analytics craft (per-entity metric vs set-level coverage). Surfaced via the
spider2-autofix loop (Codex runner, round r3); FAIR_PRODUCT (adversary-screened,
restatable from wording alone, no gold dependency). Flipped local195 mechanism-aligned
(union COUNT(DISTINCT customer)/total, one scalar); 0 regression across 5 passing
per-entity top-N guards (local023/024/029/212/221 stayed long).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* skill(analytics): label-only joins must LEFT JOIN — incomplete dims silently drop fact rows
Mirror of the existing fan-out rule for the DROP direction: an inner JOIN to a
dimension table used only to attach a display attribute silently discards every
fact row whose key has no parent when the dimension is incomplete (trimmed
catalogs, late-arriving / SCD-gap rows), shrinking counts/sums and the universe
over which shares/averages/medians are computed. Guidance: LEFT JOIN pure
enrichment; inner-join a dimension only when intended as a filter; key the
aggregate/GROUP BY on the fact column, not the dimension column.
Spider2 autofix round 'joindim': flips complex_oracle local050 (FAIL->PASS,
official scorer) — solver dropped the gratuitous products inner-join and
recovered the exact gold. local060/063 also adopt LEFT JOIN (rule fires) but
remain gold-convention-blocked. Guards local061/067 held.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* docs(spider2-specs): add todo/17 — lifecycle-event metrics (semantic-layer)
Draft intake spec surfaced by the spider2-autofix loop (round r1): the model-layer
form of the shipped iter4 lifecycle-date-anchoring skill rule — infer per-state
lifecycle-event metrics (e.g. delivered_orders with defaultTimeDimension = the
delivery timestamp) during enrichment so the correct time anchor is the default for
any consumer, not only an agent that loaded the skill. Generic; FAIR_PRODUCT.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* fix(connectors): accept leading underscore in connection/identifier ids
The safe-identifier validator regex /^[a-zA-Z0-9][a-zA-Z0-9_-]*$/ allowed an
underscore everywhere except the first character, so a connection id / database
name that legitimately starts with '_' (valid in Snowflake, e.g. _1000_GENOMES)
could never be ingested or queried. Allow a leading underscore across all 16
duplicated validators (connection ids, source ids, page/wiki keys, warehouse-
verification tool schemas). Path-safety is unaffected — '.' and '/' remain
excluded, and assertSafePathToken still blocks traversal.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
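A quick check of the validator change described in the fix above. `OLD_SAFE_ID` is the pattern quoted in the message; `NEW_SAFE_ID` is an assumed shape of the fix (adding `_` to the first-character class), since the updated regex itself is not shown.

```typescript
// Pattern quoted in the commit message: underscore allowed everywhere except position 0.
const OLD_SAFE_ID = /^[a-zA-Z0-9][a-zA-Z0-9_-]*$/;

// Assumed fix: also accept "_" as the first character. "." and "/" stay excluded.
const NEW_SAFE_ID = /^[a-zA-Z0-9_][a-zA-Z0-9_-]*$/;

const snowflakeDb = "_1000_GENOMES";
const oldAccepts = OLD_SAFE_ID.test(snowflakeDb); // false: rejected by the old validator
const newAccepts = NEW_SAFE_ID.test(snowflakeDb); // true: accepted after the change
const traversalBlocked = !NEW_SAFE_ID.test("../etc"); // true: path characters still rejected
```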
* feat(analytics): generic geospatial query guidance
Add a Snowflake ST_* dialect note (ST_MAKEPOINT lon-first, ST_DWITHIN/ST_CONTAINS/
ST_WITHIN/ST_INTERSECTS, bbox->polygon via ST_MAKEPOLYGON/ST_MAKELINE) and a
dialect-agnostic 'Spatial predicates' recipe in the analytics skill (resolve the
entity geometry, build an area-of-interest polygon, test with the engine's
containment/proximity/overlap predicate; mind lon/lat argument order). Steers the
solver off hand-rolled lat/lon BETWEEN boxes toward correct, index-assisted
geospatial predicates.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* feat(analytics): parse code/dependency text by language grammar
Add two generic <sql_craft> rules: (1) parse imported/required/loaded packages by
the language or manifest format (Java import keep-package-path allowing underscores/
mixed-case; Python import/from + alias stripping; R library/require; .ipynb parse
JSON cell source before language rules; JSON manifests flatten the dependency object
keys), stripping comments/prose and splitting multi-import lines; (2) on a
de-duplicated table with a documented copy/occurrence count, choose COUNT(*) vs the
weight column from the population the question names, not silently. Steers off one
broad regex that drops valid identifiers and matches prose.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* feat(analytics): source filters/dates/measures from the owning fact grain
Add a <sql_craft> rule for joined fact tables at different grains (parent order
vs child line item): read each predicate, calendar bucket, and measure from the
table whose grain the question names, not whichever is in scope post-join. An
order-grain filter ("orders that are Complete", "the order's creation date")
must come from the parent even though the child carries its own status/created_at;
line price/cost come from the child. Mirror at metric grain: don't combine a
parent-grain count with child rows (num_of_item * SUM(line_price) per line) —
aggregate each measure at its own grain before combining.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* feat(analytics): collapse multi-valued classes to one representative per entity before counting/concentration
When an entity carries a multi-valued classification array (IPC/CPC codes, tags)
and the methodology counts entities-per-class or a concentration/diversity metric
(HHI, originality, share), pick ONE representative per entity first (the array's
main/primary/first flag, else a defined fallback like most-frequent), then
aggregate; and use COUNT(DISTINCT entity) when the denominator is defined as a
count of entities. Unnesting the array otherwise multiplies an entity's weight by
its code count, inflating per-class frequencies and skewing the ranking/score.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* feat(connectors): introspect BigQuery datasets hosted in foreign projects
A dataset_ids/dataset_id entry may now be written `project.dataset` to
introspect a dataset hosted in another project while query jobs still bill to
credentials.project_id. Entries are parsed once at the config boundary into
canonical {project, dataset} pairs; introspection, primary-key discovery,
testConnection, getTableRowCount, and listTables (grouped per project) all
resolve in the dataset's own project, and scanned tables are labeled with that
project so sampling, distinct-value, and read queries resolve. Bare entries are
unchanged.
Implements spider2-specs/specs/18-bigquery-cross-project-datasets.md.
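A sketch of parsing dataset entries once at the config boundary. `DatasetRef`, `parseDatasetEntry`, and `groupByProject` are hypothetical names, and resolving a bare entry against the billing project (`credentials.project_id`) is an assumption based on "bare entries are unchanged".

```typescript
type DatasetRef = { project: string; dataset: string };

// "dataset" resolves in the billing project; "project.dataset" resolves in its own project.
function parseDatasetEntry(entry: string, billingProject: string): DatasetRef {
  const [first = "", second, ...rest] = entry.trim().split(".");
  if (first === "" || second === "" || rest.length > 0) {
    throw new Error(`invalid dataset entry "${entry}": expected dataset or project.dataset`);
  }
  if (second === undefined) return { project: billingProject, dataset: first };
  return { project: first, dataset: second };
}

// Group canonical pairs per project, e.g. to issue one listing call per project.
function groupByProject(refs: DatasetRef[]): Map<string, string[]> {
  const grouped = new Map<string, string[]>();
  for (const ref of refs) {
    const datasets = grouped.get(ref.project) ?? [];
    datasets.push(ref.dataset);
    grouped.set(ref.project, datasets);
  }
  return grouped;
}
```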
* feat(scan): durable, resumable, bounded relationship detection during enrichment
Move the enrichment persistence boundary to the cost boundary and bound the
open-ended relationship stage (spec 19).
- Checkpoint descriptions + embeddings into the queryable `_schema` manifest
(and the raw enrichment artifacts) before relationship detection runs, via a
new `onCheckpoint` hook + `writeLocalScanEnrichmentCheckpoint`. An interrupted,
budget-truncated, or failed relationship stage now degrades to "no joins",
never "no descriptions".
- Resume the enrichment cache by content identity: re-key the SQLite stage store
on `(connection_id, stage, input_hash)` so a re-run with a fresh runId resumes
finished descriptions/embeddings instead of re-paying for LLM work. The
disposable cache recreates its table if the on-disk key shape differs.
- Make the relationship stage observable and bounded: a sticky wall-clock budget
(`scan.relationships.detectionBudgetMs`, default 600000 ms) + per-unit progress
+ honored `ctx.signal`, threaded through profiling, validation, and composite
detection. On exhaustion/abort it stops scheduling, finalizes, and returns a
partial result instead of throwing or hanging.
- Mark a budget/abort-truncated result partial (diagnostics `partial`/`partialReason`
+ recoverable `relationship_detection_partial` warning). A graceful partial saves
as a completed stage and resumes cheaply; raising the budget changes inputHash
and forces a fresh, fuller run. A process killed mid-stage saves nothing.
Document `detectionBudgetMs` in the ktx.yaml reference. Append implementation
notes to specs/19 and move the intake draft to done/.
Also carries the in-tree per-table enrichment LLM timeout work it builds on
(`description-generation.ts` + the `enrichment_timeout` warning code), which is
intertwined in `local-enrichment.ts`/`types.ts` and cannot be split into a
separately-building commit.
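The "stop scheduling, finalize, return partial" behavior can be sketched as a budgeted loop. This is an illustration only: `detectWithBudget`, the injected clock, the `isAborted` callback (standing in for `ctx.signal`), and the `partialReason` values are assumptions.

```typescript
type DetectionResult<T> = {
  items: T[];
  partial: boolean;
  partialReason?: "budget_exhausted" | "aborted";
};

type BudgetOptions = {
  budgetMs: number; // e.g. detectionBudgetMs, default 600000 ms per the commit
  now: () => number; // injectable clock
  isAborted?: () => boolean; // stand-in for ctx.signal.aborted
};

// The deadline is fixed once at the start, so exhaustion is sticky for the whole stage.
function detectWithBudget<U, T>(
  units: U[],
  detect: (unit: U) => T,
  options: BudgetOptions,
): DetectionResult<T> {
  const deadline = options.now() + options.budgetMs;
  const items: T[] = [];
  for (const unit of units) {
    if (options.isAborted?.()) return { items, partial: true, partialReason: "aborted" };
    if (options.now() >= deadline) {
      return { items, partial: true, partialReason: "budget_exhausted" };
    }
    items.push(detect(unit)); // a unit already started is allowed to finish
  }
  return { items, partial: false };
}
```

Returning the finished items instead of throwing is what lets a truncated stage still be saved and resumed.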
* feat(scan): bound + retry the per-table enrichment LLM call
The batched table-description call had no retry (sampleTable retried 3x, this did
not), so a single transient backend error (e.g. an overloaded/burst rejection when
many tables enrich concurrently) silently nulled a whole table's descriptions —
observed dropping ~70% of a db's tables during a bad window despite ample quota.
- Wrap generateObject in retryAsync (3 attempts + backoff; KTX_ENRICH_LLM_ATTEMPTS).
- Fresh per-attempt timeout (KTX_ENRICH_LLM_TIMEOUT_MS, default 120s) still bounds a
wedged wide table; a timeout is surfaced as KtxAbortedError so it is NOT retried
(one wedge stays one timeout, not 3x).
- Granular per-table progress + start/done/retry/timeout logging.
Composes with spec 19 (its non-goal #1): spec 19 makes completed descriptions durable;
this makes more of them complete.
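The retry policy above (retry transient errors, never retry a timeout) can be sketched synchronously. This is a simplification under assumptions: the real call is asynchronous with backoff and a fresh per-attempt timer, and `retrySync`, `abortedError`, and the `aborted` marker are hypothetical stand-ins for retryAsync and KtxAbortedError.

```typescript
// Build an error tagged as an abort/timeout (stand-in for KtxAbortedError).
function abortedError(message: string): Error {
  return Object.assign(new Error(message), { aborted: true });
}

function isAborted(error: unknown): boolean {
  return (
    typeof error === "object" &&
    error !== null &&
    (error as { aborted?: unknown }).aborted === true
  );
}

// Retry transient failures up to `attempts` times; rethrow an abort immediately
// so one wedged call costs one timeout, not three.
function retrySync<T>(attempts: number, run: (attempt: number) => T): T {
  const maxAttempts = Math.max(1, attempts);
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return run(attempt);
    } catch (error) {
      if (isAborted(error)) throw error;
      lastError = error;
    }
  }
  throw lastError;
}
```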
* feat(scan): survive a hung LLM enrichment backend and resume descriptions
Two compounding failure modes on the per-table description-enrichment path (spec 20):
- Enforced per-table timeout for subprocess backends: the runtime declares whether it owns an SDK subprocess (subprocessForkSpec on KtxLlmRuntimePort); codex/claude-code calls run behind a ktx-owned detached child that is tree-killed (SIGKILL of the process group on POSIX, taskkill /T on Windows) on the deadline or ctx.signal, reaping the wedged model grandchild. HTTP backends keep native fetch abort. Default stays 120s, one-wedge-one-timeout.
- Incremental, resumable descriptions persistence: generateDescriptions flushes enriched tables per batch to an inputHash-tagged durable record (at a stable, non-syncId path) plus only the changed manifest shards, skips already-enriched tables on resume, and never lets one table's failure discard the stage (a skipped table costs one missing description, not the whole stage's output).
Spec 20 refined + intake draft moved to done/.
* feat(scan): selective enrichment stages (--stages) + per-stage cache keys
Split the single coarse enrichment cache key into per-stage hashes
(descriptions <- snapshot + LLM identity; embeddings <- snapshot + embedding
identity + description digest; relationships <- snapshot + relationship settings
+ LLM identity), so changing one stage's inputs invalidates only that stage and
never throws away the expensive per-table descriptions on an unrelated edit.
Add `ktx ingest --stages <list>` to force-re-run a chosen subset on an
already-ingested connection: a named stage bypasses the completed-stage
short-circuit while the per-table descriptions resume record still skips
already-enriched tables, and unselected stages are left untouched on disk. Feed
embeddings + relationships their description context from the on-disk _schema
when descriptions do not run this invocation, and carry descriptions into the
llmProposals evidence packet (closing a latent gap on the full-run path too).
Surface an enrichment_stage_stale warning when an unselected stage's inputs have
drifted, rather than silently cascading the work.
Implements spider2-specs/specs/21-selective-enrichment-stages.md.
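A sketch of the per-stage key split, to show why an edit to one stage's inputs leaves the others valid. The input composition follows the message; `StageInputs`, `stageHashes`, and the FNV-1a-style hash are illustrative assumptions (the real digest function is not stated).

```typescript
// Small non-cryptographic hash (FNV-1a style over UTF-16 code units), for illustration only.
function fnv1a(text: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash.toString(16);
}

type StageInputs = {
  snapshot: string;
  llmIdentity: string;
  embeddingIdentity: string;
  descriptionDigest: string;
  relationshipSettings: string;
};

// Each stage hashes only the inputs it actually depends on.
function stageHashes(inputs: StageInputs): Record<"descriptions" | "embeddings" | "relationships", string> {
  return {
    descriptions: fnv1a(["descriptions", inputs.snapshot, inputs.llmIdentity].join("|")),
    embeddings: fnv1a(
      ["embeddings", inputs.snapshot, inputs.embeddingIdentity, inputs.descriptionDigest].join("|"),
    ),
    relationships: fnv1a(
      ["relationships", inputs.snapshot, inputs.relationshipSettings, inputs.llmIdentity].join("|"),
    ),
  };
}
```

With this split, raising the relationship budget changes only the relationships key, so finished descriptions and embeddings are still found in the cache.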
* test(analytics): realign SKILL.md acceptance test with the evolved skill
Three assertions in analytics-skill-content.test.ts drifted from the analytics
SKILL.md as later iterations edited the skill without updating the test:
- the sub-heading was renamed Window functions -> Ordering & aggregation
determinism (iter2), so follow the source name;
- the rule "Expose identity, not just the label" was renamed to "Project BOTH
identity and label" (spec 14), so match the new wording;
- the dialect-FQTN guard false-positived on the Java package example
com.planet_ink.coffee_mud, whose backticks made a 3-segment package path read
as a BigQuery/Snowflake `a.b.c` table reference. Drop the backticks so the
guard stays at full strength without weakening it.
* fix(scan): --stages subset must not delete unselected stages' on-disk artifacts
A --stages subset that omitted descriptions wiped all on-disk ai/db descriptions
from the written _schema. runLocalScan writes the structural manifest shard from
the bare snapshot BEFORE enrichment runs, and the shard merge treats ai/db as
scan-managed and overwrites them with whatever the run emits — none, on a subset
that skips descriptions. Enrichment then read the already-wiped shard via
loadPriorDescriptions and had nothing to restore.
runLocalScanEnrichment now returns the best-available descriptions (fresh-this-run
if descriptions ran, else loaded from the on-disk _schema) instead of [], and
runLocalScan captures the prior descriptions before the structural write and feeds
them to both the structural write and enrichment, so an unselected stage's
artifacts survive. Joins were already preserved for --stages descriptions via the
manual/inferred preservedJoins path.
Tests: a full runLocalScan --stages relationships path test (RED without the fix,
GREEN with it — the earlier unit test missed the structural-pre-write ordering),
plus enrichment-layer contract tests for both directions. Validated live on
northwind: --stages relationships keeps all 110 descriptions + 22 joins (was
wiping to 0); --stages descriptions restores descriptions from the spec-20 resume
record (no LLM calls) while keeping joins.
* feat(dialects): bigquery nested-data (ARRAY/STRUCT/UNNEST), geospatial (GEOGRAPHY), SAFE_DIVIDE
bigquery.md lacked the two sections that define BigQuery analytics (present in snowflake.md):
- Nested & repeated data: UNNEST to flatten arrays of STRUCTs (GA360 hits, GA4 event_params),
dot-notation field access, key-value param scalar-subquery extraction, fan-out/COUNT(DISTINCT) guard.
- Geospatial (GEOGRAPHY): ST_GEOGPOINT (lon-first), containment/proximity/distance/intersection
predicates, areal allocation via ST_AREA(ST_INTERSECTION()).
- SAFE_DIVIDE for zero-denominator-safe rates; sharded-table shard-presence note.
Generic BigQuery craft surfaced by sql_dialect_notes; product-completeness (any BQ analyst benefits).
* feat(dialects): sqlite ROUND half-up FP-underflow note (+1e-9 before ROUND)
SQLite ROUND(x,n) rounds half away from zero, but binary floating point often
stores a decimal half-way value slightly below the true half (6.475 is held as
6.47499...), so ROUND(6.475,2) returns 6.47, not 6.48. Add a dialect note: add a
tiny epsilon (1e-9, well below display precision) before rounding for
deterministic half-up, leaving non-boundary values unchanged.
Generic SQLite craft surfaced by sql_dialect_notes (any analyst rounding a
displayed average/rate/price benefits).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* docs(analytics): list-as-delimited-string, answer-literally, drop free-text columns
Add SKILL.md guidance to emit list-valued answer cells as delimited
STRING (not ARRAY/repeated column), answer the literal ask without
unrequested transformations (HAVING for aggregate bounds), and avoid
projecting unrequested free-text columns that corrupt row-delimited output.
* fix(scan,mcp): gitignore runtime logs, budget-guard LLM proposal, validate enrich timeout
- gitignore `.ktx/logs/` in both scaffold + setup-merge lists: the managed MCP
daemon writes raw tool params (SQL, memory_ingest content) to mcp.log under a
version-controlled `.ktx/`, and snowflake.log already sat there unprotected.
- gate the LLM relationship proposal on the detection budget/abort signal so an
exhausted or aborted stage cannot start a fresh LLM call; document the boundary.
- validate KTX_ENRICH_LLM_TIMEOUT_MS (NaN/0 → 120s default) like enrichAttempts,
so a bad value no longer times out every table immediately.
- daemon introspection now warns on malformed column/FK rows instead of dropping
them silently, matching the table-row path and the "surface broken objects" goal.
- docs: document `ktx wiki -c/--connection`; fix the SQLite query-deadline schema
doc (forked-subprocess SIGKILL, not worker-thread termination).
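The timeout validation in the list above (NaN/0 falls back to the 120s default) can be sketched as follows. The function name `resolveTimeoutMs` is hypothetical, and treating negative or blank values as invalid is an assumption; the commit message only names NaN and 0.

```typescript
// Default taken from the commit message ("NaN/0 → 120s default").
const DEFAULT_ENRICH_LLM_TIMEOUT_MS = 120_000;

// Hypothetical validator: returns the parsed value only when it is a finite,
// positive number; anything else falls back to the default.
// Assumption: negative and blank values are also rejected.
function resolveTimeoutMs(raw: string | undefined, fallbackMs = DEFAULT_ENRICH_LLM_TIMEOUT_MS): number {
  if (raw === undefined || raw.trim() === '') return fallbackMs;
  const parsed = Number(raw);
  return Number.isFinite(parsed) && parsed > 0 ? parsed : fallbackMs;
}

console.log(resolveTimeoutMs('30000'), resolveTimeoutMs('abc'), resolveTimeoutMs('0'));
```

In a real process the input would be the raw `KTX_ENRICH_LLM_TIMEOUT_MS` environment value. Without the guard, a `NaN` or `0` deadline makes every per-table call time out immediately, which is the failure the fix describes.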
* fix(scan,wiki,mcp): address PR #312 review findings
- scan: key the description pipeline (resume map, enriched-schema and
embedding-text lookups, manifest write/read) by full table identity via
tableRefKey/buildTableRef, so two same-named tables in different schemas no
longer cross-assign descriptions or skip a sibling on resume
- scan: re-throw a genuine context cancel during the batched description LLM
call so Ctrl-C resumes the stage instead of nulling tables and recording it
completed; per-table timeouts still degrade (context.signal not aborted)
- scan: report statisticalValidation 'skipped' (not 'completed') when a
budget/abort stop leaves relationship profiling partial
- wiki: sync the full page corpus into the sqlite index and filter only the
candidate/result set, so a connection-scoped search no longer prunes other
connections' pages and cached embeddings from the shared index
- wiki: route verbatim ingest through the canonical writePageAndSync so
contentHash is set and later syncs can short-circuit
- mcp: drop the as-unknown-as cast in serializeMcpError
- dialects/analytics: document the integer-division trap on postgres/sqlite/tsql
Adds regression tests for each behavior change.
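The first scan fix above (keying the description pipeline by full table identity) can be illustrated with a small sketch. `fullTableKey` and the `TableRef` shape here are hypothetical stand-ins; the real `tableRefKey`/`buildTableRef` signatures are not shown in this commit message.

```typescript
// Hypothetical table identity; the real ktx type may differ.
interface TableRef {
  catalog: string | null;
  db: string | null;
  name: string;
}

// Hypothetical key function. JSON encoding avoids ambiguity when a part
// itself contains a delimiter such as ".".
function fullTableKey(ref: TableRef): string {
  return JSON.stringify([ref.catalog, ref.db, ref.name]);
}

const tables: TableRef[] = [
  { catalog: null, db: 'sales', name: 'orders' },
  { catalog: null, db: 'archive', name: 'orders' },
];

const byName = new Map<string, string>();
const byIdentity = new Map<string, string>();
for (const table of tables) {
  const description = `description for ${table.db}.${table.name}`;
  byName.set(table.name, description); // second table overwrites the first
  byIdentity.set(fullTableKey(table), description); // both tables survive
}

console.log(byName.size, byIdentity.size);
```

With name-only keys the second `orders` table silently replaces the first, which is the cross-assignment the fix removes.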
* fix(wiki): scope connection filter before SQLite lane limit
Connection-scoped wiki search applied the connectionId allowlist after
the lexical/semantic lanes had already truncated to laneCandidatePoolLimit
over the full (connection-agnostic) corpus. When the requested connection
was a minority of a large corpus, its pages were crowded out of the
candidate pool before filtering, so a semantic-only match could be missed
outright and lexical hits under-ranked.
Push the path allowlist into searchLexicalCandidates/searchSemanticCandidates
so LIMIT applies to in-scope rows, matching what the token lane already did,
and drop the now-redundant post-limit JS filters.
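The crowding-out effect described above can be reproduced with plain arrays. This is a minimal in-memory sketch, not the SQLite-backed lanes: `limitThenFilter` and `filterThenLimit` are hypothetical names, and the scores are invented for illustration.

```typescript
interface Candidate {
  path: string;
  score: number;
}

// Rank by score and keep the best `limit` rows, like ORDER BY ... LIMIT.
function topN(rows: Candidate[], limit: number): Candidate[] {
  return [...rows].sort((a, b) => b.score - a.score).slice(0, limit);
}

// Old behavior: truncate over the whole corpus, then apply the allowlist.
function limitThenFilter(rows: Candidate[], allowed: Set<string>, limit: number): Candidate[] {
  return topN(rows, limit).filter((row) => allowed.has(row.path));
}

// Fixed behavior: apply the allowlist first so LIMIT only counts in-scope rows.
function filterThenLimit(rows: Candidate[], allowed: Set<string>, limit: number): Candidate[] {
  return topN(rows.filter((row) => allowed.has(row.path)), limit);
}

const corpus: Candidate[] = [
  { path: 'big/a.md', score: 0.9 },
  { path: 'big/b.md', score: 0.8 },
  { path: 'big/c.md', score: 0.7 },
  { path: 'small/x.md', score: 0.4 },
];
const allowedPaths = new Set(['small/x.md']);

console.log(limitThenFilter(corpus, allowedPaths, 2).length); // 0
console.log(filterThenLimit(corpus, allowedPaths, 2).length); // 1
```

The minority connection's only page never reaches the candidate pool in the first variant, which matches the "missed outright" case in the commit message.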
---------
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
This commit is contained in:
parent
2afab61417
commit
f65a5b0e2e
200 changed files with 17780 additions and 672 deletions
117
packages/cli/test/commands/ingest-commands.test.ts
Normal file
@@ -0,0 +1,117 @@
import { Command } from '@commander-js/extra-typings';
import { describe, expect, it, vi } from 'vitest';
import type { KtxCliCommandContext } from '../../src/cli-program.js';
import { parseEnrichmentStagesOption, registerIngestCommands } from '../../src/commands/ingest-commands.js';

function makeContext(overrides: Partial<KtxCliCommandContext> = {}): KtxCliCommandContext {
  let exitCode = 0;
  return {
    io: {
      stdout: { write: vi.fn() },
      stderr: { write: vi.fn() },
    },
    deps: {},
    packageInfo: { name: '@kaelio/ktx', version: '0.0.0-test' },
    setExitCode: (code: number) => {
      exitCode = code;
    },
    runInit: vi.fn(),
    writeDebug: vi.fn(),
    ...overrides,
    get exitCode() {
      return exitCode;
    },
  } as unknown as KtxCliCommandContext;
}

function ingestProgram(context: KtxCliCommandContext): Command {
  const program = new Command().exitOverride().option('--project-dir <path>');
  registerIngestCommands(program, context, { runTextIngest: vi.fn(async () => 0) });
  return program;
}

describe('parseEnrichmentStagesOption', () => {
  it('parses a single stage', () => {
    expect(parseEnrichmentStagesOption('relationships')).toEqual(['relationships']);
  });

  it('orders and de-duplicates by the canonical registry order', () => {
    expect(parseEnrichmentStagesOption('embeddings,descriptions')).toEqual(['descriptions', 'embeddings']);
    expect(parseEnrichmentStagesOption('relationships,relationships,descriptions')).toEqual([
      'descriptions',
      'relationships',
    ]);
  });

  it('tolerates surrounding whitespace and empty segments', () => {
    expect(parseEnrichmentStagesOption(' descriptions , , embeddings ')).toEqual(['descriptions', 'embeddings']);
  });

  it('rejects an empty list', () => {
    expect(() => parseEnrichmentStagesOption('')).toThrow(/non-empty/);
    expect(() => parseEnrichmentStagesOption(' , ')).toThrow(/non-empty/);
  });

  it('rejects an unknown stage name', () => {
    expect(() => parseEnrichmentStagesOption('foo')).toThrow(/unknown stage "foo"/);
    expect(() => parseEnrichmentStagesOption('descriptions,foo')).toThrow(/unknown stage "foo"/);
  });
});

describe('ktx ingest --stages', () => {
  it('threads a parsed stage set into the public ingest args', async () => {
    const publicIngest = vi.fn(async (_args: unknown) => 0);
    const context = makeContext({ deps: { publicIngest } });
    const program = ingestProgram(context);

    await program.parseAsync(
      ['--project-dir', '/tmp/ktx', 'ingest', 'warehouse', '--stages', 'descriptions,embeddings'],
      { from: 'user' },
    );

    expect(publicIngest).toHaveBeenCalledTimes(1);
    expect(publicIngest.mock.calls[0]?.[0]).toMatchObject({
      command: 'run',
      targetConnectionId: 'warehouse',
      stages: ['descriptions', 'embeddings'],
    });
  });

  it('omits stages entirely when the flag is absent (default = all)', async () => {
    const publicIngest = vi.fn(async (_args: unknown) => 0);
    const context = makeContext({ deps: { publicIngest } });
    const program = ingestProgram(context);

    await program.parseAsync(['--project-dir', '/tmp/ktx', 'ingest', 'warehouse'], { from: 'user' });

    expect(publicIngest).toHaveBeenCalledTimes(1);
    expect(publicIngest.mock.calls[0]?.[0]).not.toHaveProperty('stages');
  });

  it('rejects an unknown stage with a clear parse error', async () => {
    const publicIngest = vi.fn(async (_args: unknown) => 0);
    const context = makeContext({ deps: { publicIngest } });
    const program = ingestProgram(context);

    await expect(
      program.parseAsync(['--project-dir', '/tmp/ktx', 'ingest', 'warehouse', '--stages', 'foo'], { from: 'user' }),
    ).rejects.toThrow(/unknown stage "foo"/);
    expect(publicIngest).not.toHaveBeenCalled();
  });

  it('rejects --stages combined with text capture', async () => {
    const publicIngest = vi.fn(async (_args: unknown) => 0);
    const runTextIngest = vi.fn(async () => 0);
    const context = makeContext({ deps: { publicIngest } });
    const program = new Command().exitOverride().option('--project-dir <path>');
    registerIngestCommands(program, context, { runTextIngest });

    await expect(
      program.parseAsync(['--project-dir', '/tmp/ktx', 'ingest', '--text', 'hi', '--stages', 'descriptions'], {
        from: 'user',
      }),
    ).rejects.toThrow(/--stages applies to database ingest only/);
    expect(publicIngest).not.toHaveBeenCalled();
    expect(runTextIngest).not.toHaveBeenCalled();
  });
});
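The tests above pin down the observable behavior of `parseEnrichmentStagesOption` without showing its body. One way to satisfy them is sketched below. `parseStagesSketch` is a hypothetical name, and the relative order of `embeddings` and `relationships` in `STAGE_ORDER` is an assumption: the tests only establish that `descriptions` precedes both.

```typescript
// Assumed canonical order; only "descriptions first" is confirmed by the tests.
const STAGE_ORDER = ['descriptions', 'embeddings', 'relationships'] as const;
type Stage = (typeof STAGE_ORDER)[number];

// Hypothetical parser: trims segments, drops empty ones, validates names,
// then returns the requested stages in canonical order without duplicates.
function parseStagesSketch(raw: string): Stage[] {
  const requested = raw
    .split(',')
    .map((segment) => segment.trim())
    .filter((segment) => segment.length > 0);
  if (requested.length === 0) {
    throw new Error('--stages requires a non-empty comma-separated list');
  }
  const known: readonly string[] = STAGE_ORDER;
  for (const name of requested) {
    if (!known.includes(name)) {
      throw new Error(`unknown stage "${name}"`);
    }
  }
  // Filtering the canonical list orders and de-duplicates in one pass.
  return STAGE_ORDER.filter((stage) => requested.includes(stage));
}

console.log(parseStagesSketch(' descriptions , , embeddings '));
```

Filtering the registry instead of sorting the input keeps the output stable regardless of how the user ordered or repeated the flag values.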
@@ -1,4 +1,5 @@
import { describe, expect, it, vi } from 'vitest';
import { KtxQueryError } from '../../../src/errors.js';
import { bigQueryConnectionConfigFromConfig, isKtxBigQueryConnectionConfig, type KtxBigQueryClient, KtxBigQueryScanConnector, type KtxBigQueryClientFactory, type KtxBigQueryDataset, type KtxBigQueryQueryJob, type KtxBigQueryTableRef, prepareBigQueryReadOnlyQuery } from '../../../src/connectors/bigquery/connector.js';
import { createBigQueryLiveDatabaseIntrospection } from '../../../src/connectors/bigquery/live-database-introspection.js';
import { tableRefSet } from '../../../src/context/scan/table-ref.js';
@@ -114,11 +115,40 @@ describe('KtxBigQueryScanConnector', () => {
    expect(isKtxBigQueryConnectionConfig({ driver: 'mysql' })).toBe(false);
    expect(bigQueryConnectionConfigFromConfig({ connectionId: 'warehouse', connection })).toMatchObject({
      projectId: 'project-1',
-      datasetIds: ['analytics'],
+      datasetIds: [{ project: 'project-1', dataset: 'analytics' }],
      location: 'US',
    });
  });

  it('parses project.dataset entries to host-project pairs and rejects malformed entries', () => {
    expect(
      bigQueryConnectionConfigFromConfig({
        connectionId: 'warehouse',
        connection: {
          driver: 'bigquery',
          dataset_ids: ['bigquery-public-data.austin_311', 'analytics'],
          credentials_json: JSON.stringify({ project_id: 'project-1' }),
        },
      }).datasetIds,
    ).toEqual([
      { project: 'bigquery-public-data', dataset: 'austin_311' },
      { project: 'project-1', dataset: 'analytics' },
    ]);

    for (const badEntry of ['proj.ds.table', 'proj.', '.ds']) {
      expect(() =>
        bigQueryConnectionConfigFromConfig({
          connectionId: 'warehouse',
          connection: {
            driver: 'bigquery',
            dataset_ids: [badEntry],
            credentials_json: JSON.stringify({ project_id: 'project-1' }),
          },
        }),
      ).toThrow(/connections\.warehouse/);
    }
  });

  it('introspects datasets, table metadata, primary keys, and normalized types', async () => {
    const connector = new KtxBigQueryScanConnector({
      connectionId: 'warehouse',
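The `dataset_ids` parsing exercised by the hunk above (bare `dataset` inherits the credential project, `project.dataset` names a host project, anything else is rejected) could look roughly like this. `parseDatasetEntry` and the error wording are hypothetical; only the `connections.<id>` prefix in the message is taken from the test's regular expression.

```typescript
interface DatasetRef {
  project: string;
  dataset: string;
}

// Hypothetical parser for one dataset_ids entry.
// Assumption: an empty string is rejected like the other malformed entries.
function parseDatasetEntry(entry: string, defaultProject: string, connectionId: string): DatasetRef {
  const parts = entry.split('.');
  const [first = '', second = ''] = parts;
  if (parts.length === 1 && first !== '') {
    return { project: defaultProject, dataset: first };
  }
  if (parts.length === 2 && first !== '' && second !== '') {
    return { project: first, dataset: second };
  }
  throw new Error(
    `connections.${connectionId}.dataset_ids: expected "dataset" or "project.dataset", got "${entry}"`,
  );
}

console.log(parseDatasetEntry('bigquery-public-data.austin_311', 'project-1', 'warehouse'));
console.log(parseDatasetEntry('analytics', 'project-1', 'warehouse'));
```

Keeping the project next to each dataset is what lets a single connection introspect foreign-hosted datasets while the billing project stays the one from the credentials.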
@@ -184,6 +214,84 @@ describe('KtxBigQueryScanConnector', () => {
    ]);
  });

  it('introspects a foreign-hosted dataset under its own project while billing stays local', async () => {
    const clientFactory = fakeClientFactory();
    const connector = new KtxBigQueryScanConnector({
      connectionId: 'warehouse',
      connection: {
        driver: 'bigquery',
        dataset_ids: ['bigquery-public-data.austin_311'],
        credentials_json: JSON.stringify({ project_id: 'project-1' }),
        location: 'US',
      },
      clientFactory,
    });

    const snapshot = await connector.introspect({ connectionId: 'warehouse', driver: 'bigquery' }, { runId: 'foreign' });

    const client = vi.mocked(clientFactory.createClient).mock.results[0]?.value as KtxBigQueryClient;
    expect(client.dataset).toHaveBeenCalledWith('austin_311', 'bigquery-public-data');
    expect(clientFactory.createClient).toHaveBeenCalledWith(expect.objectContaining({ projectId: 'project-1' }));
    expect(snapshot.scope).toEqual({
      catalogs: ['bigquery-public-data'],
      datasets: ['bigquery-public-data.austin_311'],
    });
    expect(snapshot.metadata.project_id).toBe('project-1');
    expect(snapshot.tables[0]).toMatchObject({
      catalog: 'bigquery-public-data',
      db: 'austin_311',
      name: 'orders',
    });
  });

  it('introspects datasets across multiple host projects, each under its own project', async () => {
    const clientFactory = fakeClientFactory();
    const connector = new KtxBigQueryScanConnector({
      connectionId: 'warehouse',
      connection: {
        driver: 'bigquery',
        dataset_ids: ['bigquery-public-data.austin_311', 'analytics'],
        credentials_json: JSON.stringify({ project_id: 'project-1' }),
        location: 'US',
      },
      clientFactory,
    });

    const snapshot = await connector.introspect({ connectionId: 'warehouse', driver: 'bigquery' }, { runId: 'multi' });

    const client = vi.mocked(clientFactory.createClient).mock.results[0]?.value as KtxBigQueryClient;
    expect(client.dataset).toHaveBeenCalledWith('austin_311', 'bigquery-public-data');
    expect(client.dataset).toHaveBeenCalledWith('analytics', 'project-1');
    expect(snapshot.scope.catalogs).toEqual(['bigquery-public-data', 'project-1']);
    expect(snapshot.scope.datasets).toEqual(['bigquery-public-data.austin_311', 'analytics']);
    expect(snapshot.tables.map((table) => ({ catalog: table.catalog, db: table.db, name: table.name }))).toEqual([
      { catalog: 'bigquery-public-data', db: 'austin_311', name: 'orders' },
      { catalog: 'project-1', db: 'analytics', name: 'orders' },
    ]);
  });

  it('keeps same-named datasets in different projects distinct', async () => {
    const clientFactory = fakeClientFactory();
    const connector = new KtxBigQueryScanConnector({
      connectionId: 'warehouse',
      connection: {
        driver: 'bigquery',
        dataset_ids: ['proj_a.shared', 'proj_b.shared'],
        credentials_json: JSON.stringify({ project_id: 'project-1' }),
      },
      clientFactory,
    });

    const snapshot = await connector.introspect({ connectionId: 'warehouse', driver: 'bigquery' }, { runId: 'same-name' });

    expect(snapshot.scope.catalogs).toEqual(['proj_a', 'proj_b']);
    expect(snapshot.scope.datasets).toEqual(['proj_a.shared', 'proj_b.shared']);
    expect(snapshot.tables.map((table) => `${table.catalog}.${table.db}.${table.name}`)).toEqual([
      'proj_a.shared.orders',
      'proj_b.shared.orders',
    ]);
  });

  it.each([
    Object.assign(new Error('Access Denied'), { code: 403 }),
    Object.assign(new Error('Not found'), { errors: [{ reason: 'notFound' }] }),
@@ -330,6 +438,50 @@ describe('KtxBigQueryScanConnector', () => {
    expect(skippedGet).not.toHaveBeenCalled();
  });

  it('skips a table that fails introspection and ingests its healthy siblings', async () => {
    const ordersGet = vi.fn(async (): ReturnType<KtxBigQueryTableRef['get']> => [
      { metadata: { type: 'TABLE', numRows: '5', schema: { fields: [{ name: 'id', type: 'INT64', mode: 'REQUIRED' }] } } },
    ]);
    const brokenGet = vi.fn(async (): ReturnType<KtxBigQueryTableRef['get']> => {
      throw new Error('Access Denied: Table project-1:analytics.locked');
    });
    const clientFactory: KtxBigQueryClientFactory = {
      createClient: vi.fn(() => ({
        getDatasets: vi.fn(async (): ReturnType<KtxBigQueryClient['getDatasets']> => [[{ id: 'analytics' }]]),
        dataset: vi.fn(
          (): KtxBigQueryDataset => ({
            get: vi.fn(async () => [{ id: 'analytics' }]),
            getTables: vi.fn(async (): ReturnType<KtxBigQueryDataset['getTables']> => [
              [
                { id: 'orders', get: ordersGet },
                { id: 'locked', get: brokenGet },
              ],
            ]),
          }),
        ),
        createQueryJob: vi.fn(async (): ReturnType<KtxBigQueryClient['createQueryJob']> => [
          {
            getQueryResults: async (): ReturnType<KtxBigQueryQueryJob['getQueryResults']> => [
              [],
              undefined,
              { schema: { fields: [{ name: 'table_name', type: 'STRING' }, { name: 'column_name', type: 'STRING' }] } },
            ],
          },
        ]),
      })),
    };
    const connector = new KtxBigQueryScanConnector({ connectionId: 'warehouse', connection, clientFactory });
    const snapshot = await connector.introspect({ connectionId: 'warehouse', driver: 'bigquery' }, { runId: 'skip-test' });

    expect(snapshot.tables.map((table) => table.name)).toEqual(['orders']);
    expect(snapshot.warnings).toHaveLength(1);
    expect(snapshot.warnings?.[0]).toMatchObject({
      code: 'object_introspection_failed',
      table: 'locked',
      metadata: { object: 'project-1.analytics.locked' },
    });
  });

  it('constructs for discovery without dataset scope and lists tables through one region information schema query', async () => {
    const createQueryJob = vi.fn(
      async (
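The skip-and-warn behavior asserted in the hunk above (one broken table becomes a warning while its siblings are still ingested) follows a simple collect-errors pattern. The sketch below is synchronous for brevity, whereas the connector's `introspect` is async; `introspectAll` is a hypothetical name, and the warning shape only mirrors the fields the tests assert.

```typescript
// Warning fields mirror what the tests assert; the real type may carry more.
interface IntrospectionWarning {
  code: 'object_introspection_failed';
  table: string;
  message: string;
  recoverable: true;
}

// Hypothetical driver loop: a failure on one object is recorded and skipped
// instead of aborting the whole scan.
function introspectAll<T>(
  names: string[],
  introspectOne: (name: string) => T,
): { tables: T[]; warnings: IntrospectionWarning[] } {
  const tables: T[] = [];
  const warnings: IntrospectionWarning[] = [];
  for (const name of names) {
    try {
      tables.push(introspectOne(name));
    } catch (error) {
      warnings.push({
        code: 'object_introspection_failed',
        table: name,
        message: error instanceof Error ? error.message : String(error),
        recoverable: true,
      });
    }
  }
  return { tables, warnings };
}

const scan = introspectAll(['orders', 'locked'], (name) => {
  if (name === 'locked') throw new Error('Access Denied: Table project-1:analytics.locked');
  return { name };
});
console.log(scan.tables.map((table) => table.name), scan.warnings.length);
```

Returning the warnings alongside the tables lets a caller decide whether a partially successful scan is acceptable.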
@@ -441,7 +593,7 @@ describe('KtxBigQueryScanConnector', () => {
    const clientFactory = fakeClientFactory();
    const connector = new KtxBigQueryScanConnector({
      connectionId: 'warehouse',
-      connection: { ...connection, max_bytes_billed: '987654321', job_timeout_ms: 30_000 },
+      connection: { ...connection, max_bytes_billed: '987654321', query_timeout_ms: 30_000 },
      clientFactory,
    });
@@ -491,4 +643,35 @@ describe('KtxBigQueryScanConnector', () => {
      ]),
    });
  });

  it('maps a BigQuery job timeout to KtxQueryError', async () => {
    const timeoutError = new Error('Job execution was cancelled: Job timed out after 5000ms');
    const clientFactory: KtxBigQueryClientFactory = {
      createClient: vi.fn(() => ({
        getDatasets: vi.fn(async (): ReturnType<KtxBigQueryClient['getDatasets']> => [[{ id: 'analytics' }]]),
        dataset: vi.fn(
          (datasetId: string): KtxBigQueryDataset => ({
            get: vi.fn(async () => [{ id: datasetId }]),
            getTables: vi.fn(async (): ReturnType<KtxBigQueryDataset['getTables']> => [[]]),
          }),
        ),
        createQueryJob: vi.fn(async (): ReturnType<KtxBigQueryClient['createQueryJob']> => {
          throw timeoutError;
        }),
      })),
    };
    const connector = new KtxBigQueryScanConnector({
      connectionId: 'warehouse',
      connection: { ...connection, query_timeout_ms: 5_000 },
      clientFactory,
    });

    const execution = connector.executeReadOnly(
      { connectionId: 'warehouse', sql: 'select count(*) from `project-1`.`analytics`.`orders`' },
      { runId: 'scan-run-1' },
    );
    await expect(execution).rejects.toBeInstanceOf(KtxQueryError);
    await expect(execution).rejects.toThrow('query exceeded 5s');
    await expect(execution).rejects.toMatchObject({ cause: timeoutError });
  });
});
@@ -1,4 +1,5 @@
import { describe, expect, it, vi } from 'vitest';
import { KtxQueryError } from '../../../src/errors.js';
import { clickHouseClientConfigFromConfig, isKtxClickHouseConnectionConfig, KtxClickHouseScanConnector, prepareClickHouseReadOnlyQuery, type KtxClickHouseClientFactory } from '../../../src/connectors/clickhouse/connector.js';
import { createClickHouseLiveDatabaseIntrospection } from '../../../src/connectors/clickhouse/live-database-introspection.js';
import { tableRefSet } from '../../../src/context/scan/table-ref.js';
@@ -385,6 +386,43 @@ describe('KtxClickHouseScanConnector', () => {
    await connector.cleanup();
  });

  it('applies max_execution_time + an outlasting request_timeout and maps code 159 to KtxQueryError', async () => {
    let capturedConfig: { request_timeout?: number; clickhouse_settings?: Record<string, unknown> } | undefined;
    const timeoutError = Object.assign(new Error('Code: 159. DB::Exception: Timeout exceeded'), { code: 159 });
    const clientFactory: KtxClickHouseClientFactory = {
      createClient: vi.fn((config) => {
        capturedConfig = config as { request_timeout?: number; clickhouse_settings?: Record<string, unknown> };
        return {
          query: vi.fn(async () => {
            throw timeoutError;
          }),
          close: vi.fn(async () => undefined),
        };
      }),
    };
    const connector = new KtxClickHouseScanConnector({
      connectionId: 'warehouse',
      connection: {
        driver: 'clickhouse',
        host: 'ch.example.test',
        database: 'analytics',
        username: 'reader',
        password: 'test-pass', // pragma: allowlist secret
        query_timeout_ms: 5_000,
      },
      clientFactory,
    });

    const execution = connector.executeReadOnly(
      { connectionId: 'warehouse', sql: 'select count(*) from events' },
      { runId: 'scan-run-1' },
    );
    await expect(execution).rejects.toBeInstanceOf(KtxQueryError);
    await expect(execution).rejects.toThrow('query exceeded 5s');
    expect(capturedConfig?.clickhouse_settings?.max_execution_time).toBe(5);
    expect(capturedConfig?.request_timeout).toBe(10_000);
  });

  it('adapts native ClickHouse snapshots to live-database introspection for local ingest', async () => {
    const introspection = createClickHouseLiveDatabaseIntrospection({
      connections: {
@@ -1,5 +1,6 @@
import { describe, expect, it, vi } from 'vitest';
import type { FieldPacket, RowDataPacket } from 'mysql2/promise';
import { KtxQueryError } from '../../../src/errors.js';
import { createMysqlLiveDatabaseIntrospection } from '../../../src/connectors/mysql/live-database-introspection.js';
import { isKtxMysqlConnectionConfig, KtxMysqlScanConnector, mysqlConnectionPoolConfigFromConfig, prepareMysqlReadOnlyQuery, type KtxMysqlConnectionConfig, type KtxMysqlPoolFactory } from '../../../src/connectors/mysql/connector.js';
import { tableRefSet } from '../../../src/context/scan/table-ref.js';
@@ -84,6 +85,9 @@ function fakePoolFactory(): KtxMysqlPoolFactory {
        [{ name: 'column_name' }, { name: 'estimated_cardinality' }],
      );
    }
    if (/^\s*SET SESSION max_execution_time/i.test(sql)) {
      return mysqlResult([], []);
    }
    throw new Error(`Unexpected SQL: ${sql} params=${JSON.stringify(params)}`);
  });
  const release = vi.fn();
@@ -172,6 +176,9 @@ function multiSchemaMysqlPoolFactory(
      expect(params).toEqual(['analytics', 'mart']);
      return mysqlResult([], []);
    }
    if (/^\s*SET SESSION max_execution_time/i.test(sql)) {
      return mysqlResult([], []);
    }
    throw new Error(`Unexpected SQL: ${sql} params=${JSON.stringify(params)}`);
  });
  return {
@@ -596,4 +603,47 @@ describe('KtxMysqlScanConnector', () => {
      foreignKeys: [],
    });
  });

  it('sets session max_execution_time to the resolved deadline and maps errno 3024 to KtxQueryError', async () => {
    const issued: Array<{ sql: string; params?: unknown }> = [];
    const timeoutError = Object.assign(new Error('Query execution was interrupted, maximum statement execution time exceeded'), {
      errno: 3024,
      code: 'ER_QUERY_TIMEOUT',
    });
    const poolFactory: KtxMysqlPoolFactory = {
      createPool: vi.fn(() => ({
        getConnection: vi.fn(async () => ({
          query: vi.fn(async (sql: string, params?: unknown) => {
            issued.push({ sql, params });
            if (/^\s*SET SESSION max_execution_time/i.test(sql)) {
              return [[], []] as [RowDataPacket[], FieldPacket[]];
            }
            throw timeoutError;
          }),
          release: vi.fn(),
        })),
        end: vi.fn(async () => undefined),
      })),
    };
    const connector = new KtxMysqlScanConnector({
      connectionId: 'warehouse',
      connection: {
        driver: 'mysql',
        host: 'db.example.test',
        database: 'analytics',
        username: 'reader',
        password: 'test-password', // pragma: allowlist secret
        query_timeout_ms: 5_000,
      },
      poolFactory,
    });

    const execution = connector.executeReadOnly(
      { connectionId: 'warehouse', sql: 'select count(*) from orders' },
      { runId: 'scan-run-1' },
    );
    await expect(execution).rejects.toBeInstanceOf(KtxQueryError);
    await expect(execution).rejects.toThrow('query exceeded 5s');
    expect(issued[0]).toEqual({ sql: 'SET SESSION max_execution_time = ?', params: [5_000] });
  });
});
@@ -1,4 +1,5 @@
import { describe, expect, it, vi } from 'vitest';
import { KtxQueryError } from '../../../src/errors.js';
import { createPostgresLiveDatabaseIntrospection } from '../../../src/connectors/postgres/live-database-introspection.js';
import { isKtxPostgresConnectionConfig, KtxPostgresScanConnector, postgresPoolConfigFromConfig, preparePostgresReadOnlyQuery, type KtxPostgresConnectionConfig, type KtxPostgresPoolFactory } from '../../../src/connectors/postgres/connector.js';
import { tableRefSet } from '../../../src/context/scan/table-ref.js';
@@ -148,7 +149,7 @@ describe('KtxPostgresScanConnector', () => {
      database: 'analytics',
      user: 'reader',
      password: 'test-password', // pragma: allowlist secret
-      options: '-c search_path=analytics,public',
+      options: '-c search_path=analytics,public -c statement_timeout=30000',
      ssl: { rejectUnauthorized: false },
    });
    const libpqPreferConfig = postgresPoolConfigFromConfig({
@@ -401,6 +402,61 @@ describe('KtxPostgresScanConnector', () => {
    ).rejects.toThrow('Only read-only SELECT/WITH queries can be executed locally');
  });

  it('applies the resolved deadline as statement_timeout and maps a 57014 cancellation to KtxQueryError', () => {
    expect(
      postgresPoolConfigFromConfig({
        connectionId: 'warehouse',
        connection: {
          driver: 'postgres',
          host: 'db.example.test',
          database: 'analytics',
          username: 'reader',
          password: 'test-password', // pragma: allowlist secret
          query_timeout_ms: 5_000,
        },
      }).options,
    ).toBe('-c search_path=public -c statement_timeout=5000');
  });

  it('maps a Postgres statement_timeout cancellation (57014) to a KtxQueryError', async () => {
    const timeoutError = Object.assign(new Error('canceling statement due to statement timeout'), { code: '57014' });
    const poolFactory: KtxPostgresPoolFactory = {
      createPool() {
        return {
          async connect() {
            return {
              query: vi.fn(async () => {
                throw timeoutError;
              }),
              release: vi.fn(),
            };
          },
          end: vi.fn(async () => undefined),
        };
      },
    };
    const connector = new KtxPostgresScanConnector({
      connectionId: 'warehouse',
      connection: {
        driver: 'postgres',
        host: 'db.example.test',
        database: 'analytics',
        username: 'reader',
        password: 'test-password', // pragma: allowlist secret
        query_timeout_ms: 5_000,
      },
      poolFactory,
    });

    const execution = connector.executeReadOnly(
      { connectionId: 'warehouse', sql: 'select count(*) from orders' },
      { runId: 'scan-run-1' },
    );
    await expect(execution).rejects.toBeInstanceOf(KtxQueryError);
    await expect(execution).rejects.toThrow('query exceeded 5s');
    await expect(execution).rejects.toMatchObject({ cause: timeoutError });
  });

  it('limits introspection to tables in tableScope', async () => {
    const queries: Array<{ sql: string; params?: unknown[] }> = [];
    const poolFactory: KtxPostgresPoolFactory = {
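The timeout tests in this commit share one expectation: a driver-specific cancellation is rethrown as a uniform "query exceeded Ns" error that keeps the original as `cause`. A minimal sketch of that mapping follows. `isDriverTimeout` and `mapTimeout` are hypothetical names, the plain `Error` stands in for the project's `KtxQueryError` (whose constructor is not shown here), and the code table is taken only from the test fixtures; BigQuery, which the tests identify by message, is left out.

```typescript
type DriverErrorFields = { code?: string | number; errno?: number };

// Timeout markers as they appear in the test fixtures:
// Postgres SQLSTATE '57014', MySQL errno 3024, ClickHouse code 159, Snowflake code 604.
function isDriverTimeout(error: unknown): boolean {
  if (typeof error !== 'object' || error === null) return false;
  const { code, errno } = error as DriverErrorFields;
  return code === '57014' || errno === 3024 || code === 159 || code === 604;
}

// Hypothetical mapper: timeouts become a uniform error with the original
// attached as `cause`; everything else passes through untouched.
function mapTimeout(error: unknown, timeoutMs: number): unknown {
  if (!isDriverTimeout(error)) return error;
  const seconds = timeoutMs / 1000;
  return Object.assign(new Error(`query exceeded ${seconds}s`), {
    name: 'QueryTimeoutSketch',
    cause: error,
  });
}

const pgTimeout = Object.assign(new Error('canceling statement due to statement timeout'), { code: '57014' });
console.log((mapTimeout(pgTimeout, 5_000) as Error).message);
```

Keeping the driver error as `cause` preserves the original diagnostics while callers only need to handle one error shape.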
@@ -7,6 +7,7 @@ vi.mock('snowflake-sdk', () => ({
  createPool,
}));

import { KtxQueryError } from '../../../src/errors.js';
import { createSnowflakeLiveDatabaseIntrospection } from '../../../src/connectors/snowflake/live-database-introspection.js';
import { isKtxSnowflakeConnectionConfig, KtxSnowflakeScanConnector, prepareSnowflakeReadOnlyQuery, snowflakeConnectionConfigFromConfig, type KtxSnowflakeConnectionConfig, type KtxSnowflakeDriver, type KtxSnowflakeDriverFactory } from '../../../src/connectors/snowflake/connector.js';
import { tableRefSet } from '../../../src/context/scan/table-ref.js';
@@ -271,6 +272,57 @@ describe('KtxSnowflakeScanConnector', () => {
    expect(close).toHaveBeenCalledTimes(1);
  });

  it('sets STATEMENT_TIMEOUT_IN_SECONDS to the resolved deadline and maps a Snowflake timeout to KtxQueryError', async () => {
    createPool.mockReset();
    const executedSql: string[] = [];
    const timeoutError = Object.assign(
      new Error('Statement reached its statement or warehouse timeout of 5 second(s) and was canceled.'),
      { code: 604 },
    );
    const connection = {
      execute: vi.fn(
        (input: {
          sqlText: string;
          complete: (error: Error | null, statement: ReturnType<typeof fakeSnowflakeStatement>, rows: unknown[]) => void;
        }) => {
          executedSql.push(input.sqlText);
          if (/^ALTER SESSION/i.test(input.sqlText)) {
            input.complete(null, fakeSnowflakeStatement(), [{ ONE: 1 }]);
          } else {
            input.complete(timeoutError, fakeSnowflakeStatement(), []);
          }
        },
      ),
    };
    createPool.mockReturnValue({
      use: vi.fn(async (fn: (conn: typeof connection) => Promise<unknown>) => fn(connection)),
      drain: vi.fn(async () => undefined),
      clear: vi.fn(async () => undefined),
    });
    const connector = new KtxSnowflakeScanConnector({
      connectionId: 'warehouse',
      connection: {
        driver: 'snowflake',
        authMethod: 'password',
        account: 'acct',
        warehouse: 'WH',
        database: 'ANALYTICS',
        schema_name: 'PUBLIC',
        username: 'reader',
        password: 'fixture-pass', // pragma: allowlist secret
        query_timeout_ms: 5_000,
      },
    });

    const execution = connector.executeReadOnly(
      { connectionId: 'warehouse', sql: 'select count(*) from orders' },
      { runId: 'run-1' },
    );
    await expect(execution).rejects.toBeInstanceOf(KtxQueryError);
    await expect(execution).rejects.toThrow('query exceeded 5s');
    expect(executedSql[0]).toBe('ALTER SESSION SET STATEMENT_TIMEOUT_IN_SECONDS = 5');
  });

  it('introspects schema, primary keys, comments, row counts, and dimensions', async () => {
    const connector = new KtxSnowflakeScanConnector({
      connectionId: 'warehouse',
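Across the connector tests, one `query_timeout_ms` value fans out into per-driver settings with different units: milliseconds for Postgres and MySQL, seconds for ClickHouse and Snowflake. The sketch below collects the values the tests assert for 5_000 ms. `deadlineSettings` is a hypothetical name, rounding up to whole seconds is an assumption, and the ClickHouse `request_timeout` formula (deadline plus a 5s grace) is a guess that merely matches the single data point in the tests (5_000 → 10_000).

```typescript
interface DeadlineSettings {
  postgresOption: string;
  mysql: { sql: string; params: number[] };
  clickhouse: { max_execution_time: number; request_timeout: number };
  snowflakeSql: string;
}

// Hypothetical fan-out of one deadline into per-driver settings.
function deadlineSettings(timeoutMs: number): DeadlineSettings {
  // Assumption: sub-second remainders round up so the server-side limit is
  // never shorter than the configured deadline.
  const seconds = Math.ceil(timeoutMs / 1000);
  return {
    // Postgres statement_timeout is expressed in milliseconds.
    postgresOption: `-c statement_timeout=${timeoutMs}`,
    // MySQL max_execution_time is expressed in milliseconds.
    mysql: { sql: 'SET SESSION max_execution_time = ?', params: [timeoutMs] },
    // ClickHouse max_execution_time is in seconds; the HTTP request timeout
    // must outlast it so the server-side error arrives first.
    // Assumption: a fixed 5s grace period.
    clickhouse: { max_execution_time: seconds, request_timeout: timeoutMs + 5_000 },
    // Snowflake takes whole seconds.
    snowflakeSql: `ALTER SESSION SET STATEMENT_TIMEOUT_IN_SECONDS = ${seconds}`,
  };
}

console.log(deadlineSettings(5_000));
```

Letting the client-side timeout outlast the server-side one means the database reports the timeout itself, which is what allows the error to be classified by its driver code.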
@@ -1,12 +1,19 @@
import Database from 'better-sqlite3';
import type { ChildProcess } from 'node:child_process';
import { writeFileSync } from 'node:fs';
import { mkdtemp, rm } from 'node:fs/promises';
import { tmpdir } from 'node:os';
import { join } from 'node:path';
-import { afterEach, beforeEach, describe, expect, it } from 'vitest';
+import { afterEach, beforeEach, describe, expect, it, vi } from 'vitest';
import { createSqliteLiveDatabaseIntrospection } from '../../../src/connectors/sqlite/live-database-introspection.js';
-import { isKtxSqliteConnectionConfig, KtxSqliteScanConnector, sqliteDatabasePathFromConfig } from '../../../src/connectors/sqlite/connector.js';
+import {
+  forkReadQueryChild,
+  isKtxSqliteConnectionConfig,
+  KtxSqliteScanConnector,
+  sqliteDatabasePathFromConfig,
+} from '../../../src/connectors/sqlite/connector.js';
import { tableRefSet } from '../../../src/context/scan/table-ref.js';
import { resolveEnabledTables } from '../../../src/context/scan/enabled-tables.js';

describe('KtxSqliteScanConnector', () => {
  let tempDir: string;
@@ -150,6 +157,74 @@ describe('KtxSqliteScanConnector', () => {
    ]);
  });

  it('skips an object that fails introspection and ingests the rest with one recoverable warning', async () => {
    const brokenDbPath = join(tempDir, 'broken.db');
    const brokenDb = new Database(brokenDbPath);
    brokenDb.exec(`
      CREATE TABLE base (id INTEGER PRIMARY KEY, start_date TEXT);
      CREATE VIEW emp_hire_periods_with_name AS SELECT id, start_date FROM base;
      CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
      INSERT INTO customers (id, name) VALUES (1, 'Ada');
      DROP TABLE base;
    `);
    brokenDb.close();

    const connector = new KtxSqliteScanConnector({
      connectionId: 'warehouse',
      connection: { driver: 'sqlite', path: brokenDbPath },
    });

    const snapshot = await connector.introspect({ connectionId: 'warehouse', driver: 'sqlite' }, { runId: 'scan-run-broken' });

    expect(snapshot.tables.map((table) => table.name)).toEqual(['customers']);
    expect(snapshot.warnings).toHaveLength(1);
    expect(snapshot.warnings?.[0]).toMatchObject({
      code: 'object_introspection_failed',
      table: 'emp_hire_periods_with_name',
      recoverable: true,
    });
    expect(snapshot.warnings?.[0]?.message).toContain('no such table');
  });

  it('returns no tables and only warnings when every object fails introspection', async () => {
    const brokenDbPath = join(tempDir, 'all-broken.db');
    const brokenDb = new Database(brokenDbPath);
    brokenDb.exec(`
      CREATE TABLE base (id INTEGER PRIMARY KEY, value TEXT);
      CREATE VIEW only_view AS SELECT id, value FROM base;
      DROP TABLE base;
    `);
    brokenDb.close();

    const connector = new KtxSqliteScanConnector({
      connectionId: 'warehouse',
      connection: { driver: 'sqlite', path: brokenDbPath },
    });

    const snapshot = await connector.introspect({ connectionId: 'warehouse', driver: 'sqlite' }, { runId: 'scan-run-all-broken' });

    expect(snapshot.tables).toEqual([]);
    expect(snapshot.warnings).toHaveLength(1);
    expect(snapshot.warnings?.[0]?.code).toBe('object_introspection_failed');
  });

  it('restricts introspection to enabled_tables, accepting both "main.<name>" and bare "<name>"', async () => {
    const connector = new KtxSqliteScanConnector({
      connectionId: 'warehouse',
      connection: { driver: 'sqlite', path: dbPath },
|
||||
});
|
||||
|
||||
for (const entry of ['main.customers', 'customers']) {
|
||||
const tableScope = resolveEnabledTables({ driver: 'sqlite', enabled_tables: [entry] }) ?? undefined;
|
||||
const snapshot = await connector.introspect(
|
||||
{ connectionId: 'warehouse', driver: 'sqlite', ...(tableScope ? { tableScope } : {}) },
|
||||
{ runId: `scan-run-scope-${entry}` },
|
||||
);
|
||||
expect(snapshot.tables.map((table) => table.name)).toEqual(['customers']);
|
||||
expect(snapshot.metadata.discovered_object_names).toEqual(['customers', 'orders', 'recent_orders']);
|
||||
}
|
||||
});
|
||||
|
||||
it('lists schemaless tables and views for setup discovery', async () => {
|
||||
const connector = new KtxSqliteScanConnector({
|
||||
connectionId: 'warehouse',
|
||||
|
|
@@ -224,6 +299,101 @@ describe('KtxSqliteScanConnector', () => {
|
|||
expect(snapshot.tables.map((table) => table.name)).toEqual(['orders']);
|
||||
});
|
||||
|
||||
describe('bounded read-query execution', () => {
|
||||
// A recursive CTE that spins ~1e9 iterations in SQLite's VM with no yield
|
||||
// point — the single-aggregate-row shape that maxRows cannot bound. Natural
|
||||
// completion is far beyond the test window, so a fast finish proves the
|
||||
// child was killed, not that the query completed.
|
||||
const pathologicalSql =
|
||||
'WITH RECURSIVE c(x) AS (SELECT 1 UNION ALL SELECT x + 1 FROM c WHERE x < 1000000000) SELECT COUNT(*) AS n FROM c';
|
||||
|
||||
let children: ChildProcess[];
|
||||
const trackingSpawn = () => {
|
||||
const child = forkReadQueryChild();
|
||||
children.push(child);
|
||||
return child;
|
||||
};
|
||||
|
||||
beforeEach(() => {
|
||||
children = [];
|
||||
});
|
||||
|
||||
afterEach(() => {
|
||||
for (const child of children) {
|
||||
if (child.exitCode === null && child.signalCode === null) {
|
||||
child.kill('SIGKILL');
|
||||
}
|
||||
}
|
||||
});
|
||||
|
||||
it('terminates a pathological query at the deadline, keeps the event loop free, and reaps the child', async () => {
|
||||
const connector = new KtxSqliteScanConnector({
|
||||
connectionId: 'warehouse',
|
||||
connection: { driver: 'sqlite', path: dbPath, query_timeout_ms: 250 },
|
||||
spawnReadQueryChild: trackingSpawn,
|
||||
});
|
||||
|
||||
const pending = connector.executeReadOnly(
|
||||
{ connectionId: 'warehouse', sql: pathologicalSql },
|
||||
{ runId: 'deadline-test' },
|
||||
);
|
||||
|
||||
// The event loop stays free while the query runs off-process, so this
|
||||
// concurrent timer fires before the deadline rejects the query.
|
||||
let concurrentFiredWhilePending = false;
|
||||
void pending.catch(() => {});
|
||||
await new Promise((resolveTimer) => setTimeout(resolveTimer, 80));
|
||||
concurrentFiredWhilePending = true;
|
||||
|
||||
await expect(pending).rejects.toThrow(/^query exceeded \d+s$/);
|
||||
expect(concurrentFiredWhilePending).toBe(true);
|
||||
|
||||
// The off-process executor was actually killed (SIGKILL), not left spinning.
|
||||
expect(children).toHaveLength(1);
|
||||
const child = children[0]!;
|
||||
await vi.waitFor(() => expect(child.exitCode !== null || child.signalCode !== null).toBe(true), {
|
||||
timeout: 5_000,
|
||||
});
|
||||
expect(child.signalCode).toBe('SIGKILL');
|
||||
});
|
||||
|
||||
it('returns identical results to the in-process path for a normal query', async () => {
|
||||
const connector = new KtxSqliteScanConnector({
|
||||
connectionId: 'warehouse',
|
||||
connection: { driver: 'sqlite', path: dbPath },
|
||||
spawnReadQueryChild: trackingSpawn,
|
||||
});
|
||||
|
||||
await expect(
|
||||
connector.executeReadOnly(
|
||||
{ connectionId: 'warehouse', sql: 'select id, status from orders order by id' },
|
||||
{ runId: 'normal' },
|
||||
),
|
||||
).resolves.toEqual({
|
||||
headers: ['id', 'status'],
|
||||
rows: [
|
||||
[10, 'paid'],
|
||||
[11, 'open'],
|
||||
],
|
||||
totalRows: 2,
|
||||
rowCount: 2,
|
||||
});
|
||||
});
|
||||
|
||||
it('rejects invalid SQL on the main thread without spawning a child', async () => {
|
||||
const connector = new KtxSqliteScanConnector({
|
||||
connectionId: 'warehouse',
|
||||
connection: { driver: 'sqlite', path: dbPath },
|
||||
spawnReadQueryChild: trackingSpawn,
|
||||
});
|
||||
|
||||
await expect(
|
||||
connector.executeReadOnly({ connectionId: 'warehouse', sql: 'delete from orders' }, { runId: 'invalid' }),
|
||||
).rejects.toThrow('Only read-only SELECT/WITH queries can be executed locally');
|
||||
expect(children).toHaveLength(0);
|
||||
});
|
||||
});
|
||||
|
||||
it('adapts native SQLite snapshots to live-database introspection for local ingest', async () => {
|
||||
const introspection = createSqliteLiveDatabaseIntrospection({
|
||||
projectDir: tempDir,
|
||||
|
|
|
|||
|
|
@@ -1,4 +1,5 @@
 import { describe, expect, it, vi } from 'vitest';
+import { KtxQueryError } from '../../../src/errors.js';
 import { createSqlServerLiveDatabaseIntrospection } from '../../../src/connectors/sqlserver/live-database-introspection.js';
 import { isKtxSqlServerConnectionConfig, KtxSqlServerScanConnector, prepareSqlServerReadOnlyQuery, sqlServerConnectionPoolConfigFromConfig, type KtxSqlServerConnectionConfig, type KtxSqlServerPoolFactory, type KtxSqlServerQueryResult } from '../../../src/connectors/sqlserver/connector.js';
 import { tableRefSet } from '../../../src/context/scan/table-ref.js';
|
||||
|
|
@@ -404,6 +405,52 @@ describe('KtxSqlServerScanConnector', () => {
|
|||
await connector.cleanup();
|
||||
});
|
||||
|
||||
it('sets requestTimeout to the resolved deadline and maps an ETIMEOUT to KtxQueryError', async () => {
|
||||
expect(
|
||||
sqlServerConnectionPoolConfigFromConfig({
|
||||
connectionId: 'warehouse',
|
||||
connection: {
|
||||
driver: 'sqlserver',
|
||||
host: 'db.example.test',
|
||||
database: 'analytics',
|
||||
username: 'reader',
|
||||
query_timeout_ms: 5_000,
|
||||
},
|
||||
}),
|
||||
).toMatchObject({ requestTimeout: 5_000 });
|
||||
|
||||
const timeoutError = Object.assign(new Error('Timeout: Request failed to complete in 5000ms'), { code: 'ETIMEOUT' });
|
||||
const poolFactory: KtxSqlServerPoolFactory = {
|
||||
createPool: vi.fn(async () => {
|
||||
const request = {
|
||||
input: vi.fn(() => request),
|
||||
query: vi.fn(async () => {
|
||||
throw timeoutError;
|
||||
}),
|
||||
};
|
||||
return { request: () => request, close: vi.fn(async () => undefined) };
|
||||
}),
|
||||
};
|
||||
const connector = new KtxSqlServerScanConnector({
|
||||
connectionId: 'warehouse',
|
||||
connection: {
|
||||
driver: 'sqlserver',
|
||||
host: 'db.example.test',
|
||||
database: 'analytics',
|
||||
username: 'reader',
|
||||
query_timeout_ms: 5_000,
|
||||
},
|
||||
poolFactory,
|
||||
});
|
||||
|
||||
const execution = connector.executeReadOnly(
|
||||
{ connectionId: 'warehouse', sql: 'select count(*) from dbo.orders' },
|
||||
{ runId: 'scan-run-1' },
|
||||
);
|
||||
await expect(execution).rejects.toBeInstanceOf(KtxQueryError);
|
||||
await expect(execution).rejects.toThrow('query exceeded 5s');
|
||||
});
|
||||
|
||||
it('hoists leading CTEs before applying the SQL Server TOP wrapper', async () => {
|
||||
const queries: string[] = [];
|
||||
const request = {
|
||||
|
|
|
|||
|
|
@@ -0,0 +1,26 @@
|
|||
import { describe, expect, it } from 'vitest';
|
||||
import type { KtxProjectConnectionConfig } from '../../../src/context/project/config.js';
|
||||
import { assertConfiguredConnectionId } from '../../../src/context/connections/configured-connections.js';
|
||||
|
||||
const connections = {
|
||||
sales_db: { driver: 'sqlite' } as unknown as KtxProjectConnectionConfig,
|
||||
events_db: { driver: 'sqlite' } as unknown as KtxProjectConnectionConfig,
|
||||
};
|
||||
|
||||
describe('assertConfiguredConnectionId', () => {
|
||||
it('returns the id when configured', () => {
|
||||
expect(assertConfiguredConnectionId(connections, 'sales_db')).toBe('sales_db');
|
||||
});
|
||||
|
||||
it('throws listing the configured ids when unknown', () => {
|
||||
expect(() => assertConfiguredConnectionId(connections, 'warehouse')).toThrow(
|
||||
'Unknown connection "warehouse". Configured connections: events_db, sales_db.',
|
||||
);
|
||||
});
|
||||
|
||||
it('reports none configured for an empty connections map', () => {
|
||||
expect(() => assertConfiguredConnectionId({}, 'warehouse')).toThrow(
|
||||
'Unknown connection "warehouse". Configured connections: (none configured).',
|
||||
);
|
||||
});
|
||||
});
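The three cases above pin down the observable contract of `assertConfiguredConnectionId`: return the id when it is configured, otherwise throw with the configured ids listed in sorted order, or `(none configured)` for an empty map. A minimal sketch that would satisfy those expectations follows. It assumes a plain `Record<string, unknown>` connections map and a generic `Error`; the shipped helper in `configured-connections.js` is not shown in this diff and may use a different error class or a longer message.

```typescript
// Sketch only: mirrors the behaviour the tests assert, not the shipped implementation.
function assertConfiguredConnectionId(
  connections: Record<string, unknown>,
  connectionId: string,
): string {
  // Own-property check so inherited keys such as "toString" never count as configured.
  if (Object.prototype.hasOwnProperty.call(connections, connectionId)) {
    return connectionId;
  }
  // The test expects "events_db, sales_db" for a map declared as sales_db, events_db,
  // so the ids are sorted before being listed.
  const ids = Object.keys(connections).sort();
  const listed = ids.length > 0 ? ids.join(', ') : '(none configured)';
  throw new Error(`Unknown connection "${connectionId}". Configured connections: ${listed}.`);
}
```

Listing the configured ids in the error gives the caller something to correct a typo against, which is the same idea the zero-match `enabled_tables` error uses later in this change.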
|
||||
36
packages/cli/test/context/connections/query-deadline.test.ts
Normal file
|
|
@@ -0,0 +1,36 @@
|
|||
import { describe, expect, it } from 'vitest';
|
||||
import { KtxQueryError } from '../../../src/errors.js';
|
||||
import {
|
||||
DEFAULT_QUERY_TIMEOUT_MS,
|
||||
queryDeadlineExceededError,
|
||||
resolveQueryDeadlineMs,
|
||||
} from '../../../src/context/connections/query-deadline.js';
|
||||
|
||||
describe('resolveQueryDeadlineMs', () => {
|
||||
it('returns the 30s default when no override is set', () => {
|
||||
expect(DEFAULT_QUERY_TIMEOUT_MS).toBe(30_000);
|
||||
expect(resolveQueryDeadlineMs(undefined)).toBe(30_000);
|
||||
expect(resolveQueryDeadlineMs({ driver: 'sqlite' })).toBe(30_000);
|
||||
});
|
||||
|
||||
it('honors a positive-integer query_timeout_ms override', () => {
|
||||
expect(resolveQueryDeadlineMs({ query_timeout_ms: 5_000 })).toBe(5_000);
|
||||
expect(resolveQueryDeadlineMs({ query_timeout_ms: 1 })).toBe(1);
|
||||
});
|
||||
|
||||
it('rejects a zero, negative, or non-integer override', () => {
|
||||
expect(() => resolveQueryDeadlineMs({ query_timeout_ms: 0 })).toThrow(/positive integer/);
|
||||
expect(() => resolveQueryDeadlineMs({ query_timeout_ms: -5 })).toThrow(/positive integer/);
|
||||
expect(() => resolveQueryDeadlineMs({ query_timeout_ms: 1.5 })).toThrow(/positive integer/);
|
||||
expect(() => resolveQueryDeadlineMs({ query_timeout_ms: '5000' as unknown as number })).toThrow(/positive integer/);
|
||||
});
|
||||
});
|
||||
|
||||
describe('queryDeadlineExceededError', () => {
|
||||
it('is a KtxQueryError with the canonical seconds-rounded message', () => {
|
||||
const error = queryDeadlineExceededError(30_000);
|
||||
expect(error).toBeInstanceOf(KtxQueryError);
|
||||
expect(error.message).toBe('query exceeded 30s');
|
||||
expect(queryDeadlineExceededError(45_000).message).toBe('query exceeded 45s');
|
||||
});
|
||||
});
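Taken together, these tests describe a small resolver: a 30 s default, a positive-integer `query_timeout_ms` override, and a canonical `query exceeded Ns` error. The sketch below is one way to satisfy them. `QueryError` is a hypothetical stand-in for the project's `KtxQueryError`, and `Math.round` is an assumption, since the tests only use whole-second values and do not reveal the rounding mode.

```typescript
// Hypothetical stand-in for KtxQueryError from src/errors.js (not shown in this diff).
class QueryError extends Error {
  constructor(message: string) {
    super(message);
    this.name = 'QueryError';
  }
}

const DEFAULT_QUERY_TIMEOUT_MS = 30_000;

// Resolve the per-connection deadline: default when unset, otherwise a validated override.
function resolveQueryDeadlineMs(connection?: { query_timeout_ms?: unknown }): number {
  const override = connection?.query_timeout_ms;
  if (override === undefined) return DEFAULT_QUERY_TIMEOUT_MS;
  // Rejects 0, negatives, fractions, and non-numbers such as the string '5000'.
  if (typeof override !== 'number' || !Number.isInteger(override) || override <= 0) {
    throw new Error(`query_timeout_ms must be a positive integer, got ${String(override)}`);
  }
  return override;
}

// Build the canonical deadline error; rounding to whole seconds is assumed here.
function queryDeadlineExceededError(deadlineMs: number): QueryError {
  return new QueryError(`query exceeded ${Math.round(deadlineMs / 1000)}s`);
}
```

Keeping the message construction in one function is what lets the SQLite and SQL Server connector tests in this change assert the same `query exceeded …s` text regardless of how each driver enforces the deadline.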
|
||||
|
|
@@ -91,6 +91,7 @@ function llm(decisions: Array<{ role: string; exclude: boolean; reason: string }
 generateText: vi.fn(),
 generateObject,
 runAgentLoop: vi.fn(),
+subprocessForkSpec: () => null,
 };
 }
|
||||
|
||||
|
|
|
|||
|
|
@@ -130,6 +130,39 @@ describe('createDaemonLiveDatabaseIntrospection', () => {
|
|||
});
|
||||
});
|
||||
|
||||
it('maps daemon warnings into the snapshot and drops codes Node cannot render', async () => {
|
||||
const runJson = vi.fn(async () => ({
|
||||
...daemonResponse,
|
||||
tables: [],
|
||||
warnings: [
|
||||
{
|
||||
code: 'object_introspection_failed',
|
||||
message: 'permission denied for relation locked',
|
||||
table: 'locked',
|
||||
recoverable: true,
|
||||
metadata: { object: 'public.locked' },
|
||||
},
|
||||
{ code: 'totally_unknown_code', message: 'ignored', recoverable: true },
|
||||
],
|
||||
}));
|
||||
const introspection = createDaemonLiveDatabaseIntrospection({
|
||||
connections: { warehouse: { driver: 'postgres', url: 'postgres://localhost:5432/warehouse' } },
|
||||
schemas: ['public'],
|
||||
runJson,
|
||||
});
|
||||
|
||||
const snapshot = await introspection.extractSchema('warehouse');
|
||||
expect(snapshot.warnings).toEqual([
|
||||
{
|
||||
code: 'object_introspection_failed',
|
||||
message: 'permission denied for relation locked',
|
||||
table: 'locked',
|
||||
recoverable: true,
|
||||
metadata: { object: 'public.locked' },
|
||||
},
|
||||
]);
|
||||
});
|
||||
|
||||
it('calls a running daemon HTTP endpoint when baseUrl is configured', async () => {
|
||||
const requests: Array<{ url: string | undefined; body: unknown }> = [];
|
||||
const server = createServer((request, response) => {
|
||||
|
|
|
|||
|
|
@@ -1,9 +1,14 @@
-import { mkdtemp, readdir, rm } from 'node:fs/promises';
+import Database from 'better-sqlite3';
+import { mkdtemp, readdir, readFile, rm } from 'node:fs/promises';
 import { tmpdir } from 'node:os';
 import { join } from 'node:path';
-import { describe, expect, it, vi } from 'vitest';
+import { afterEach, beforeEach, describe, expect, it, vi } from 'vitest';
 import { tableRefSet, type KtxTableRefKey } from '../../../../../src/context/scan/table-ref.js';
 import { LiveDatabaseSourceAdapter } from '../../../../../src/context/ingest/adapters/live-database/live-database.adapter.js';
+import { createSqliteLiveDatabaseIntrospection } from '../../../../../src/connectors/sqlite/live-database-introspection.js';
+import { resolveEnabledTables } from '../../../../../src/context/scan/enabled-tables.js';
+import { KtxExpectedError } from '../../../../../src/errors.js';
+import type { FetchContext } from '../../../../../src/context/ingest/types.js';

 describe('LiveDatabaseSourceAdapter', () => {
 it('fetches a schema snapshot through the introspection port', async () => {
|
||||
|
|
@@ -109,3 +114,106 @@ describe('LiveDatabaseSourceAdapter', () => {
|
|||
}
|
||||
});
|
||||
});
|
||||
|
||||
describe('LiveDatabaseSourceAdapter (sqlite) tolerant scan', () => {
|
||||
const CONNECTION_ID = 'warehouse';
|
||||
let tempDir: string;
|
||||
|
||||
beforeEach(async () => {
|
||||
tempDir = await mkdtemp(join(tmpdir(), 'ktx-live-db-tolerant-'));
|
||||
});
|
||||
|
||||
afterEach(async () => {
|
||||
await rm(tempDir, { recursive: true, force: true });
|
||||
});
|
||||
|
||||
function adapterFor(dbPath: string): LiveDatabaseSourceAdapter {
|
||||
return new LiveDatabaseSourceAdapter({
|
||||
introspection: createSqliteLiveDatabaseIntrospection({
|
||||
projectDir: tempDir,
|
||||
connections: { [CONNECTION_ID]: { driver: 'sqlite', path: dbPath } },
|
||||
}),
|
||||
});
|
||||
}
|
||||
|
||||
function ctx(overrides: Partial<FetchContext> = {}): FetchContext {
|
||||
return { connectionId: CONNECTION_ID, sourceKey: 'live-database', ...overrides };
|
||||
}
|
||||
|
||||
it('ingests healthy objects and reports the broken view as a skip', async () => {
|
||||
const dbPath = join(tempDir, 'partial.db');
|
||||
const db = new Database(dbPath);
|
||||
db.exec(`
|
||||
CREATE TABLE base (id INTEGER PRIMARY KEY, start_date TEXT);
|
||||
CREATE VIEW emp_hire_periods_with_name AS SELECT id, start_date FROM base;
|
||||
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
|
||||
DROP TABLE base;
|
||||
`);
|
||||
db.close();
|
||||
|
||||
const adapter = adapterFor(dbPath);
|
||||
const stagedDir = join(tempDir, 'staged-partial');
|
||||
await adapter.fetch(undefined, stagedDir, ctx());
|
||||
|
||||
await expect(adapter.detect(stagedDir)).resolves.toBe(true);
|
||||
|
||||
const warnings = JSON.parse(await readFile(join(stagedDir, 'warnings.json'), 'utf8')) as {
|
||||
warnings: Array<{ code: string; table?: string }>;
|
||||
};
|
||||
expect(warnings.warnings).toHaveLength(1);
|
||||
expect(warnings.warnings[0]).toMatchObject({
|
||||
code: 'object_introspection_failed',
|
||||
table: 'emp_hire_periods_with_name',
|
||||
});
|
||||
|
||||
const report = await adapter.readFetchReport(stagedDir);
|
||||
expect(report?.skipped.map((issue) => issue.entityId)).toEqual(['emp_hire_periods_with_name']);
|
||||
});
|
||||
|
||||
it('raises a clear connection error when every object fails introspection', async () => {
|
||||
const dbPath = join(tempDir, 'all-broken.db');
|
||||
const db = new Database(dbPath);
|
||||
db.exec(`
|
||||
CREATE TABLE base (id INTEGER PRIMARY KEY, value TEXT);
|
||||
CREATE VIEW only_view AS SELECT id, value FROM base;
|
||||
DROP TABLE base;
|
||||
`);
|
||||
db.close();
|
||||
|
||||
const adapter = adapterFor(dbPath);
|
||||
await expect(adapter.fetch(undefined, join(tempDir, 'staged-all-broken'), ctx())).rejects.toThrow(KtxExpectedError);
|
||||
});
|
||||
|
||||
it('treats a genuinely empty database as a recognized, empty success', async () => {
|
||||
const dbPath = join(tempDir, 'empty.db');
|
||||
new Database(dbPath).close();
|
||||
|
||||
const adapter = adapterFor(dbPath);
|
||||
const stagedDir = join(tempDir, 'staged-empty');
|
||||
await adapter.fetch(undefined, stagedDir, ctx());
|
||||
await expect(adapter.detect(stagedDir)).resolves.toBe(true);
|
||||
await expect(adapter.readFetchReport(stagedDir)).resolves.toBeNull();
|
||||
});
|
||||
|
||||
it('ingests exactly the enabled_tables subset and fails clearly on a zero-match scope', async () => {
|
||||
const dbPath = join(tempDir, 'scoped.db');
|
||||
const db = new Database(dbPath);
|
||||
db.exec(`
|
||||
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
|
||||
CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER);
|
||||
`);
|
||||
db.close();
|
||||
const adapter = adapterFor(dbPath);
|
||||
|
||||
const scope = resolveEnabledTables({ driver: 'sqlite', enabled_tables: ['main.customers'] }) ?? undefined;
|
||||
const stagedDir = join(tempDir, 'staged-scoped');
|
||||
await adapter.fetch(undefined, stagedDir, ctx({ tableScope: scope }));
|
||||
const meta = JSON.parse(await readFile(join(stagedDir, 'connection.json'), 'utf8')) as { tableCount: number };
|
||||
expect(meta.tableCount).toBe(1);
|
||||
|
||||
const typoScope = resolveEnabledTables({ driver: 'sqlite', enabled_tables: ['nope'] }) ?? undefined;
|
||||
await expect(
|
||||
adapter.fetch(undefined, join(tempDir, 'staged-zero'), ctx({ tableScope: typoScope })),
|
||||
).rejects.toThrow(/matched no objects.*Available objects: customers, orders/s);
|
||||
});
|
||||
});
|
||||
|
|
|
|||
|
|
@@ -14,7 +14,7 @@ describe('buildLiveDatabaseManifestShards', () => {
 it('builds shard objects with generated joins and preserved external descriptions', () => {
 const existingDescriptions = new Map<string, LiveDatabaseManifestExistingDescriptions>([
 [
-'orders',
+'public.orders',
 {
 table: { user: 'Pinned analyst description', db: 'Old db description' },
 columns: new Map([['id', { user: 'Pinned id description', db: 'Old id description' }]]),
|
||||
|
|
@@ -189,7 +189,7 @@ describe('buildLiveDatabaseManifestShards', () => {
 it('preserves external usage keys while replacing historic SQL managed keys', () => {
 const existingUsage = new Map([
 [
-'orders',
+'public.orders',
 {
 narrative: 'Old generated usage narrative.',
 frequencyTier: 'low' as const,
|
||||
|
|
|
|||
|
|
@@ -0,0 +1,65 @@
|
|||
import { describe, expect, it } from 'vitest';
|
||||
import { assertLiveDatabaseScanOutcome } from '../../../../../src/context/ingest/adapters/live-database/scan-outcome.js';
|
||||
import { tableRefSet } from '../../../../../src/context/scan/table-ref.js';
|
||||
import type { KtxSchemaSnapshot, KtxSchemaTable } from '../../../../../src/context/scan/types.js';
|
||||
|
||||
function table(name: string): KtxSchemaTable {
|
||||
return { catalog: null, db: null, name, kind: 'table', comment: null, estimatedRows: 0, columns: [], foreignKeys: [] };
|
||||
}
|
||||
|
||||
function snapshot(overrides: Partial<KtxSchemaSnapshot>): KtxSchemaSnapshot {
|
||||
return {
|
||||
connectionId: 'warehouse',
|
||||
driver: 'sqlite',
|
||||
extractedAt: '2026-06-14T00:00:00.000Z',
|
||||
scope: {},
|
||||
metadata: {},
|
||||
tables: [],
|
||||
...overrides,
|
||||
};
|
||||
}
|
||||
|
||||
describe('assertLiveDatabaseScanOutcome', () => {
|
||||
it('passes when at least one object was ingested, even with skips', () => {
|
||||
expect(() =>
|
||||
assertLiveDatabaseScanOutcome({
|
||||
connectionId: 'warehouse',
|
||||
scope: undefined,
|
||||
snapshot: snapshot({
|
||||
tables: [table('customers')],
|
||||
warnings: [{ code: 'object_introspection_failed', message: 'boom', table: 'broken', recoverable: true }],
|
||||
}),
|
||||
}),
|
||||
).not.toThrow();
|
||||
});
|
||||
|
||||
it('passes for a legitimately empty database (no scope, no objects)', () => {
|
||||
expect(() =>
|
||||
assertLiveDatabaseScanOutcome({ connectionId: 'warehouse', scope: undefined, snapshot: snapshot({}) }),
|
||||
).not.toThrow();
|
||||
});
|
||||
|
||||
it('fails clearly when every introspected object failed', () => {
|
||||
expect(() =>
|
||||
assertLiveDatabaseScanOutcome({
|
||||
connectionId: 'warehouse',
|
||||
scope: undefined,
|
||||
snapshot: snapshot({
|
||||
warnings: [
|
||||
{ code: 'object_introspection_failed', message: 'no such table: base', table: 'only_view', recoverable: true },
|
||||
],
|
||||
}),
|
||||
}),
|
||||
).toThrow(/all 1 introspected object failed.*only_view: no such table: base/s);
|
||||
});
|
||||
|
||||
it('fails clearly when a non-empty enabled_tables scope matched nothing, naming available objects', () => {
|
||||
expect(() =>
|
||||
assertLiveDatabaseScanOutcome({
|
||||
connectionId: 'warehouse',
|
||||
scope: tableRefSet([{ catalog: null, db: null, name: 'typo_table' }]),
|
||||
snapshot: snapshot({ metadata: { discovered_object_names: ['customers', 'orders'] } }),
|
||||
}),
|
||||
).toThrow(/matched no objects.*typo_table.*Available objects: customers, orders/s);
|
||||
});
|
||||
});
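The four cases above amount to a decision table for an empty scan: any ingested object passes, a truly empty database passes, a non-empty scope that matched nothing fails with the available object names, and an all-failed introspection fails with the per-object reasons. The sketch below is a hedged reconstruction of that table. The types are simplified stand-ins (`scope` is modelled as a plain string array rather than the project's table-ref set), the exact message wording beyond the fragments the tests match is invented, and checking the scope before the warnings is an assumption.

```typescript
// Simplified, hypothetical shapes; the real KtxSchemaSnapshot carries more fields.
interface ScanWarning { code: string; message: string; table?: string }
interface ScanSnapshot {
  tables: Array<{ name: string }>;
  warnings?: ScanWarning[];
  metadata: { discovered_object_names?: string[] };
}

function assertScanOutcome(input: { connectionId: string; scope?: string[]; snapshot: ScanSnapshot }): void {
  const { connectionId, scope, snapshot } = input;
  // Partial success is success: skipped objects are reported as warnings elsewhere.
  if (snapshot.tables.length > 0) return;

  // A configured scope that selected nothing is most likely a typo; name what exists.
  if (scope !== undefined && scope.length > 0) {
    const available = snapshot.metadata.discovered_object_names ?? [];
    throw new Error(
      `enabled_tables for "${connectionId}" matched no objects (${scope.join(', ')}). ` +
        `Available objects: ${available.join(', ')}`,
    );
  }

  // No scope and no tables: fail only if objects existed but every one failed.
  const failures = (snapshot.warnings ?? []).filter((w) => w.code === 'object_introspection_failed');
  if (failures.length > 0) {
    const noun = failures.length === 1 ? 'object' : 'objects';
    const details = failures.map((w) => `${w.table ?? '(unknown)'}: ${w.message}`).join('; ');
    throw new Error(`all ${failures.length} introspected ${noun} failed for "${connectionId}". ${details}`);
  }
  // Otherwise the database is legitimately empty, which is not an error.
}
```

Separating "empty because nothing exists" from "empty because everything failed" is what lets the adapter tests earlier in this change treat `empty.db` as a success while `all-broken.db` raises an error.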
|
||||
|
|
@@ -91,6 +91,7 @@ describe('createLocalBundleIngestRuntime', () => {
 generateText: vi.fn(),
 generateObject: vi.fn(),
 runAgentLoop: vi.fn(async () => ({ stopReason: 'natural' as const })),
+subprocessForkSpec: vi.fn(() => null),
 };
 project.config.llm = {
 provider: { backend: 'claude-code' },
|
||||
|
|
|
|||
|
|
@@ -137,16 +137,19 @@ describe('local ktx LLM config', () => {
 generateText: vi.fn(),
 generateObject: vi.fn(),
 runAgentLoop: vi.fn(),
+subprocessForkSpec: vi.fn(() => null),
 }));
 const createCodexRuntime = vi.fn(() => ({
 generateText: vi.fn(),
 generateObject: vi.fn(),
 runAgentLoop: vi.fn(),
+subprocessForkSpec: vi.fn(() => null),
 }));
 const createAiSdkRuntime = vi.fn(() => ({
 generateText: vi.fn(),
 generateObject: vi.fn(),
 runAgentLoop: vi.fn(),
+subprocessForkSpec: vi.fn(() => null),
 }));
 const createKtxLlmProvider = vi.fn(() => ({
 getModel: vi.fn(),
|
||||
|
|
|
|||
138
packages/cli/test/context/llm/subprocess-generate-object.test.ts
Normal file
|
|
@@ -0,0 +1,138 @@
|
|||
import { type ChildProcess } from 'node:child_process';
|
||||
import { mkdtempSync, readFileSync, rmSync } from 'node:fs';
|
||||
import { tmpdir } from 'node:os';
|
||||
import { join } from 'node:path';
|
||||
import { afterEach, beforeEach, describe, expect, it, vi } from 'vitest';
|
||||
import { z } from 'zod';
|
||||
import { isAbortError } from '../../../src/context/core/abort.js';
|
||||
import {
|
||||
KtxSubprocessDeadlineError,
|
||||
runGenerateObjectInSubprocess,
|
||||
} from '../../../src/context/llm/subprocess-generate-object.js';
|
||||
import type { SubprocessRuntimeForkSpec } from '../../../src/context/llm/runtime-port.js';
|
||||
import { HANGING_CHILD, killTestChildren, RESPONDING_CHILD, spawnTestChild } from './subprocess-test-children.test-utils.js';
|
||||
|
||||
const FORK_SPEC: SubprocessRuntimeForkSpec = { backend: 'codex', projectDir: '/tmp', modelSlots: { default: 'codex' } };
|
||||
|
||||
function isAlive(pid: number): boolean {
|
||||
try {
|
||||
process.kill(pid, 0);
|
||||
return true;
|
||||
} catch {
|
||||
return false;
|
||||
}
|
||||
}
|
||||
|
||||
describe('runGenerateObjectInSubprocess', () => {
|
||||
let children: ChildProcess[];
|
||||
let workDir: string;
|
||||
|
||||
function forkFake(code: string, env: Record<string, string> = {}): () => ChildProcess {
|
||||
return () => spawnTestChild(children, code, env);
|
||||
}
|
||||
|
||||
beforeEach(() => {
|
||||
children = [];
|
||||
workDir = mkdtempSync(join(tmpdir(), 'ktx-subproc-'));
|
||||
});
|
||||
|
||||
afterEach(() => {
|
||||
killTestChildren(children);
|
||||
rmSync(workDir, { recursive: true, force: true });
|
||||
});
|
||||
|
||||
it('tree-kills a wedged child at the deadline and reaps its grandchild', async () => {
|
||||
const pidFile = join(workDir, 'gc.pid');
|
||||
const start = Date.now();
|
||||
const pending = runGenerateObjectInSubprocess({
|
||||
forkSpec: FORK_SPEC,
|
||||
role: 'candidateExtraction',
|
||||
prompt: 'x',
|
||||
schema: z.object({ answer: z.string() }),
|
||||
jsonSchema: { type: 'object' },
|
||||
deadlineMs: 300,
|
||||
spawnChild: forkFake(HANGING_CHILD, { KTX_TEST_GC_PID_FILE: pidFile }),
|
||||
});
|
||||
|
||||
await expect(pending).rejects.toBeInstanceOf(KtxSubprocessDeadlineError);
|
||||
// Settled within the deadline plus a small grace, not left wedged.
|
||||
expect(Date.now() - start).toBeLessThan(3000);
|
||||
|
||||
const child = children[0]!;
|
||||
await vi.waitFor(() => expect(child.exitCode !== null || child.signalCode !== null).toBe(true), { timeout: 5000 });
|
||||
expect(child.signalCode).toBe('SIGKILL');
|
||||
|
||||
const grandchildPid = Number(readFileSync(pidFile, 'utf8'));
|
||||
expect(Number.isInteger(grandchildPid)).toBe(true);
|
||||
await vi.waitFor(() => expect(isAlive(grandchildPid)).toBe(false), { timeout: 5000 });
|
||||
});
|
||||
|
||||
it('tree-kills the same way on an external abort', async () => {
|
||||
const pidFile = join(workDir, 'gc.pid');
|
||||
const controller = new AbortController();
|
||||
const pending = runGenerateObjectInSubprocess({
|
||||
forkSpec: FORK_SPEC,
|
||||
role: 'candidateExtraction',
|
||||
prompt: 'x',
|
||||
schema: z.object({ answer: z.string() }),
|
||||
jsonSchema: { type: 'object' },
|
||||
deadlineMs: 60_000,
|
||||
signal: controller.signal,
|
||||
spawnChild: forkFake(HANGING_CHILD, { KTX_TEST_GC_PID_FILE: pidFile }),
|
||||
});
|
||||
void pending.catch(() => undefined);
|
||||
|
||||
await vi.waitFor(() => expect(() => readFileSync(pidFile, 'utf8')).not.toThrow(), { timeout: 5000 });
|
||||
controller.abort();
|
||||
|
||||
await expect(pending).rejects.toSatisfy(isAbortError);
|
||||
const child = children[0]!;
|
||||
await vi.waitFor(() => expect(child.exitCode !== null || child.signalCode !== null).toBe(true), { timeout: 5000 });
|
||||
const grandchildPid = Number(readFileSync(pidFile, 'utf8'));
|
||||
await vi.waitFor(() => expect(isAlive(grandchildPid)).toBe(false), { timeout: 5000 });
|
||||
});
|
||||
|
||||
it('resolves with the schema-validated output on success', async () => {
|
||||
await expect(
|
||||
runGenerateObjectInSubprocess({
|
||||
forkSpec: FORK_SPEC,
|
||||
role: 'candidateExtraction',
|
||||
prompt: 'x',
|
||||
schema: z.object({ answer: z.string() }),
|
||||
jsonSchema: { type: 'object' },
|
||||
deadlineMs: 5_000,
|
||||
spawnChild: forkFake(RESPONDING_CHILD),
|
||||
}),
|
||||
).resolves.toEqual({ answer: 'yes' });
|
||||
});
|
||||
|
||||
it('rejects when the child output fails schema validation', async () => {
|
||||
await expect(
|
||||
runGenerateObjectInSubprocess({
|
||||
forkSpec: FORK_SPEC,
|
||||
role: 'candidateExtraction',
|
||||
prompt: 'x',
|
||||
schema: z.object({ answer: z.string() }),
|
||||
jsonSchema: { type: 'object' },
|
||||
deadlineMs: 5_000,
|
||||
spawnChild: forkFake(RESPONDING_CHILD, { KTX_TEST_RESPONSE: '{"ok":true,"output":{"wrong":1}}' }),
|
||||
}),
|
||||
).rejects.toThrow();
|
||||
});
|
||||
|
||||
it('rejects with the child error message when the child reports failure', async () => {
|
||||
await expect(
|
||||
runGenerateObjectInSubprocess({
|
||||
forkSpec: FORK_SPEC,
|
||||
role: 'candidateExtraction',
|
||||
prompt: 'x',
|
||||
schema: z.object({ answer: z.string() }),
|
||||
jsonSchema: { type: 'object' },
|
||||
deadlineMs: 5_000,
|
||||
spawnChild: forkFake(RESPONDING_CHILD, {
|
||||
KTX_TEST_RESPONSE: '{"ok":false,"message":"backend overloaded"}',
|
||||
}),
|
||||
}),
|
||||
).rejects.toThrow('backend overloaded');
|
||||
});
|
||||
});
|
||||
|
|
@@ -0,0 +1,45 @@
|
|||
import { spawn, type ChildProcess } from 'node:child_process';

// A wedged subprocess-backed call: the child ignores SIGTERM (as a child hung on a
// provider socket does), spawns a grandchild (the SDK's model binary stand-in) that
// also ignores SIGTERM, and never replies. Only a SIGKILL of the whole process group
// reaps it.
export const HANGING_CHILD = `
  process.on('SIGTERM', () => {});
  const { spawn } = require('node:child_process');
  const { writeFileSync } = require('node:fs');
  process.on('message', () => {
    const gc = spawn(process.execPath, ['-e', 'process.on("SIGTERM",()=>{});setInterval(()=>{},1000000)'], { stdio: 'ignore' });
    gc.unref();
    if (process.env.KTX_TEST_GC_PID_FILE) writeFileSync(process.env.KTX_TEST_GC_PID_FILE, String(gc.pid));
  });
`;

export const RESPONDING_CHILD = `
  process.on('message', () => {
    const raw = process.env.KTX_TEST_RESPONSE || '{"ok":true,"output":{"answer":"yes"}}';
    process.send(JSON.parse(raw), () => process.exit(0));
  });
`;

export function spawnTestChild(registry: ChildProcess[], code: string, env: Record<string, string> = {}): ChildProcess {
  const child = spawn(process.execPath, ['-e', code], {
    detached: true,
    stdio: ['ignore', 'ignore', 'inherit', 'ipc'],
    env: { ...process.env, ...env },
  });
  registry.push(child);
  return child;
}

export function killTestChildren(registry: ChildProcess[]): void {
  for (const child of registry) {
    if (child.pid !== undefined && child.exitCode === null && child.signalCode === null) {
      try {
        process.kill(-child.pid, 'SIGKILL');
      } catch {
        // Already exited.
      }
    }
  }
}
|
|
@ -63,7 +63,7 @@
|
|||
{
|
||||
"name": "wiki_search",
|
||||
"title": "Wiki Search",
|
||||
"description": "Search ktx wiki pages for reusable business context. Example: wiki_search({ query: \"revenue recognition\", limit: 5 }).",
|
||||
"description": "Search ktx wiki pages for reusable business context. Pass connectionId to scope results to one warehouse (unscoped pages plus pages tagged with that connection) when a concept name collides across databases. Example: wiki_search({ query: \"revenue recognition\", connectionId: \"warehouse\", limit: 5 }).",
|
||||
"inputSchema": {
|
||||
"type": "object",
|
||||
"properties": {
|
||||
|
|
@ -78,6 +78,11 @@
|
|||
"type": "integer",
|
||||
"minimum": 1,
|
||||
"maximum": 50
|
||||
},
|
||||
"connectionId": {
|
||||
"description": "Scope results to one connection: returns unscoped pages plus pages tagged with this connection. Omit to search all pages.",
|
||||
"type": "string",
|
||||
"minLength": 1
|
||||
}
|
||||
},
|
||||
"required": [
|
||||
|
|
@ -1478,6 +1483,55 @@
|
|||
"taskSupport": "forbidden"
|
||||
}
|
||||
},
|
||||
    {
      "name": "sql_dialect_notes",
      "title": "SQL Dialect Notes",
      "description": "Return the SQL syntax conventions for the dialect of a ktx connection: fully-qualified table-name form, identifier quoting and case-folding, date/time functions, top-N / window-filtering idiom, and JSON access. Call this before writing raw sql_execution SQL against a connection so the SQL matches that engine. Example: sql_dialect_notes({ connectionId: \"warehouse\" }).",
      "inputSchema": {
        "type": "object",
        "properties": {
          "connectionId": {
            "type": "string",
            "minLength": 1,
            "description": "Connection id whose engine dialect conventions to return."
          }
        },
        "required": [
          "connectionId"
        ],
        "$schema": "http://json-schema.org/draft-07/schema#"
      },
      "outputSchema": {
        "type": "object",
        "properties": {
          "connectionId": {
            "type": "string"
          },
          "dialect": {
            "type": "string"
          },
          "notes": {
            "type": "string"
          }
        },
        "required": [
          "connectionId",
          "dialect",
          "notes"
        ],
        "$schema": "http://json-schema.org/draft-07/schema#",
        "additionalProperties": false
      },
      "annotations": {
        "title": "SQL Dialect Notes",
        "readOnlyHint": true,
        "idempotentHint": true,
        "openWorldHint": false
      },
      "execution": {
        "taskSupport": "forbidden"
      }
    },
{
|
||||
"name": "memory_ingest",
|
||||
"title": "Memory Ingest",
|
||||
|
|
|
|||
111
packages/cli/test/context/mcp/dialect-notes.test.ts
Normal file
|
|
@ -0,0 +1,111 @@
|
|||
import { readdirSync } from 'node:fs';
import { fileURLToPath } from 'node:url';
import { describe, expect, it } from 'vitest';
import { KtxExpectedError } from '../../../src/errors.js';
import { KTX_DATABASE_DRIVER_IDS } from '../../../src/connection-drivers.js';
import type { KtxProjectConnectionConfig } from '../../../src/context/project/config.js';
import { sqlAnalysisDialectForDriver } from '../../../src/context/sql-analysis/dialect.js';
import { DIALECTS_WITH_NOTES, sqlDialectNotes } from '../../../src/context/sql-analysis/dialect-notes.js';
import { resolveDialectNotesForConnection } from '../../../src/context/mcp/local-project-ports.js';

function conn(driver: string): KtxProjectConnectionConfig {
  return { driver } as KtxProjectConnectionConfig;
}

describe('per-dialect SQL notes', () => {
  it('covers every dialect reachable from a configured warehouse driver', () => {
    // Derived from the connector registry, not a hand-maintained list: a new
    // warehouse driver whose resolved dialect lacks authored notes fails here.
    for (const driver of KTX_DATABASE_DRIVER_IDS) {
      const dialect = sqlAnalysisDialectForDriver(driver);
      expect(DIALECTS_WITH_NOTES, `driver "${driver}" resolves to dialect "${dialect}"`).toContain(dialect);
      expect(sqlDialectNotes(dialect).length).toBeGreaterThan(0);
    }
  });

  it('keeps the authored-dialect list and the ./dialects markdown files in sync', () => {
    const dir = fileURLToPath(new URL('../../../src/context/sql-analysis/dialects/', import.meta.url));
    const files = readdirSync(dir)
      .filter((name) => name.endsWith('.md'))
      .map((name) => name.replace(/\.md$/, ''))
      .sort();
    expect(files).toEqual([...DIALECTS_WITH_NOTES].sort());
  });

  it('does not author notes for unreachable dialects', () => {
    // duckdb/databricks appear in the resolver map but no connector produces them.
    expect(DIALECTS_WITH_NOTES).not.toContain('duckdb');
    expect(DIALECTS_WITH_NOTES).not.toContain('databricks');
  });

  it('answers the full rubric for every dialect', () => {
    for (const dialect of DIALECTS_WITH_NOTES) {
      const notes = sqlDialectNotes(dialect);
      expect(notes, `${dialect}: FQTN`).toContain('**FQTN:**');
      expect(notes, `${dialect}: identifiers`).toContain('**Identifiers:**');
      expect(notes, `${dialect}: date/time`).toContain('**Date/time:**');
      expect(notes, `${dialect}: top-N`).toMatch(/\*\*Top-N/);
      expect(notes, `${dialect}: series`).toMatch(/\*\*Series/);
      expect(notes, `${dialect}: rolling window`).toMatch(/\*\*Rolling/);
      expect(notes, `${dialect}: safe cast`).toMatch(/\*\*Safe cast/);
      expect(notes, `${dialect}: semi-structured`).toMatch(/\*\*(JSON|Semi-structured)/);
    }
  });

  it('gives each engine its own idioms and never leaks another engine-only construct', () => {
    // A sqlite analyst gets sqlite date idioms and never Snowflake/BigQuery-only syntax.
    expect(sqlDialectNotes('sqlite')).toMatch(/strftime|julianday/);
    expect(sqlDialectNotes('sqlite')).not.toContain('VARIANT');
    expect(sqlDialectNotes('sqlite')).not.toContain('_TABLE_SUFFIX');

    // QUALIFY appears only for the engines that actually support it.
    expect(sqlDialectNotes('snowflake')).toContain('QUALIFY');
    expect(sqlDialectNotes('bigquery')).toContain('QUALIFY');
    for (const dialect of ['postgres', 'mysql', 'sqlite', 'clickhouse', 'tsql'] as const) {
      expect(sqlDialectNotes(dialect), `${dialect} must not mention QUALIFY`).not.toContain('QUALIFY');
    }

    // Engine-exclusive markers stay in their own dialect.
    expect(sqlDialectNotes('snowflake')).toContain('VARIANT');
    expect(sqlDialectNotes('snowflake')).toContain('DATABASE.SCHEMA.TABLE');
    expect(sqlDialectNotes('bigquery')).toContain('_TABLE_SUFFIX');
    expect(sqlDialectNotes('clickhouse')).toContain('LIMIT n BY');
    expect(sqlDialectNotes('tsql')).toContain('TOP (n)');
  });

  it('contains no benchmark/grader or version-dated content', () => {
    for (const dialect of DIALECTS_WITH_NOTES) {
      const notes = sqlDialectNotes(dialect);
      expect(notes).not.toMatch(/\bspider\b|\bbenchmark\b|\bgold\b|\bgrader\b/i);
      expect(notes).not.toMatch(/\bas of v(ersion)?\b/i);
    }
  });

  it('falls back to postgres notes for a dialect without its own file', () => {
    expect(sqlAnalysisDialectForDriver('some-future-engine')).toBe('postgres');
    // redshift is a valid SqlAnalysisDialect but intentionally unauthored.
    expect(sqlDialectNotes('redshift')).toBe(sqlDialectNotes('postgres'));
  });
});

describe('resolveDialectNotesForConnection', () => {
  it('resolves a warehouse connection to its dialect notes', () => {
    expect(resolveDialectNotesForConnection('wh', conn('sqlite'))).toMatchObject({
      connectionId: 'wh',
      dialect: 'sqlite',
    });
    expect(resolveDialectNotesForConnection('wh', conn('snowflake')).dialect).toBe('snowflake');
    // The sqlserver driver resolves to the tsql dialect (resolver codomain).
    expect(resolveDialectNotesForConnection('wh', conn('sqlserver')).dialect).toBe('tsql');
  });

  it('rejects a non-SQL context source with a clear expected error, not postgres notes', () => {
    expect(() => resolveDialectNotesForConnection('mb', conn('metabase'))).toThrow(KtxExpectedError);
    expect(() => resolveDialectNotesForConnection('mb', conn('metabase'))).toThrow(/not a SQL warehouse/);
  });

  it('rejects an unconfigured connection', () => {
    expect(() => resolveDialectNotesForConnection('missing', undefined)).toThrow(KtxExpectedError);
    expect(() => resolveDialectNotesForConnection('missing', undefined)).toThrow(/not configured/);
  });
});
|
|
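The fallback behaviour exercised by dialect-notes.test.ts can be sketched roughly as below. This is a minimal illustration, not the ktx source: `dialectForDriver`, `notesForDialect`, the driver table, and the note bodies are all hypothetical stand-ins. Only the `sqlserver` to `tsql` mapping and the postgres fallback are taken from the test expectations.

```typescript
// Hypothetical sketch: names, driver table, and note bodies are illustrative only.
type Dialect =
  | 'postgres'
  | 'mysql'
  | 'sqlite'
  | 'snowflake'
  | 'bigquery'
  | 'clickhouse'
  | 'tsql'
  | 'redshift';

// Assumed driver-to-dialect table. A Map avoids accidental hits on
// Object.prototype keys such as "constructor".
const DRIVER_TO_DIALECT = new Map<string, Dialect>([
  ['postgres', 'postgres'],
  ['mysql', 'mysql'],
  ['sqlite', 'sqlite'],
  ['snowflake', 'snowflake'],
  ['bigquery', 'bigquery'],
  ['clickhouse', 'clickhouse'],
  ['sqlserver', 'tsql'],
]);

// Unknown drivers resolve to postgres, mirroring the "some-future-engine" test.
function dialectForDriver(driver: string): Dialect {
  return DRIVER_TO_DIALECT.get(driver) ?? 'postgres';
}

// Placeholder note bodies; redshift is deliberately left unauthored.
const AUTHORED_NOTES: Partial<Record<Dialect, string>> = {
  postgres: '**FQTN:** schema.table',
  sqlite: '**Date/time:** strftime, julianday',
  tsql: '**Top-N:** TOP (n)',
};

// A dialect without authored notes reuses the postgres notes.
function notesForDialect(dialect: Dialect): string {
  return AUTHORED_NOTES[dialect] ?? AUTHORED_NOTES.postgres ?? '';
}
```

Under these assumptions, a dialect that is valid but unauthored (redshift here) returns exactly the same string as postgres, which is the identity the last test in the first `describe` block checks.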
@ -178,6 +178,7 @@ describe('createLocalProjectMcpContextPorts', () => {
|
|||
|
||||
expect(Object.keys(ports).sort()).toEqual([
|
||||
'connections',
|
||||
'dialectNotes',
|
||||
'dictionarySearch',
|
||||
'discover',
|
||||
'entityDetails',
|
||||
|
|
@ -187,6 +188,7 @@ describe('createLocalProjectMcpContextPorts', () => {
|
|||
expect(Object.keys(ports.connections ?? {}).sort()).toEqual(['list']);
|
||||
expect(Object.keys(ports.knowledge ?? {}).sort()).toEqual(['read', 'search']);
|
||||
expect(Object.keys(ports.semanticLayer ?? {}).sort()).toEqual(['query', 'readSource']);
|
||||
expect(Object.keys(ports.dialectNotes ?? {}).sort()).toEqual(['read']);
|
||||
await expect(ports.connections?.list()).resolves.toEqual([
|
||||
{ id: 'warehouse', name: 'warehouse', connectionType: 'POSTGRESQL' },
|
||||
]);
|
||||
|
|
@ -803,6 +805,47 @@ describe('createLocalProjectMcpContextPorts', () => {
|
|||
expect(search?.results[0]?.score).toBeGreaterThan(0);
|
||||
});
|
||||
|
||||
it('scopes wiki_search to a connection and validates the connection id', async () => {
|
||||
const project = await initKtxProject({ projectDir: tempDir });
|
||||
project.config.connections.sales_db = { driver: 'sqlite', url: 'file:sales.db' };
|
||||
project.config.connections.events_db = { driver: 'sqlite', url: 'file:events.db' };
|
||||
const seed = async (key: string, connections: string[]) => {
|
||||
await project.fileStore.writeFile(
|
||||
`wiki/global/${key}.md`,
|
||||
[
|
||||
'---',
|
||||
`summary: Orders for ${key}`,
|
||||
'usage_mode: auto',
|
||||
...(connections.length > 0 ? ['connections:', ...connections.map((id) => ` - ${id}`)] : []),
|
||||
'---',
|
||||
'',
|
||||
'Orders are recognized when paid.',
|
||||
'',
|
||||
].join('\n'),
|
||||
'ktx',
|
||||
'ktx@example.com',
|
||||
`seed ${key}`,
|
||||
);
|
||||
};
|
||||
await seed('orders-sales', ['sales_db']);
|
||||
await seed('orders-events', ['events_db']);
|
||||
await seed('orders-global', []);
|
||||
|
||||
const ports = createLocalProjectMcpContextPorts(project, { embeddingService: null });
|
||||
|
||||
const scoped = await ports.knowledge?.search({
|
||||
userId: 'local-user',
|
||||
query: 'orders paid',
|
||||
limit: 10,
|
||||
connectionId: 'sales_db',
|
||||
});
|
||||
expect(scoped?.results.map((result) => result.key).sort()).toEqual(['orders-global', 'orders-sales']);
|
||||
|
||||
await expect(
|
||||
ports.knowledge?.search({ userId: 'local-user', query: 'orders', limit: 10, connectionId: 'warehouse' }),
|
||||
).rejects.toThrow('Unknown connection "warehouse". Configured connections: events_db, sales_db.');
|
||||
});
|
||||
|
||||
it('reads seeded semantic-layer sources', async () => {
|
||||
const project = await initKtxProject({ projectDir: tempDir });
|
||||
await seedSlSourceFile(project, {
|
||||
|
|
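The scoping rule tested above (unscoped pages plus pages tagged with the requested connection, with the id validated first) might look roughly like the following. This is a sketch under assumptions: `WikiPage`, `assertKnownConnection`, and `scopePages` are hypothetical names, and only the error-message format and the expected result set come from the test.

```typescript
// Hypothetical sketch of connection-scoped wiki filtering; not the ktx implementation.
interface WikiPage {
  key: string;
  // Empty list means the page is unscoped and applies to every connection.
  connections: string[];
}

// Reject ids that are not configured, listing the known ids in sorted order
// (the message format is copied from the test expectation).
function assertKnownConnection(connectionId: string, configured: string[]): void {
  if (!configured.includes(connectionId)) {
    const known = [...configured].sort().join(', ');
    throw new Error(`Unknown connection "${connectionId}". Configured connections: ${known}.`);
  }
}

// Keep unscoped pages plus pages tagged with the requested connection.
// Omitting connectionId searches all pages.
function scopePages(pages: WikiPage[], connectionId?: string): WikiPage[] {
  if (connectionId === undefined) return pages;
  return pages.filter(
    (page) => page.connections.length === 0 || page.connections.includes(connectionId),
  );
}

const pages: WikiPage[] = [
  { key: 'orders-sales', connections: ['sales_db'] },
  { key: 'orders-events', connections: ['events_db'] },
  { key: 'orders-global', connections: [] },
];

const scopedKeys = scopePages(pages, 'sales_db')
  .map((page) => page.key)
  .sort();
```

Filtering before ranking, rather than after, keeps a page from another connection from consuming a slot in the `limit` that a relevant page should have had.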
|
|||
99
packages/cli/test/context/mcp/logger.test.ts
Normal file
|
|
@ -0,0 +1,99 @@
|
|||
import { afterEach, describe, expect, it, vi } from 'vitest';
import { createMcpLogger, mcpLogLevel, mcpSlowToolMs, serializeMcpError } from '../../../src/context/mcp/logger.js';

function capturingIo() {
  let buf = '';
  return {
    io: { stdout: { write() {} }, stderr: { write(chunk: string) { buf += chunk; } } },
    text: () => buf,
    json: () =>
      buf
        .split('\n')
        .filter((line) => line.trim().startsWith('{'))
        .map((line) => JSON.parse(line) as Record<string, unknown>),
  };
}

describe('mcpLogLevel', () => {
  it('defaults to info when unset', () => {
    expect(mcpLogLevel({})).toBe('info');
  });

  it('accepts a recognized pino level', () => {
    expect(mcpLogLevel({ KTX_MCP_LOG_LEVEL: 'debug' })).toBe('debug');
    expect(mcpLogLevel({ KTX_MCP_LOG_LEVEL: 'WARN' })).toBe('warn');
  });

  it('falls back to info for an unrecognized value', () => {
    expect(mcpLogLevel({ KTX_MCP_LOG_LEVEL: 'loud' })).toBe('info');
  });
});

describe('mcpSlowToolMs', () => {
  it('defaults to 10000ms', () => {
    expect(mcpSlowToolMs({})).toBe(10_000);
  });

  it('parses a numeric override', () => {
    expect(mcpSlowToolMs({ KTX_MCP_SLOW_TOOL_MS: '250' })).toBe(250);
  });

  it('ignores a non-numeric or negative value', () => {
    expect(mcpSlowToolMs({ KTX_MCP_SLOW_TOOL_MS: 'soon' })).toBe(10_000);
    expect(mcpSlowToolMs({ KTX_MCP_SLOW_TOOL_MS: '-5' })).toBe(10_000);
  });
});

describe('serializeMcpError', () => {
  it('serializes an Error with type, message, and stack', () => {
    const out = serializeMcpError(new TypeError('boom'));
    expect(out.type).toBe('TypeError');
    expect(out.message).toBe('boom');
    expect(typeof out.stack).toBe('string');
  });

  it('reduces a non-error to a message (no synthetic stack)', () => {
    expect(serializeMcpError('plain text')).toEqual({ message: 'plain text' });
  });
});

describe('createMcpLogger', () => {
  afterEach(() => {
    vi.unstubAllEnvs();
  });

  it('writes structured JSON lines through io.stderr when not a TTY', () => {
    const cap = capturingIo();
    const logger = createMcpLogger(cap.io, { isTTY: false });
    logger.info({ tool: 'sql_execution', callId: 'abc' }, 'tool.start');

    const [line] = cap.json();
    expect(line.msg).toBe('tool.start');
    expect(line.tool).toBe('sql_execution');
    expect(line.callId).toBe('abc');
    expect(typeof line.time).toBe('number');
    expect(line.level).toBe(30);
  });

  it('writes human-readable (non-JSON) output for a TTY', () => {
    const cap = capturingIo();
    const logger = createMcpLogger(cap.io, { isTTY: true });
    logger.info({ tool: 'sql_execution' }, 'tool.start');

    expect(cap.text()).toContain('tool.start');
    // pino-pretty output is not a JSON line.
    expect(cap.text().trim().startsWith('{')).toBe(false);
  });

  it('honors KTX_MCP_LOG_LEVEL by suppressing below-threshold lines', () => {
    vi.stubEnv('KTX_MCP_LOG_LEVEL', 'warn');
    const cap = capturingIo();
    const logger = createMcpLogger(cap.io, { isTTY: false });
    logger.info({}, 'routine');
    logger.warn({}, 'slow');

    const messages = cap.json().map((line) => line.msg);
    expect(messages).not.toContain('routine');
    expect(messages).toContain('slow');
  });
});
|
|
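The environment parsing that logger.test.ts pins down could be implemented along these lines. This is a hedged sketch, not the contents of `logger.ts`: `logLevelFromEnv` and `slowToolMsFromEnv` are hypothetical names, the accepted level list is assumed to be pino's standard levels, and accepting `0` as a threshold is inferred from a later test that stubs `KTX_MCP_SLOW_TOOL_MS` to `'0'`.

```typescript
// Hypothetical sketch of the env parsing the tests describe; names are illustrative.
type Env = Record<string, string | undefined>;

// Assumed to be the accepted set: pino's standard level names.
const LEVELS = ['trace', 'debug', 'info', 'warn', 'error', 'fatal', 'silent'] as const;
type Level = (typeof LEVELS)[number];

const DEFAULT_LEVEL: Level = 'info';
const DEFAULT_SLOW_TOOL_MS = 10_000;

// Case-insensitive match against the known levels; anything else becomes "info".
function logLevelFromEnv(env: Env): Level {
  const raw = (env.KTX_MCP_LOG_LEVEL ?? '').trim().toLowerCase();
  const match = LEVELS.find((level) => level === raw);
  return match ?? DEFAULT_LEVEL;
}

// Non-numeric or negative values are ignored in favour of the default.
function slowToolMsFromEnv(env: Env): number {
  const raw = env.KTX_MCP_SLOW_TOOL_MS;
  if (raw === undefined || raw.trim() === '') return DEFAULT_SLOW_TOOL_MS;
  const parsed = Number(raw);
  return Number.isFinite(parsed) && parsed >= 0 ? parsed : DEFAULT_SLOW_TOOL_MS;
}
```

Falling back silently instead of throwing means a typo in an environment variable cannot stop the MCP server from starting; the cost is that the typo goes unreported.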
@ -4,14 +4,17 @@ import { join } from 'node:path';
|
|||
import { Client } from '@modelcontextprotocol/sdk/client/index.js';
|
||||
import { InMemoryTransport } from '@modelcontextprotocol/sdk/inMemory.js';
|
||||
import { afterEach, describe, expect, it, vi } from 'vitest';
|
||||
import { KtxQueryError } from '../../../src/errors.js';
|
||||
import { createLocalProjectMemoryIngest } from '../../../src/context/memory/local-memory.js';
|
||||
import { detectCaptureSignals } from '../../../src/context/memory/capture-signals.js';
|
||||
import type { MemoryAgentInput } from '../../../src/context/memory/types.js';
|
||||
import { parseKtxProjectConfig, serializeKtxProjectConfig } from '../../../src/context/project/config.js';
|
||||
import { initKtxProject } from '../../../src/context/project/project.js';
|
||||
import { jsonToolResult } from '../../../src/context/mcp/context-tools.js';
|
||||
import { createMcpLogger } from '../../../src/context/mcp/logger.js';
|
||||
import { createDefaultKtxMcpServer, createKtxMcpServer } from '../../../src/context/mcp/server.js';
|
||||
import type {
|
||||
KtxDialectNotesMcpPort,
|
||||
KtxDiscoverDataMcpPort,
|
||||
KtxDictionarySearchMcpPort,
|
||||
KtxEntityDetailsMcpPort,
|
||||
|
|
@ -84,6 +87,7 @@ const retainedToolNames = [
|
|||
'memory_ingest_status',
|
||||
'sl_query',
|
||||
'sl_read_source',
|
||||
'sql_dialect_notes',
|
||||
'sql_execution',
|
||||
'wiki_read',
|
||||
'wiki_search',
|
||||
|
|
@ -136,6 +140,13 @@ function makeAllContextTools(): KtxMcpContextPorts {
|
|||
rowCount: 1,
|
||||
}),
|
||||
},
|
||||
dialectNotes: {
|
||||
read: vi.fn<KtxDialectNotesMcpPort['read']>().mockResolvedValue({
|
||||
connectionId: 'warehouse',
|
||||
dialect: 'postgres',
|
||||
notes: '**postgres** SQL conventions',
|
||||
}),
|
||||
},
|
||||
memoryIngest: {
|
||||
ingest: vi.fn<MemoryIngestPort['ingest']>().mockResolvedValue({ runId: 'run-1' }),
|
||||
status: vi.fn<MemoryIngestPort['status']>().mockResolvedValue({
|
||||
|
|
@ -203,6 +214,12 @@ describe('createKtxMcpServer', () => {
|
|||
},
|
||||
sl_query: { title: 'Semantic Layer Query', readOnlyHint: true, openWorldHint: false },
|
||||
sql_execution: { title: 'SQL Execution', readOnlyHint: true, openWorldHint: false },
|
||||
sql_dialect_notes: {
|
||||
title: 'SQL Dialect Notes',
|
||||
readOnlyHint: true,
|
||||
idempotentHint: true,
|
||||
openWorldHint: false,
|
||||
},
|
||||
memory_ingest: { title: 'Memory Ingest', destructiveHint: true, openWorldHint: false },
|
||||
memory_ingest_status: { title: 'Memory Ingest Status', readOnlyHint: true, openWorldHint: false },
|
||||
};
|
||||
|
|
@ -219,6 +236,22 @@ describe('createKtxMcpServer', () => {
|
|||
}
|
||||
});
|
||||
|
||||
it('routes sql_dialect_notes through the dialect-notes port', async () => {
|
||||
const fake = makeFakeServer();
|
||||
const contextTools = makeAllContextTools();
|
||||
createKtxMcpServer({
|
||||
server: fake.server,
|
||||
userContext: { userId: 'mcp-user' },
|
||||
contextTools,
|
||||
});
|
||||
|
||||
const result = await getTool(fake.tools, 'sql_dialect_notes').handler({ connectionId: 'warehouse' });
|
||||
expect(contextTools.dialectNotes!.read).toHaveBeenCalledWith({ connectionId: 'warehouse' });
|
||||
expect(result).toMatchObject({
|
||||
structuredContent: { connectionId: 'warehouse', dialect: 'postgres' },
|
||||
});
|
||||
});
|
||||
|
||||
it('exposes annotations and output schemas through the SDK tools/list response', async () => {
|
||||
const result = await listToolsThroughSdk(makeAllContextTools());
|
||||
const toolNames = result.tools.map((tool) => tool.name).sort();
|
||||
|
|
@ -1332,3 +1365,179 @@ describe('createKtxMcpServer', () => {
|
|||
}
|
||||
});
|
||||
});
|
||||
|
||||
describe('MCP tool-call logging', () => {
  afterEach(() => {
    vi.unstubAllEnvs();
    vi.restoreAllMocks();
  });

  function loggerCapture() {
    let buf = '';
    const io = { stdout: { write() {} }, stderr: { write(chunk: string) { buf += chunk; } } };
    return {
      io,
      logger: createMcpLogger(io, { isTTY: false }),
      text: () => buf,
      lines: () =>
        buf
          .split('\n')
          .filter((line) => line.trim().startsWith('{'))
          .map((line) => JSON.parse(line) as Record<string, unknown>),
    };
  }

  it('logs tool.start before the handler runs and a matching tool.end on completion', async () => {
    const cap = loggerCapture();
    const fake = makeFakeServer();
    createKtxMcpServer({
      server: fake.server,
      userContext: { userId: 'local' },
      logger: cap.logger,
      contextTools: {
        sqlExecution: {
          execute: vi
            .fn<KtxSqlExecutionMcpPort['execute']>()
            .mockResolvedValue({ headers: ['count'], rows: [[1]], rowCount: 1 }),
        },
      },
    });

    await getTool(fake.tools, 'sql_execution').handler({ connectionId: 'warehouse', sql: 'select 1' });

    const lines = cap.lines();
    const start = lines.find((line) => line.msg === 'tool.start');
    const end = lines.find((line) => line.msg === 'tool.end');
    expect(start).toMatchObject({
      tool: 'sql_execution',
      params: { connectionId: 'warehouse', sql: 'select 1' },
      level: 30,
    });
    expect(typeof start?.callId).toBe('string');
    expect(end).toMatchObject({ tool: 'sql_execution', callId: start?.callId, outcome: 'ok', level: 30 });
    expect(typeof end?.durationMs).toBe('number');
    expect(end?.resultSize as number).toBeGreaterThan(0);
  });

  it('leaves a tool.start carrying the SQL with no matching tool.end when a handler never returns', () => {
    const cap = loggerCapture();
    const fake = makeFakeServer();
    createKtxMcpServer({
      server: fake.server,
      userContext: { userId: 'local' },
      logger: cap.logger,
      contextTools: {
        sqlExecution: { execute: () => new Promise(() => {}) },
      },
    });

    void getTool(fake.tools, 'sql_execution').handler({ connectionId: 'warehouse', sql: 'select pg_sleep(99999)' });

    const lines = cap.lines();
    const start = lines.find((line) => line.msg === 'tool.start');
    expect(start).toMatchObject({ tool: 'sql_execution', params: { sql: 'select pg_sleep(99999)' } });
    expect(lines.some((line) => line.msg === 'tool.end' && line.callId === start?.callId)).toBe(false);
  });

  it('emits tool.end at warn when a completed call exceeds the slow threshold', async () => {
    vi.stubEnv('KTX_MCP_SLOW_TOOL_MS', '0');
    const cap = loggerCapture();
    const fake = makeFakeServer();
    createKtxMcpServer({
      server: fake.server,
      userContext: { userId: 'local' },
      logger: cap.logger,
      contextTools: {
        sqlExecution: {
          execute: async () => {
            await new Promise((resolve) => setTimeout(resolve, 5));
            return { headers: ['count'], rows: [[1]], rowCount: 1 };
          },
        },
      },
    });

    await getTool(fake.tools, 'sql_execution').handler({ connectionId: 'warehouse', sql: 'select 1' });

    const end = cap.lines().find((line) => line.msg === 'tool.end');
    expect(end).toMatchObject({ outcome: 'ok', level: 40 });
  });

  it('logs a matched tool.start/tool.end(error) pair carrying the deadline message when a query times out', async () => {
    const cap = loggerCapture();
    const fake = makeFakeServer();
    createKtxMcpServer({
      server: fake.server,
      userContext: { userId: 'local' },
      logger: cap.logger,
      contextTools: {
        sqlExecution: {
          execute: vi.fn<KtxSqlExecutionMcpPort['execute']>().mockRejectedValue(new KtxQueryError('query exceeded 30s')),
        },
      },
    });

    await getTool(fake.tools, 'sql_execution').handler({
      connectionId: 'warehouse',
      sql: 'select min(time_id), max(time_id), count(*) from profits',
    });

    const lines = cap.lines();
    const start = lines.find((line) => line.msg === 'tool.start');
    const end = lines.find((line) => line.msg === 'tool.end');
    expect(typeof start?.callId).toBe('string');
    expect(end).toMatchObject({ tool: 'sql_execution', callId: start?.callId, outcome: 'error', level: 50 });
    expect((end?.err as { message?: string }).message).toBe('query exceeded 30s');
    // No unmatched tool.start remains — the matched pair closes spec 15's hang gap for this case.
    expect(lines.filter((line) => line.msg === 'tool.start')).toHaveLength(1);
    expect(lines.filter((line) => line.msg === 'tool.end' && line.callId === start?.callId)).toHaveLength(1);
    expect(end?.durationMs as number).toBeGreaterThan(0);
  });

  it('suppresses routine tool traffic at warn level but keeps errored calls', async () => {
    vi.stubEnv('KTX_MCP_LOG_LEVEL', 'warn');
    const cap = loggerCapture();
    const fake = makeFakeServer();
    createKtxMcpServer({
      server: fake.server,
      userContext: { userId: 'local' },
      logger: cap.logger,
      contextTools: {
        knowledge: {
          search: vi.fn<KtxKnowledgeMcpPort['search']>().mockRejectedValue(new Error('wiki index unavailable')),
          read: vi.fn<KtxKnowledgeMcpPort['read']>().mockResolvedValue(null),
        },
      },
    });

    await getTool(fake.tools, 'wiki_search').handler({ query: 'revenue', limit: 5 });

    const lines = cap.lines();
    expect(lines.some((line) => line.msg === 'tool.start')).toBe(false);
    const end = lines.find((line) => line.msg === 'tool.end');
    expect(end).toMatchObject({ outcome: 'error', level: 50 });
    expect((end?.err as { message?: string }).message).toContain('wiki index unavailable');
  });

  it('does not log tool calls when no logger is provided', async () => {
    const fake = makeFakeServer();
    const io = makeIo(false);
    createKtxMcpServer({
      server: fake.server,
      userContext: { userId: 'local' },
      io,
      contextTools: {
        sqlExecution: {
          execute: vi
            .fn<KtxSqlExecutionMcpPort['execute']>()
            .mockResolvedValue({ headers: ['count'], rows: [[1]], rowCount: 1 }),
        },
      },
    });

    await getTool(fake.tools, 'sql_execution').handler({ connectionId: 'warehouse', sql: 'select 1' });

    expect(io.stderrText()).not.toContain('tool.start');
    expect(io.stderrText()).not.toContain('tool.end');
  });
});
|
|
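The level selection these logging tests assert for `tool.end` (30 for a routine success, 40 for a slow success, 50 for an error) can be summarised in a small pure function. This is only a sketch of the observable behaviour: `toolEndLevel` is a hypothetical name, and whether the real comparison against the slow threshold is strict is an assumption, since the tests do not distinguish `>` from `>=`.

```typescript
// Hypothetical sketch of how tool.end could pick its pino level; not the ktx source.
type Outcome = 'ok' | 'error';

// pino's numeric levels for info, warn, and error.
const LEVEL_INFO = 30;
const LEVEL_WARN = 40;
const LEVEL_ERROR = 50;

function toolEndLevel(outcome: Outcome, durationMs: number, slowMs: number): number {
  // A failed call is always logged at error, regardless of how long it took.
  if (outcome === 'error') return LEVEL_ERROR;
  // Assumption: strictly greater than the threshold counts as slow.
  return durationMs > slowMs ? LEVEL_WARN : LEVEL_INFO;
}
```

Because `tool.start` is emitted at info while slow and failed `tool.end` lines are raised to warn or error, setting `KTX_MCP_LOG_LEVEL=warn` drops routine traffic but still surfaces the calls worth investigating, which matches the suppression test above.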
|
|||
|
|
@ -86,6 +86,7 @@ connections:
|
|||
profileSampleRows: 10000,
|
||||
profileConcurrency: 4,
|
||||
validationConcurrency: 4,
|
||||
detectionBudgetMs: 600000,
|
||||
},
|
||||
},
|
||||
});
|
||||
|
|
@ -427,6 +428,7 @@ scan:
|
|||
profileConcurrency: 3
|
||||
validationConcurrency: 2
|
||||
validationBudget: 0
|
||||
detectionBudgetMs: 120000
|
||||
`);
|
||||
|
||||
expect(config.scan.relationships).toEqual({
|
||||
|
|
@ -441,6 +443,7 @@ scan:
|
|||
profileConcurrency: 3,
|
||||
validationConcurrency: 2,
|
||||
validationBudget: 0,
|
||||
detectionBudgetMs: 120000,
|
||||
});
|
||||
expect(serializeKtxProjectConfig(config)).toContain('enabled: false');
|
||||
expect(serializeKtxProjectConfig(config)).toContain('llmProposals: false');
|
||||
|
|
@ -453,6 +456,25 @@ scan:
|
|||
expect(serializeKtxProjectConfig(config)).toContain('profileConcurrency: 3');
|
||||
expect(serializeKtxProjectConfig(config)).toContain('validationConcurrency: 2');
|
||||
expect(serializeKtxProjectConfig(config)).toContain('validationBudget: 0');
|
||||
expect(serializeKtxProjectConfig(config)).toContain('detectionBudgetMs: 120000');
|
||||
});
|
||||
|
||||
it('defaults the relationship detection budget to ten minutes', () => {
|
||||
expect(buildDefaultKtxProjectConfig().scan.relationships.detectionBudgetMs).toBe(600000);
|
||||
});
|
||||
|
||||
it('rejects a non-positive or non-integer relationship detection budget', () => {
|
||||
for (const value of ['0', '-1', '1.5']) {
|
||||
const yaml = `
|
||||
scan:
|
||||
relationships:
|
||||
detectionBudgetMs: ${value}
|
||||
`;
|
||||
expect(() => parseKtxProjectConfig(yaml)).toThrow(/scan\.relationships\.detectionBudgetMs/);
|
||||
const validation = validateKtxProjectConfig(yaml);
|
||||
expect(validation.ok).toBe(false);
|
||||
expect(validation.issues.map((issue) => issue.path)).toContain('scan.relationships.detectionBudgetMs');
|
||||
}
|
||||
});
|
||||
|
||||
it('parses the scan relationship validation budget sentinel', () => {
|
||||
|
|
|
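The `detectionBudgetMs` rules covered by these config tests (default of ten minutes, positive integers only) could be validated as in the sketch below. `parseDetectionBudgetMs` is a hypothetical helper written for illustration; the real project presumably validates this through its config schema, which is not shown in this diff.

```typescript
// Hypothetical validator mirroring the tested rules; not the ktx config schema.
const DEFAULT_DETECTION_BUDGET_MS = 600_000; // ten minutes, per the default test

function parseDetectionBudgetMs(value: unknown): number {
  // An absent value takes the default.
  if (value === undefined) return DEFAULT_DETECTION_BUDGET_MS;
  // Zero, negatives, and fractions are rejected, naming the config path in the error.
  if (typeof value !== 'number' || !Number.isInteger(value) || value <= 0) {
    throw new Error(
      `scan.relationships.detectionBudgetMs must be a positive integer, got ${String(value)}`,
    );
  }
  return value;
}
```

Including the dotted config path in the message is what lets a test match on `/scan\.relationships\.detectionBudgetMs/` without depending on the rest of the wording.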
|||
|
|
@ -49,10 +49,10 @@ describe('ktx setup config helpers', () => {
|
|||
|
||||
it('merges setup-local gitignore entries without removing existing lines', () => {
|
||||
expect(mergeKtxSetupGitignoreEntries('cache/\ndb.sqlite\n')).toBe(
|
||||
['cache/', 'db.sqlite', 'db.sqlite-*', 'ingest-transcripts/', 'secrets/', 'setup/', 'agents/', ''].join('\n'),
|
||||
['cache/', 'db.sqlite', 'db.sqlite-*', 'ingest-transcripts/', 'logs/', 'secrets/', 'setup/', 'agents/', ''].join('\n'),
|
||||
);
|
||||
expect(mergeKtxSetupGitignoreEntries('cache/\nsecrets/\n')).toBe(
|
||||
['cache/', 'secrets/', 'db.sqlite', 'db.sqlite-*', 'ingest-transcripts/', 'setup/', 'agents/', ''].join('\n'),
|
||||
['cache/', 'secrets/', 'db.sqlite', 'db.sqlite-*', 'ingest-transcripts/', 'logs/', 'setup/', 'agents/', ''].join('\n'),
|
||||
);
|
||||
});
|
||||
});
|
||||
|
|
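The merge behaviour asserted above (keep existing lines in place, append only the missing entries, end with a newline) can be reproduced with a short function. This is a sketch, not `mergeKtxSetupGitignoreEntries` itself: the function name and the `REQUIRED_ENTRIES` list are assumptions, with the order inferred from the two updated expectations, and blank lines in the input are dropped for simplicity.

```typescript
// Hypothetical sketch of an append-only gitignore merge; not the ktx source.
// Assumed required entries, in the order implied by the test expectations.
const REQUIRED_ENTRIES = [
  'cache/',
  'db.sqlite',
  'db.sqlite-*',
  'ingest-transcripts/',
  'logs/',
  'secrets/',
  'setup/',
  'agents/',
];

function mergeGitignoreEntries(existing: string): string {
  // Existing non-empty lines keep their original order.
  const lines = existing.split('\n').filter((line) => line.length > 0);
  const present = new Set(lines);
  // Append each required entry that is not already present.
  for (const entry of REQUIRED_ENTRIES) {
    if (!present.has(entry)) {
      lines.push(entry);
      present.add(entry);
    }
  }
  // The trailing empty element yields a final newline.
  return [...lines, ''].join('\n');
}
```

Appending rather than rewriting keeps user-authored ordering intact, so adding `logs/` to the required set only shifts the position of entries that were themselves appended.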
|
|||
|
|
@ -1,4 +1,8 @@
|
|||
import { describe, expect, it, vi } from 'vitest';
|
||||
import { afterEach, beforeEach, describe, expect, it, vi } from 'vitest';
|
||||
import { type ChildProcess } from 'node:child_process';
|
||||
import { mkdtempSync, rmSync } from 'node:fs';
|
||||
import { tmpdir } from 'node:os';
|
||||
import { join } from 'node:path';
|
||||
|
||||
vi.mock('ai', async (importOriginal) => {
|
||||
const actual = await importOriginal<typeof import('ai')>();
|
||||
|
|
@ -14,6 +18,7 @@ import {
|
|||
KtxDescriptionGenerator,
|
||||
} from '../../../src/context/scan/description-generation.js';
|
||||
import { createKtxConnectorCapabilities, type KtxScanConnector } from '../../../src/context/scan/types.js';
|
||||
import { HANGING_CHILD, killTestChildren, spawnTestChild } from '../llm/subprocess-test-children.test-utils.js';
|
||||
|
||||
function createCache(initial: Record<string, string> = {}): KtxDescriptionCachePort {
|
||||
const data = new Map(Object.entries(initial));
|
||||
|
|
@ -41,6 +46,7 @@ function createLlmProvider(text = 'generated description') {
|
|||
}),
|
||||
generateObject: vi.fn(),
|
||||
runAgentLoop: vi.fn(),
|
||||
subprocessForkSpec: () => null,
|
||||
} as any;
|
||||
}
|
||||
|
||||
|
|
@ -57,6 +63,7 @@ function createFailingLlmProvider(message = 'timeout exceeded when trying to con
|
|||
}),
|
||||
generateObject: vi.fn(),
|
||||
runAgentLoop: vi.fn(),
|
||||
subprocessForkSpec: () => null,
|
||||
} as any;
|
||||
}
|
||||
|
||||
|
|
@ -492,7 +499,8 @@ describe('KtxDescriptionGenerator', () => {
|
|||
expect(result.tableDescription).toBeNull();
|
||||
expect(Object.fromEntries(result.columnDescriptions)).toEqual({ status: null });
|
||||
expect(warnings).toContain('enrichment_failed');
|
||||
expect(llmRuntime.generateObject).toHaveBeenCalledTimes(1);
|
||||
// A transient (non-timeout) failure retries up to the attempt limit (default 3).
|
||||
expect(llmRuntime.generateObject).toHaveBeenCalledTimes(3);
|
||||
expect(llmRuntime.generateText).not.toHaveBeenCalled();
|
||||
});
|
||||
});
|
||||
|
|
@ -684,6 +692,41 @@ describe('KtxDescriptionGenerator resilience', () => {
|
|||
expect(warnings).toEqual([]);
|
||||
});
|
||||
|
||||
it('propagates a genuine context abort during the batched LLM call instead of degrading to null', async () => {
|
||||
const controller = new AbortController();
|
||||
const llmRuntime = createLlmProvider('unused');
|
||||
llmRuntime.generateObject = vi.fn(async () => {
|
||||
controller.abort();
|
||||
throw new Error('The operation was aborted');
|
||||
});
|
||||
const warnings: string[] = [];
|
||||
const generator = new KtxDescriptionGenerator({
|
||||
llmRuntime,
|
||||
onWarning: (warning) => warnings.push(warning.code),
|
||||
settings: { columnMaxWords: 12, tableMaxWords: 18, dataSourceMaxWords: 24 },
|
||||
});
|
||||
|
||||
await expect(
|
||||
generator.generateBatchedTableDescriptions({
|
||||
connectionId: 'conn-1',
|
||||
connector: createConnector(),
|
||||
context: { runId: 'run-1', signal: controller.signal },
|
||||
dataSourceType: 'POSTGRESQL',
|
||||
supportsNestedAnalysis: false,
|
||||
table: {
|
||||
catalog: null,
|
||||
db: 'public',
|
||||
name: 'orders',
|
||||
rawDescriptions: {},
|
||||
columns: [{ name: 'status', type: 'text' }],
|
||||
},
|
||||
}),
|
||||
).rejects.toThrow();
|
||||
|
||||
// A genuine cancellation must not be filed as a per-table failure/timeout.
|
||||
expect(warnings).toEqual([]);
|
||||
});
|
||||
|
||||
it('generates column descriptions from rawDescriptions when sampleColumn is unavailable', async () => {
|
||||
const samplerWithoutColumn: KtxScanConnector = {
|
||||
...createConnector(),
|
||||
|
|
@ -782,3 +825,89 @@ describe('KtxDescriptionGenerator resilience', () => {
|
|||
expect(generateText).not.toHaveBeenCalled();
|
||||
});
|
||||
});
|
||||
|
||||
describe('KtxDescriptionGenerator subprocess kill boundary', () => {
|
||||
const children: ChildProcess[] = [];
|
||||
let workDir: string;
|
||||
let priorTimeout: string | undefined;
|
||||
|
||||
beforeEach(() => {
|
||||
workDir = mkdtempSync(join(tmpdir(), 'ktx-enrich-'));
|
||||
priorTimeout = process.env.KTX_ENRICH_LLM_TIMEOUT_MS;
|
||||
process.env.KTX_ENRICH_LLM_TIMEOUT_MS = '300';
|
||||
});
|
||||
|
||||
afterEach(() => {
|
||||
killTestChildren(children);
|
||||
children.length = 0;
|
||||
if (priorTimeout === undefined) {
|
||||
delete process.env.KTX_ENRICH_LLM_TIMEOUT_MS;
|
||||
} else {
|
||||
process.env.KTX_ENRICH_LLM_TIMEOUT_MS = priorTimeout;
|
||||
}
|
||||
rmSync(workDir, { recursive: true, force: true });
|
||||
});
|
||||
|
||||
it('skips a wedged subprocess-backed table with enrichment_timeout and settles within deadline+grace', async () => {
|
||||
const pidFile = join(workDir, 'gc.pid');
|
||||
const llmRuntime = createLlmProvider('unused');
|
||||
llmRuntime.subprocessForkSpec = () => ({ backend: 'codex', projectDir: '/tmp', modelSlots: { default: 'codex' } });
|
||||
const warnings: string[] = [];
|
||||
const generator = new KtxDescriptionGenerator({
|
||||
llmRuntime,
|
||||
onWarning: (warning) => warnings.push(warning.code),
|
||||
settings: { columnMaxWords: 12, tableMaxWords: 18, dataSourceMaxWords: 24 },
|
||||
spawnSubprocessGenerateChild: () => spawnTestChild(children, HANGING_CHILD, { KTX_TEST_GC_PID_FILE: pidFile }),
|
||||
});
|
||||
|
||||
const start = Date.now();
|
||||
const result = await generator.generateBatchedTableDescriptions({
|
||||
connectionId: 'conn-1',
|
||||
connector: createConnector(),
|
||||
context: { runId: 'run-1' },
|
||||
dataSourceType: 'POSTGRESQL',
|
||||
supportsNestedAnalysis: false,
|
||||
table: { catalog: null, db: 'public', name: 'orders', columns: [{ name: 'status', type: 'text' }] },
|
||||
});
|
||||
|
||||
expect(Date.now() - start).toBeLessThan(5000);
|
||||
expect(result.tableDescription).toBeNull();
|
||||
expect(Object.fromEntries(result.columnDescriptions)).toEqual({ status: null });
|
||||
expect(warnings).toContain('enrichment_timeout');
|
||||
// One wedge = one timeout: the hung table is not retried.
|
||||
expect(children).toHaveLength(1);
|
||||
const child = children[0]!;
|
||||
await vi.waitFor(() => expect(child.exitCode !== null || child.signalCode !== null).toBe(true), { timeout: 5000 });
|
||||
});
|
||||
|
||||
it('runs HTTP-backed enrichment in-process without spawning a child', async () => {
|
||||
const spawnSpy = vi.fn(() => {
|
||||
throw new Error('HTTP backend must not spawn a kill-boundary child');
|
||||
});
|
||||
const llmRuntime = createLlmProvider('unused');
|
||||
llmRuntime.subprocessForkSpec = () => null;
|
||||
llmRuntime.generateObject = vi.fn(async () => ({
|
||||
tableDescription: 'Orders fact table',
|
||||
columns: [{ name: 'status', description: 'Order lifecycle status' }],
|
||||
}));
|
||||
const generator = new KtxDescriptionGenerator({
|
||||
llmRuntime,
|
||||
settings: { columnMaxWords: 12, tableMaxWords: 18, dataSourceMaxWords: 24 },
|
||||
spawnSubprocessGenerateChild: spawnSpy,
|
||||
});
|
||||
|
||||
const result = await generator.generateBatchedTableDescriptions({
|
||||
connectionId: 'conn-1',
|
||||
connector: createConnector(),
|
||||
context: { runId: 'run-1' },
|
||||
dataSourceType: 'POSTGRESQL',
|
||||
supportsNestedAnalysis: false,
|
||||
table: { catalog: null, db: 'public', name: 'orders', columns: [{ name: 'status', type: 'text' }] },
|
||||
});
|
||||
|
||||
expect(spawnSpy).not.toHaveBeenCalled();
|
||||
expect(llmRuntime.generateObject).toHaveBeenCalledTimes(1);
|
||||
expect(result.tableDescription).toBe('Orders fact table');
|
||||
expect(Object.fromEntries(result.columnDescriptions)).toEqual({ status: 'Order lifecycle status' });
|
||||
});
|
||||
});
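Reviewer note: the three resilience cases above (transient failure retried to the attempt limit, a wedged call counted as a single timeout, a genuine abort propagated) can be summarized in a small sketch. This is not the generator's implementation; `runAttempts`, `AttemptResult` and `Outcome` are hypothetical names, and the loop is synchronous only to keep the example short. The warning codes and the default limit of 3 are taken from the tests.

```typescript
// Hypothetical sketch of the retry policy the tests describe.
type AttemptResult = 'ok' | 'timeout' | 'error';
type Outcome = 'ok' | 'enrichment_timeout' | 'enrichment_failed' | 'aborted';

function runAttempts(
  attempt: (n: number) => AttemptResult,
  options: { maxAttempts: number; isAborted: () => boolean },
): { outcome: Outcome; calls: number } {
  let calls = 0;
  while (calls < options.maxAttempts) {
    calls += 1;
    const result = attempt(calls);
    // A genuine cancellation wins over everything else; a real caller would rethrow here.
    if (options.isAborted()) return { outcome: 'aborted', calls };
    if (result === 'ok') return { outcome: 'ok', calls };
    // One wedge = one timeout: a hung call is not retried.
    if (result === 'timeout') return { outcome: 'enrichment_timeout', calls };
    // Any other failure is treated as transient and falls through to the next attempt.
  }
  return { outcome: 'enrichment_failed', calls };
}
```

With `maxAttempts: 3`, an always-failing call is issued three times before the table is skipped, while a timeout or an abort stops after the first call.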
|
||||
|
|
|
|||
264
packages/cli/test/context/scan/description-resume.test.ts
Normal file
|
|
@ -0,0 +1,264 @@
|
|||
import { mkdtemp, rm } from 'node:fs/promises';
|
||||
import { tmpdir } from 'node:os';
|
||||
import { join } from 'node:path';
|
||||
import { afterEach, beforeEach, describe, expect, it, vi } from 'vitest';
|
||||
import YAML from 'yaml';
|
||||
import type { KtxLlmRuntimePort } from '../../../src/context/llm/runtime-port.js';
|
||||
import { buildDefaultKtxProjectConfig, type KtxScanRelationshipConfig } from '../../../src/context/project/config.js';
|
||||
import { initKtxProject, type KtxLocalProject } from '../../../src/context/project/project.js';
|
||||
import {
|
||||
createKtxScanDescriptionResumeStore,
|
||||
writeLocalScanManifestShards,
|
||||
} from '../../../src/context/scan/local-enrichment-artifacts.js';
|
||||
import { runLocalScanEnrichment, type KtxLocalScanEnrichmentResult } from '../../../src/context/scan/local-enrichment.js';
|
||||
import { SqliteLocalScanEnrichmentStateStore } from '../../../src/context/scan/sqlite-local-enrichment-state-store.js';
|
||||
import { createKtxConnectorCapabilities, type KtxScanConnector, type KtxSchemaSnapshot } from '../../../src/context/scan/types.js';
|
||||
|
||||
const PROGRESS_PATH = 'raw-sources/warehouse/live-database/enrichment-progress/descriptions.json';
|
||||
const SHARD_PATH = 'semantic-layer/warehouse/_schema/public.yaml';
|
||||
|
||||
function column(name: string) {
|
||||
return {
|
||||
name,
|
||||
nativeType: 'integer',
|
||||
normalizedType: 'integer' as const,
|
||||
dimensionType: 'number' as const,
|
||||
nullable: false,
|
||||
primaryKey: name === 'id',
|
||||
comment: null,
|
||||
};
|
||||
}
|
||||
|
||||
function table(name: string) {
|
||||
return {
|
||||
catalog: null,
|
||||
db: 'public',
|
||||
name,
|
||||
kind: 'table' as const,
|
||||
comment: null,
|
||||
estimatedRows: 1,
|
||||
foreignKeys: [],
|
||||
columns: [column('id'), column('value')],
|
||||
};
|
||||
}
|
||||
|
||||
const snapshot: KtxSchemaSnapshot = {
|
||||
connectionId: 'warehouse',
|
||||
driver: 'postgres',
|
||||
extractedAt: '2026-04-29T12:00:00.000Z',
|
||||
scope: { schemas: ['public'] },
|
||||
metadata: {},
|
||||
tables: [table('customers'), table('orders'), table('products')],
|
||||
};
|
||||
|
||||
function connector(): KtxScanConnector {
|
||||
return {
|
||||
id: 'test:warehouse',
|
||||
driver: 'postgres',
|
||||
capabilities: createKtxConnectorCapabilities({ tableSampling: true, columnSampling: true }),
|
||||
introspect: vi.fn(async () => snapshot),
|
||||
listSchemas: vi.fn(async () => []),
|
||||
listTables: vi.fn(async () => []),
|
||||
sampleTable: vi.fn(async () => ({ headers: ['id', 'value'], rows: [[1, 2]], totalRows: 1 })),
|
||||
sampleColumn: vi.fn(async () => ({ values: ['1', '2'], nullCount: 0, distinctCount: 2 })),
|
||||
};
|
||||
}
|
||||
|
||||
function countingRuntime() {
|
||||
let calls = 0;
|
||||
const runtime: KtxLlmRuntimePort = {
|
||||
generateText: vi.fn(async () => 'AI column description'),
|
||||
generateObject: vi.fn(async () => {
|
||||
calls += 1;
|
||||
return { tableDescription: 'AI table description', columns: [] };
|
||||
}) as KtxLlmRuntimePort['generateObject'],
|
||||
runAgentLoop: vi.fn(),
|
||||
subprocessForkSpec: () => null,
|
||||
};
|
||||
return { runtime, calls: () => calls };
|
||||
}
|
||||
|
||||
function relationshipsDisabled(): KtxScanRelationshipConfig {
|
||||
return { ...buildDefaultKtxProjectConfig().scan.relationships, enabled: false };
|
||||
}
|
||||
|
||||
describe('descriptions stage incremental persistence + resume', () => {
|
||||
let tempDir: string;
|
||||
let project: KtxLocalProject;
|
||||
|
||||
async function runEnrichment(runId: string): Promise<{ result: KtxLocalScanEnrichmentResult; calls: number }> {
|
||||
const llm = countingRuntime();
|
||||
const result = await runLocalScanEnrichment({
|
||||
connectionId: 'warehouse',
|
||||
mode: 'enriched',
|
||||
connector: connector(),
|
||||
snapshot,
|
||||
context: { runId },
|
||||
providers: { llmRuntime: llm.runtime, embedding: null },
|
||||
descriptionResumeStore: createKtxScanDescriptionResumeStore({
|
||||
project,
|
||||
connectionId: 'warehouse',
|
||||
syncId: 'sync-1',
|
||||
driver: 'postgres',
|
||||
}),
|
||||
syncId: 'sync-1',
|
||||
relationshipSettings: relationshipsDisabled(),
|
||||
});
|
||||
return { result, calls: llm.calls() };
|
||||
}
|
||||
|
||||
async function readProgress(): Promise<{ inputHash: string; descriptions: Array<{ table: { name: string } }> }> {
|
||||
return JSON.parse((await project.fileStore.readFile(PROGRESS_PATH)).content);
|
||||
}
|
||||
|
||||
async function writeProgress(record: unknown): Promise<void> {
|
||||
await project.fileStore.writeFile(PROGRESS_PATH, `${JSON.stringify(record, null, 2)}\n`, 'ktx', 'ktx@example.com', 'edit');
|
||||
}
|
||||
|
||||
beforeEach(async () => {
|
||||
tempDir = await mkdtemp(join(tmpdir(), 'ktx-desc-resume-'));
|
||||
project = await initKtxProject({ projectDir: join(tempDir, 'project') });
|
||||
});
|
||||
|
||||
afterEach(async () => {
|
||||
await rm(tempDir, { recursive: true, force: true });
|
||||
});
|
||||
|
||||
it('flushes durable descriptions + ai manifest descriptions on a fresh run', async () => {
|
||||
const { calls } = await runEnrichment('run-1');
|
||||
expect(calls).toBe(3);
|
||||
|
||||
const progress = await readProgress();
|
||||
expect(progress.descriptions.map((entry) => entry.table.name).sort()).toEqual(['customers', 'orders', 'products']);
|
||||
|
||||
const shard = YAML.parse((await project.fileStore.readFile(SHARD_PATH)).content) as {
|
||||
tables: Record<string, { descriptions?: { ai?: string } }>;
|
||||
};
|
||||
expect(shard.tables.customers?.descriptions?.ai).toBe('AI table description');
|
||||
expect(shard.tables.products?.descriptions?.ai).toBe('AI table description');
|
||||
});
|
||||
|
||||
it('re-issues no LLM calls when every table is already enriched (matching inputHash)', async () => {
|
||||
await runEnrichment('run-1');
|
||||
const { result, calls } = await runEnrichment('run-2');
|
||||
|
||||
expect(calls).toBe(0);
|
||||
expect(result.descriptionUpdates).toHaveLength(3);
|
||||
expect(result.descriptionUpdates.every((update) => update.tableDescription === 'AI table description')).toBe(true);
|
||||
});
|
||||
|
||||
it('re-enriches only the tables missing from the durable record', async () => {
|
||||
await runEnrichment('run-1');
|
||||
const progress = await readProgress();
|
||||
progress.descriptions = progress.descriptions.filter((entry) => entry.table.name !== 'orders');
|
||||
await writeProgress(progress);
|
||||
|
||||
const { result, calls } = await runEnrichment('run-2');
|
||||
|
||||
expect(calls).toBe(1);
|
||||
expect(result.descriptionUpdates.map((update) => update.table.name).sort()).toEqual([
|
||||
'customers',
|
||||
'orders',
|
||||
'products',
|
||||
]);
|
||||
});
|
||||
|
||||
it('recomputes the whole stage when the durable record inputHash differs', async () => {
|
||||
await runEnrichment('run-1');
|
||||
const progress = await readProgress();
|
||||
await writeProgress({ ...progress, inputHash: 'stale-input-hash' });
|
||||
|
||||
const { calls } = await runEnrichment('run-2');
|
||||
expect(calls).toBe(3);
|
||||
});
|
||||
|
||||
it('persists the other tables and completes the stage when one table fails', async () => {
|
||||
const stateStore = new SqliteLocalScanEnrichmentStateStore({ dbPath: join(tempDir, 'state.sqlite') });
|
||||
let calls = 0;
|
||||
const runtime: KtxLlmRuntimePort = {
|
||||
generateText: vi.fn(async () => 'AI column description'),
|
||||
generateObject: vi.fn(async (input: { prompt: string }) => {
|
||||
calls += 1;
|
||||
if (input.prompt.includes('orders')) {
|
||||
throw new Error('backend overloaded');
|
||||
}
|
||||
return { tableDescription: 'AI table description', columns: [] };
|
||||
}) as KtxLlmRuntimePort['generateObject'],
|
||||
runAgentLoop: vi.fn(),
|
||||
subprocessForkSpec: () => null,
|
||||
};
|
||||
|
||||
const result = await runLocalScanEnrichment({
|
||||
connectionId: 'warehouse',
|
||||
mode: 'enriched',
|
||||
connector: connector(),
|
||||
snapshot,
|
||||
context: { runId: 'run-skip' },
|
||||
providers: { llmRuntime: runtime, embedding: null },
|
||||
descriptionResumeStore: createKtxScanDescriptionResumeStore({
|
||||
project,
|
||||
connectionId: 'warehouse',
|
||||
syncId: 'sync-1',
|
||||
driver: 'postgres',
|
||||
}),
|
||||
stateStore,
|
||||
syncId: 'sync-1',
|
||||
relationshipSettings: relationshipsDisabled(),
|
||||
});
|
||||
|
||||
// orders retries to the attempt limit (3) then fails; customers + products succeed once each.
|
||||
expect(calls).toBe(5);
|
||||
// The failed table is a single missing description, not the whole stage's loss.
|
||||
const byName = new Map(result.descriptionUpdates.map((update) => [update.table.name, update]));
|
||||
expect(byName.get('orders')?.tableDescription).toBeNull();
|
||||
expect(byName.get('customers')?.tableDescription).toBe('AI table description');
|
||||
expect(byName.get('products')?.tableDescription).toBe('AI table description');
|
||||
|
||||
// The stage completed (a completed row exists, not zero).
|
||||
const stages = await stateStore.listRunStages('run-skip');
|
||||
expect(stages.some((stage) => stage.stage === 'descriptions' && stage.status === 'completed')).toBe(true);
|
||||
|
||||
// The good tables are durable: progress record + ai: in the manifest; the failed one is absent.
|
||||
const progress = await readProgress();
|
||||
expect(progress.descriptions.map((entry) => entry.table.name).sort()).toEqual(['customers', 'products']);
|
||||
const shard = YAML.parse((await project.fileStore.readFile(SHARD_PATH)).content) as {
|
||||
tables: Record<string, { descriptions?: { ai?: string } }>;
|
||||
};
|
||||
expect(shard.tables.customers?.descriptions?.ai).toBe('AI table description');
|
||||
expect(shard.tables.orders?.descriptions?.ai).toBeUndefined();
|
||||
});
|
||||
|
||||
it('rewrites only the manifest shards that gained a changed table', async () => {
|
||||
const multiDb: KtxSchemaSnapshot = {
|
||||
...snapshot,
|
||||
tables: [
|
||||
{ ...table('customers'), db: 'sales' },
|
||||
{ ...table('orders'), db: 'ops' },
|
||||
],
|
||||
};
|
||||
await writeLocalScanManifestShards({
|
||||
project,
|
||||
connectionId: 'warehouse',
|
||||
syncId: 'sync-1',
|
||||
driver: 'postgres',
|
||||
snapshot: multiDb,
|
||||
dryRun: false,
|
||||
});
|
||||
|
||||
const flushed = await writeLocalScanManifestShards({
|
||||
project,
|
||||
connectionId: 'warehouse',
|
||||
syncId: 'sync-1',
|
||||
driver: 'postgres',
|
||||
snapshot: multiDb,
|
||||
dryRun: false,
|
||||
descriptionUpdates: [
|
||||
{ table: { catalog: null, db: 'sales', name: 'customers' }, tableDescription: 'desc', columnDescriptions: {} },
|
||||
],
|
||||
onlyChangedTableNames: new Set(['customers']),
|
||||
});
|
||||
|
||||
expect(flushed.manifestShards).toHaveLength(1);
|
||||
expect(flushed.manifestShards[0]).toContain('sales');
|
||||
});
|
||||
});
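Reviewer note: the resume tests above imply a simple selection rule. A durable record is reused only when its `inputHash` matches, and then only the tables missing from it are sent back to the LLM. The sketch below illustrates that rule under assumptions; `tablesNeedingEnrichment` and the identity key format are hypothetical, and only the `inputHash` / `descriptions[].table` record shape comes from the test's `readProgress` helper.

```typescript
// Hypothetical sketch of the resume decision; not the PR's implementation.
type TableRef = { catalog: string | null; db: string | null; name: string };
type ProgressRecord = { inputHash: string; descriptions: Array<{ table: TableRef }> };

function tablesNeedingEnrichment(
  tables: TableRef[],
  progress: ProgressRecord | null,
  currentInputHash: string,
): TableRef[] {
  // No record, or a record produced from different inputs: recompute the whole stage.
  if (!progress || progress.inputHash !== currentInputHash) return tables;
  // Assumed identity key: the full (catalog, db, name) triple, so same-named
  // tables in different schemas stay distinct.
  const key = (ref: TableRef): string => JSON.stringify([ref.catalog, ref.db, ref.name]);
  const done = new Set(progress.descriptions.map((entry) => key(entry.table)));
  return tables.filter((ref) => !done.has(key(ref)));
}
```

A table whose enrichment failed is simply absent from the record, so the next run picks it up without re-issuing calls for the tables that succeeded.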
|
||||
24
packages/cli/test/context/scan/enabled-tables.test.ts
Normal file
|
|
@ -0,0 +1,24 @@
|
|||
import { describe, expect, it } from 'vitest';
|
||||
import { resolveEnabledTables } from '../../../src/context/scan/enabled-tables.js';
|
||||
import { tableRefKey } from '../../../src/context/scan/table-ref.js';
|
||||
|
||||
describe('resolveEnabledTables', () => {
|
||||
it('returns null when enabled_tables is absent or empty', () => {
|
||||
expect(resolveEnabledTables(undefined)).toBeNull();
|
||||
expect(resolveEnabledTables({ driver: 'sqlite' })).toBeNull();
|
||||
expect(resolveEnabledTables({ driver: 'sqlite', enabled_tables: [] })).toBeNull();
|
||||
});
|
||||
|
||||
it('treats sqlite "main.<name>" as equivalent to the bare "<name>"', () => {
|
||||
const qualified = resolveEnabledTables({ driver: 'sqlite', enabled_tables: ['main.customers'] });
|
||||
const bare = resolveEnabledTables({ driver: 'sqlite', enabled_tables: ['customers'] });
|
||||
const expected = tableRefKey({ catalog: null, db: null, name: 'customers' });
|
||||
expect([...(qualified ?? [])]).toEqual([expected]);
|
||||
expect([...(bare ?? [])]).toEqual([expected]);
|
||||
});
|
||||
|
||||
it('keeps the schema qualifier for non-sqlite drivers', () => {
|
||||
const scope = resolveEnabledTables({ driver: 'postgres', enabled_tables: ['public.customers'] });
|
||||
expect([...(scope ?? [])]).toEqual([tableRefKey({ catalog: null, db: 'public', name: 'customers' })]);
|
||||
});
|
||||
});
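Reviewer note: a minimal sketch of the normalization these three cases pin down, assuming dotted `enabled_tables` entries. `parseEnabledTable`, `refKey` and `resolveScope` are hypothetical stand-ins; the real `resolveEnabledTables` and `tableRefKey` may parse and encode keys differently.

```typescript
// Hypothetical sketch; names and key format are assumptions for illustration.
type TableRef = { catalog: string | null; db: string | null; name: string };

// Assumed key format: the three identity parts joined by a NUL separator.
function refKey(ref: TableRef): string {
  return [ref.catalog ?? '', ref.db ?? '', ref.name].join('\u0000');
}

function parseEnabledTable(driver: string, entry: string): TableRef {
  const parts = entry.split('.');
  const name = parts[parts.length - 1] ?? entry;
  const db = parts.length > 1 ? (parts[parts.length - 2] ?? null) : null;
  const catalog = parts.length > 2 ? parts.slice(0, -2).join('.') : null;
  // SQLite's default schema is "main", so "main.customers" and "customers"
  // refer to the same table and should produce the same key.
  if (driver === 'sqlite' && db === 'main') return { catalog: null, db: null, name };
  return { catalog, db, name };
}

function resolveScope(config?: { driver: string; enabled_tables?: string[] }): Set<string> | null {
  const entries = config?.enabled_tables ?? [];
  // Absent or empty means "no restriction", represented as null.
  if (!config || entries.length === 0) return null;
  const driver = config.driver;
  return new Set(entries.map((entry) => refKey(parseEnabledTable(driver, entry))));
}
```

Returning `null` rather than an empty set keeps "nothing configured" distinguishable from "nothing enabled".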
|
||||
|
|
@ -1,15 +1,26 @@
|
|||
import { mkdtemp, rm } from 'node:fs/promises';
|
||||
import { tmpdir } from 'node:os';
|
||||
import { join } from 'node:path';
|
||||
import Database from 'better-sqlite3';
|
||||
import { afterEach, beforeEach, describe, expect, it } from 'vitest';
|
||||
import {
|
||||
completedKtxScanEnrichmentStateSummary,
|
||||
computeKtxScanEnrichmentInputHash,
|
||||
computeKtxDescriptionsStageHash,
|
||||
computeKtxEmbeddingsStageHash,
|
||||
computeKtxRelationshipsStageHash,
|
||||
computeKtxScanDescriptionDigest,
|
||||
type KtxScanEmbeddingIdentity,
|
||||
type KtxScanLlmIdentity,
|
||||
summarizeKtxScanEnrichmentState,
|
||||
} from '../../../src/context/scan/enrichment-state.js';
|
||||
import { SqliteLocalScanEnrichmentStateStore } from '../../../src/context/scan/sqlite-local-enrichment-state-store.js';
|
||||
import { buildDefaultKtxProjectConfig } from '../../../src/context/project/config.js';
|
||||
import type { KtxSchemaSnapshot } from '../../../src/context/scan/types.js';
|
||||
|
||||
const llmIdentity: KtxScanLlmIdentity = { model: 'opus', baseUrlConfigured: false };
|
||||
const embeddingIdentity: KtxScanEmbeddingIdentity = { model: 'minilm', dimensions: 384, batchSize: 64 };
|
||||
const relationshipSettings = buildDefaultKtxProjectConfig().scan.relationships;
|
||||
|
||||
const snapshot: KtxSchemaSnapshot = {
|
||||
connectionId: 'warehouse',
|
||||
driver: 'postgres',
|
||||
|
|
@ -53,28 +64,19 @@ describe('scan enrichment state', () => {
|
|||
await rm(tempDir, { recursive: true, force: true });
|
||||
});
|
||||
|
||||
- it('computes stable input hashes without depending on object key order', () => {
- const first = computeKtxScanEnrichmentInputHash({
- snapshot,
- mode: 'enriched',
- detectRelationships: true,
- providerIdentity: { provider: 'local-heuristic', llmModel: 'a' },
- });
- const second = computeKtxScanEnrichmentInputHash({
+ it('computes stable per-stage hashes without depending on object key order', () => {
+ const first = computeKtxDescriptionsStageHash({ snapshot, llmIdentity });
+ const second = computeKtxDescriptionsStageHash({
  snapshot: { ...snapshot, metadata: {} },
- mode: 'enriched',
- detectRelationships: true,
- providerIdentity: { llmModel: 'a', provider: 'local-heuristic' },
+ llmIdentity: { baseUrlConfigured: false, model: 'opus' },
  });
  const firstTable = snapshot.tables[0];
  if (!firstTable) {
  throw new Error('Expected test snapshot table');
  }
- const changed = computeKtxScanEnrichmentInputHash({
+ const changed = computeKtxDescriptionsStageHash({
  snapshot: { ...snapshot, tables: [{ ...firstTable, name: 'orders_v2' }] },
- mode: 'enriched',
- detectRelationships: true,
- providerIdentity: { provider: 'local-heuristic', llmModel: 'a' },
+ llmIdentity,
  });
|
||||
|
||||
expect(first).toMatch(/^[a-f0-9]{64}$/);
|
||||
|
|
@ -82,13 +84,48 @@ describe('scan enrichment state', () => {
|
|||
expect(changed).not.toBe(first);
|
||||
});
|
||||
|
||||
it('isolates per-stage invalidation: one input changes only its own stage', () => {
|
||||
const descriptionDigest = computeKtxScanDescriptionDigest(['orders.id (integer)']);
|
||||
const descriptions = computeKtxDescriptionsStageHash({ snapshot, llmIdentity });
|
||||
const embeddings = computeKtxEmbeddingsStageHash({ snapshot, embeddingIdentity, descriptionDigest });
|
||||
const relationships = computeKtxRelationshipsStageHash({ snapshot, relationshipSettings, llmIdentity });
|
||||
|
||||
// Switching the description LLM re-keys descriptions + relationships (both
|
||||
// depend on llmIdentity) but NOT embeddings.
|
||||
const otherLlm: KtxScanLlmIdentity = { model: 'sonnet', baseUrlConfigured: false };
|
||||
expect(computeKtxDescriptionsStageHash({ snapshot, llmIdentity: otherLlm })).not.toBe(descriptions);
|
||||
expect(computeKtxRelationshipsStageHash({ snapshot, relationshipSettings, llmIdentity: otherLlm })).not.toBe(
|
||||
relationships,
|
||||
);
|
||||
expect(computeKtxEmbeddingsStageHash({ snapshot, embeddingIdentity, descriptionDigest })).toBe(embeddings);
|
||||
|
||||
// Swapping the embeddings model re-keys only embeddings.
|
||||
const otherEmbedding: KtxScanEmbeddingIdentity = { model: 'mpnet', dimensions: 768, batchSize: 64 };
|
||||
expect(computeKtxEmbeddingsStageHash({ snapshot, embeddingIdentity: otherEmbedding, descriptionDigest })).not.toBe(
|
||||
embeddings,
|
||||
);
|
||||
expect(computeKtxDescriptionsStageHash({ snapshot, llmIdentity })).toBe(descriptions);
|
||||
expect(computeKtxRelationshipsStageHash({ snapshot, relationshipSettings, llmIdentity })).toBe(relationships);
|
||||
|
||||
// A description-content change (new digest) re-keys only embeddings;
|
||||
// relationships are deliberately decoupled from description content (D5).
|
||||
const otherDigest = computeKtxScanDescriptionDigest(['orders.id (integer). A primary key.']);
|
||||
expect(computeKtxEmbeddingsStageHash({ snapshot, embeddingIdentity, descriptionDigest: otherDigest })).not.toBe(
|
||||
embeddings,
|
||||
);
|
||||
expect(computeKtxRelationshipsStageHash({ snapshot, relationshipSettings, llmIdentity })).toBe(relationships);
|
||||
|
||||
// Flipping llmProposals re-keys only relationships.
|
||||
const otherRelationships = { ...relationshipSettings, llmProposals: !relationshipSettings.llmProposals };
|
||||
expect(
|
||||
computeKtxRelationshipsStageHash({ snapshot, relationshipSettings: otherRelationships, llmIdentity }),
|
||||
).not.toBe(relationships);
|
||||
expect(computeKtxDescriptionsStageHash({ snapshot, llmIdentity })).toBe(descriptions);
|
||||
expect(computeKtxEmbeddingsStageHash({ snapshot, embeddingIdentity, descriptionDigest })).toBe(embeddings);
|
||||
});
|
||||
|
||||
it('persists completed stages and ignores stale hashes', async () => {
|
||||
- const inputHash = computeKtxScanEnrichmentInputHash({
- snapshot,
- mode: 'enriched',
- detectRelationships: true,
- providerIdentity: { provider: 'local-heuristic' },
- });
+ const inputHash = computeKtxDescriptionsStageHash({ snapshot, llmIdentity });
|
||||
|
||||
await store.saveCompletedStage({
|
||||
runId: 'scan-run-1',
|
||||
|
|
@ -103,7 +140,7 @@ describe('scan enrichment state', () => {
|
|||
|
||||
await expect(
|
||||
store.findCompletedStage({
|
||||
- runId: 'scan-run-1',
+ connectionId: 'warehouse',
|
||||
stage: 'descriptions',
|
||||
inputHash,
|
||||
}),
|
||||
|
|
@ -116,13 +153,51 @@ describe('scan enrichment state', () => {
|
|||
|
||||
await expect(
|
||||
store.findCompletedStage({
|
||||
- runId: 'scan-run-1',
+ connectionId: 'warehouse',
|
||||
stage: 'descriptions',
|
||||
inputHash: 'different-hash',
|
||||
}),
|
||||
).resolves.toBeNull();
|
||||
});
|
||||
|
||||
it('resolves a completed stage across a fresh run id by content identity', async () => {
|
||||
const inputHash = computeKtxDescriptionsStageHash({ snapshot, llmIdentity });
|
||||
|
||||
await store.saveCompletedStage({
|
||||
runId: 'scan-run-first',
|
||||
connectionId: 'warehouse',
|
||||
syncId: 'sync-first',
|
||||
mode: 'enriched',
|
||||
stage: 'descriptions',
|
||||
inputHash,
|
||||
output: [{ table: { catalog: null, db: 'public', name: 'orders' }, tableDescription: 'first' }],
|
||||
updatedAt: '2026-04-29T12:00:00.000Z',
|
||||
});
|
||||
// A later run with the SAME content identity overwrites in place (the
|
||||
// primary key no longer includes run_id), and the lookup resolves it
|
||||
// without ever knowing the run id that produced it.
|
||||
await store.saveCompletedStage({
|
||||
runId: 'scan-run-second',
|
||||
connectionId: 'warehouse',
|
||||
syncId: 'sync-second',
|
||||
mode: 'enriched',
|
||||
stage: 'descriptions',
|
||||
inputHash,
|
||||
output: [{ table: { catalog: null, db: 'public', name: 'orders' }, tableDescription: 'second' }],
|
||||
updatedAt: '2026-04-29T12:05:00.000Z',
|
||||
});
|
||||
|
||||
const resolved = await store.findCompletedStage({
|
||||
connectionId: 'warehouse',
|
||||
stage: 'descriptions',
|
||||
inputHash,
|
||||
});
|
||||
expect(resolved?.runId).toBe('scan-run-second');
|
||||
expect(resolved?.output).toEqual([
|
||||
{ table: { catalog: null, db: 'public', name: 'orders' }, tableDescription: 'second' },
|
||||
]);
|
||||
});
|
||||
|
||||
it('records failed stages without making them reusable', async () => {
|
||||
await store.saveFailedStage({
|
||||
runId: 'scan-run-2',
|
||||
|
|
@ -137,7 +212,7 @@ describe('scan enrichment state', () => {
|
|||
|
||||
await expect(
|
||||
store.findCompletedStage({
|
||||
- runId: 'scan-run-2',
+ connectionId: 'warehouse',
|
||||
stage: 'embeddings',
|
||||
inputHash: 'hash-2',
|
||||
}),
|
||||
|
|
@ -153,6 +228,47 @@ describe('scan enrichment state', () => {
|
|||
]);
|
||||
});
|
||||
|
||||
it('recreates the resume cache when an older primary key shape is found', async () => {
|
||||
const dbPath = join(tempDir, 'legacy.sqlite');
|
||||
const legacy = new Database(dbPath);
|
||||
legacy.exec(`
|
||||
CREATE TABLE local_scan_enrichment_stages (
|
||||
run_id TEXT NOT NULL,
|
||||
stage TEXT NOT NULL,
|
||||
input_hash TEXT NOT NULL,
|
||||
connection_id TEXT NOT NULL,
|
||||
sync_id TEXT NOT NULL,
|
||||
mode TEXT NOT NULL,
|
||||
status TEXT NOT NULL,
|
||||
output_json TEXT,
|
||||
error_message TEXT,
|
||||
updated_at TEXT NOT NULL,
|
||||
PRIMARY KEY (run_id, stage)
|
||||
);
|
||||
INSERT INTO local_scan_enrichment_stages
|
||||
VALUES ('old-run', 'descriptions', 'hash', 'warehouse', 'sync', 'enriched', 'completed', 'null', NULL, '2026-01-01T00:00:00.000Z');
|
||||
`);
|
||||
legacy.close();
|
||||
|
||||
const recreated = new SqliteLocalScanEnrichmentStateStore({ dbPath });
|
||||
// The legacy row is dropped with the old table; the new key shape is in
|
||||
// force, so a fresh save + lookup round-trips cleanly.
|
||||
await recreated.saveCompletedStage({
|
||||
runId: 'new-run',
|
||||
connectionId: 'warehouse',
|
||||
syncId: 'sync',
|
||||
mode: 'enriched',
|
||||
stage: 'descriptions',
|
||||
inputHash: 'hash',
|
||||
output: ['fresh'],
|
||||
updatedAt: '2026-02-01T00:00:00.000Z',
|
||||
});
|
||||
await expect(
|
||||
recreated.findCompletedStage({ connectionId: 'warehouse', stage: 'descriptions', inputHash: 'hash' }),
|
||||
).resolves.toMatchObject({ runId: 'new-run', output: ['fresh'] });
|
||||
await expect(recreated.listRunStages('old-run')).resolves.toEqual([]);
|
||||
});
|
||||
|
||||
it('summarizes resumed, completed, and failed stages for reports', () => {
|
||||
expect(
|
||||
summarizeKtxScanEnrichmentState({
|
||||
|
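Reviewer note: the per-stage hash tests in this file rely on two properties: the key must not depend on object key order, and each stage must fold in only its own inputs. The sketch below shows one way to get both, using a canonical serialization with sorted keys. `stableStringify`, `descriptionsStageKey` and `embeddingsStageKey` are hypothetical; the digest step that would turn the canonical string into a 64-character hex hash is deliberately left out, and the per-stage input lists are taken from the test's call sites.

```typescript
// Hypothetical sketch; the PR's compute*StageHash functions may differ.
function stableStringify(value: unknown): string {
  if (Array.isArray(value)) return `[${value.map(stableStringify).join(',')}]`;
  if (value !== null && typeof value === 'object') {
    const record = value as Record<string, unknown>;
    // Sorting keys makes the output independent of insertion order.
    const body = Object.keys(record)
      .sort()
      .map((key) => `${JSON.stringify(key)}:${stableStringify(record[key])}`)
      .join(',');
    return `{${body}}`;
  }
  // undefined has no JSON form; it is mapped to null in this sketch.
  return value === undefined ? 'null' : JSON.stringify(value);
}

type LlmIdentity = { model: string; baseUrlConfigured: boolean };
type EmbeddingIdentity = { model: string; dimensions: number; batchSize: number };

// Descriptions depend on the snapshot and the LLM identity only.
function descriptionsStageKey(snapshot: unknown, llmIdentity: LlmIdentity): string {
  return stableStringify({ stage: 'descriptions', snapshot, llmIdentity });
}

// Embeddings depend on the snapshot, the embedding identity and the description digest.
function embeddingsStageKey(snapshot: unknown, embeddingIdentity: EmbeddingIdentity, descriptionDigest: string): string {
  return stableStringify({ stage: 'embeddings', snapshot, embeddingIdentity, descriptionDigest });
}
```

Because the LLM identity never enters `embeddingsStageKey`, switching the description model cannot invalidate the embeddings stage, which is the isolation the test asserts.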
|
|
|||
|
|
@ -5,7 +5,11 @@ import { afterEach, beforeEach, describe, expect, it } from 'vitest';
|
|||
import YAML from 'yaml';
|
||||
import { initKtxProject, type KtxLocalProject } from '../../../src/context/project/project.js';
|
||||
import type { KtxLocalScanEnrichmentResult } from '../../../src/context/scan/local-enrichment.js';
|
||||
- import { writeLocalScanEnrichmentArtifacts, writeLocalScanManifestShards } from '../../../src/context/scan/local-enrichment-artifacts.js';
+ import {
+ loadOnDiskDescriptionUpdates,
+ writeLocalScanEnrichmentArtifacts,
+ writeLocalScanManifestShards,
+ } from '../../../src/context/scan/local-enrichment-artifacts.js';
|
||||
import type { KtxSchemaSnapshot } from '../../../src/context/scan/types.js';
|
||||
|
||||
const snapshot: KtxSchemaSnapshot = {
|
||||
|
|
@ -220,6 +224,7 @@ function enrichment(): KtxLocalScanEnrichmentResult {
|
|||
},
|
||||
],
|
||||
compositeRelationships: null,
|
||||
relationshipPartial: null,
|
||||
};
|
||||
}
|
||||
|
||||
|
|
@ -238,6 +243,86 @@ describe('writeLocalScanEnrichmentArtifacts', () => {
|
|||
await rm(tempDir, { recursive: true, force: true });
|
||||
});
|
||||
|
||||
it('scopes manifest descriptions by full table identity across same-named tables in different schemas', async () => {
|
||||
const multiSchemaSnapshot: KtxSchemaSnapshot = {
|
||||
connectionId: 'warehouse',
|
||||
driver: 'postgres',
|
||||
extractedAt: '2026-04-29T12:00:00.000Z',
|
||||
scope: { schemas: ['analytics', 'staging'] },
|
||||
metadata: {},
|
||||
tables: ['analytics', 'staging'].map((schema) => ({
|
||||
catalog: null,
|
||||
db: schema,
|
||||
name: 'orders',
|
||||
kind: 'table',
|
||||
comment: null,
|
||||
estimatedRows: 1,
|
||||
foreignKeys: [],
|
||||
columns: [
|
||||
{
|
||||
name: 'id',
|
||||
nativeType: 'integer',
|
||||
normalizedType: 'integer',
|
||||
dimensionType: 'number',
|
||||
nullable: false,
|
||||
primaryKey: true,
|
||||
comment: null,
|
||||
},
|
||||
],
|
||||
})),
|
||||
};
|
||||
const descriptionUpdates = [
|
||||
{
|
||||
table: { catalog: null, db: 'analytics', name: 'orders' },
|
||||
tableDescription: 'Curated analytics orders',
|
||||
columnDescriptions: { id: 'Analytics order id' },
|
||||
},
|
||||
{
|
||||
table: { catalog: null, db: 'staging', name: 'orders' },
|
||||
tableDescription: 'Raw staging orders',
|
||||
columnDescriptions: { id: 'Staging order id' },
|
||||
},
|
||||
];
|
||||
|
||||
await writeLocalScanManifestShards({
|
||||
project,
|
||||
connectionId: 'warehouse',
|
||||
syncId: 'sync-multi',
|
||||
driver: 'postgres',
|
||||
snapshot: multiSchemaSnapshot,
|
||||
descriptionUpdates,
|
||||
dryRun: false,
|
||||
});
|
||||
|
||||
type Shard = {
|
||||
tables: Record<
|
||||
string,
|
||||
{ descriptions?: Record<string, string>; columns: Array<{ name: string; descriptions?: Record<string, string> }> }
|
||||
>;
|
||||
};
|
||||
const analyticsShard = YAML.parse(
|
||||
await readFile(join(project.projectDir, 'semantic-layer/warehouse/_schema/analytics.yaml'), 'utf-8'),
|
||||
) as Shard;
|
||||
const stagingShard = YAML.parse(
|
||||
await readFile(join(project.projectDir, 'semantic-layer/warehouse/_schema/staging.yaml'), 'utf-8'),
|
||||
) as Shard;
|
||||
|
||||
expect(analyticsShard.tables.orders?.descriptions?.ai).toBe('Curated analytics orders');
|
||||
expect(stagingShard.tables.orders?.descriptions?.ai).toBe('Raw staging orders');
|
||||
expect(analyticsShard.tables.orders?.columns[0]?.descriptions?.ai).toBe('Analytics order id');
|
||||
expect(stagingShard.tables.orders?.columns[0]?.descriptions?.ai).toBe('Staging order id');
|
||||
|
||||
// The on-disk reconstruction (used by selective `--stages` runs that skip the
|
||||
// descriptions stage) must also resolve per identity, not collapse names.
|
||||
const reconstructed = await loadOnDiskDescriptionUpdates(project, 'warehouse', multiSchemaSnapshot);
|
||||
const analytics = reconstructed.find((update) => update.table.db === 'analytics');
|
||||
const staging = reconstructed.find((update) => update.table.db === 'staging');
|
||||
expect(analytics?.tableDescription).toBe('Curated analytics orders');
|
||||
expect(staging?.tableDescription).toBe('Raw staging orders');
|
||||
expect(analytics?.columnDescriptions.id).toBe('Analytics order id');
|
||||
expect(staging?.columnDescriptions.id).toBe('Staging order id');
|
||||
});
|
||||
|
||||
it('writes enrichment artifacts and manifest shards while preserving external descriptions', async () => {
|
||||
await project.fileStore.writeFile(
|
||||
'semantic-layer/warehouse/_schema/public.yaml',
|
||||
|
|
@@ -291,6 +376,7 @@ describe('writeLocalScanEnrichmentArtifacts', () => {
         profileSampleRows: 500,
         profileConcurrency: 3,
         validationConcurrency: 2,
+        detectionBudgetMs: 600000,
       },
     });

@@ -476,6 +562,7 @@ describe('writeLocalScanEnrichmentArtifacts', () => {
         profileSampleRows: 10000,
         profileConcurrency: 4,
         validationConcurrency: 4,
+        detectionBudgetMs: 600000,
       },
       dryRun: false,
     });
@@ -746,6 +833,7 @@ describe('writeLocalScanEnrichmentArtifacts', () => {
         profileSampleRows: 10000,
         profileConcurrency: 4,
         validationConcurrency: 4,
+        detectionBudgetMs: 600000,
       },
       dryRun: false,
     });

@@ -1,6 +1,15 @@
|
|||
import { mkdtemp, readFile, rm } from 'node:fs/promises';
|
||||
import { tmpdir } from 'node:os';
|
||||
import { join } from 'node:path';
|
||||
import Database from 'better-sqlite3';
|
||||
import { describe, expect, it, vi } from 'vitest';
|
||||
import YAML from 'yaml';
|
||||
import { buildDefaultKtxProjectConfig } from '../../../src/context/project/config.js';
|
||||
import { initKtxProject } from '../../../src/context/project/project.js';
|
||||
import {
|
||||
loadOnDiskDescriptionUpdates,
|
||||
writeLocalScanEnrichmentArtifacts,
|
||||
} from '../../../src/context/scan/local-enrichment-artifacts.js';
|
||||
import type {
|
||||
KtxScanEnrichmentCompletedStage,
|
||||
KtxScanEnrichmentFailedStage,
|
||||
|
|
@@ -201,15 +210,24 @@ function noDeclaredRelationshipSnapshot(): KtxSchemaSnapshot {
 
 function memoryEnrichmentStateStore(): KtxScanEnrichmentStateStore {
   const records = new Map<string, KtxScanEnrichmentCompletedStage | KtxScanEnrichmentFailedStage>();
-  const key = (input: Pick<KtxScanEnrichmentStageLookup, 'runId' | 'stage'>) => `${input.runId}:${input.stage}`;
+  const key = (input: Pick<KtxScanEnrichmentStageLookup, 'connectionId' | 'stage' | 'inputHash'>) =>
+    `${input.connectionId}:${input.stage}:${input.inputHash}`;
   return {
     async findCompletedStage<TOutput>(input: KtxScanEnrichmentStageLookup) {
       const record = records.get(key(input));
-      if (!record || record.status !== 'completed' || record.inputHash !== input.inputHash) {
+      if (!record || record.status !== 'completed') {
         return null;
       }
       return record as KtxScanEnrichmentCompletedStage<TOutput>;
     },
+    async findLatestCompletedStage(input) {
+      const matches = [...records.values()].filter(
+        (record): record is KtxScanEnrichmentCompletedStage =>
+          record.status === 'completed' && record.connectionId === input.connectionId && record.stage === input.stage,
+      );
+      matches.sort((left, right) => (left.updatedAt < right.updatedAt ? 1 : -1));
+      return matches[0] ?? null;
+    },
     async saveCompletedStage(input) {
       records.set(key(input), {
         ...input,
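Reviewer note: the state-store hunk above moves the record key from `runId:stage` to `connectionId:stage:inputHash`, which is what lets a run with a fresh `runId` find earlier work. The sketch below is a minimal, self-contained illustration of that keying idea only. `StageRecord`, `StageLookup`, `stageKey`, and `MemoryStageStore` are hypothetical names invented for this note, not the project's types.

```typescript
// Hypothetical, simplified stand-ins for the store types touched by the diff.
type Stage = 'descriptions' | 'embeddings' | 'relationships';

interface StageRecord {
  connectionId: string;
  stage: Stage;
  inputHash: string;
  runId: string;
  status: 'completed' | 'failed';
  output: unknown;
}

type StageLookup = Pick<StageRecord, 'connectionId' | 'stage' | 'inputHash'>;

// The key deliberately omits runId, so a later run with a new runId but the
// same inputs resolves to the same record.
const stageKey = (input: StageLookup): string =>
  `${input.connectionId}:${input.stage}:${input.inputHash}`;

class MemoryStageStore {
  private readonly records = new Map<string, StageRecord>();

  save(record: StageRecord): void {
    this.records.set(stageKey(record), record);
  }

  // No separate inputHash comparison is needed: the hash is part of the key.
  findCompleted(lookup: StageLookup): StageRecord | null {
    const record = this.records.get(stageKey(lookup));
    return record && record.status === 'completed' ? record : null;
  }
}

const store = new MemoryStageStore();
store.save({
  connectionId: 'warehouse',
  stage: 'descriptions',
  inputHash: 'abc',
  runId: 'run-1',
  status: 'completed',
  output: ['orders'],
});

// A second run (any runId) with identical inputs hits the cache...
const hit = store.findCompleted({ connectionId: 'warehouse', stage: 'descriptions', inputHash: 'abc' });
// ...while changed inputs produce a different hash and therefore a miss.
const miss = store.findCompleted({ connectionId: 'warehouse', stage: 'descriptions', inputHash: 'changed' });
```

One design consequence, under these assumptions: anything that should invalidate a stage has to be folded into `inputHash`, since the run identifier no longer separates records.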
@@ -246,6 +264,57 @@ describe('local scan enrichment', () => {
|
|||
});
|
||||
});
|
||||
|
||||
it('scopes descriptions by full table identity across same-named tables in different schemas', () => {
|
||||
const multiSchemaSnapshot: KtxSchemaSnapshot = {
|
||||
connectionId: 'warehouse',
|
||||
driver: 'postgres',
|
||||
extractedAt: '2026-04-29T12:00:00.000Z',
|
||||
scope: { schemas: ['analytics', 'staging'] },
|
||||
metadata: {},
|
||||
tables: ['analytics', 'staging'].map((schema) => ({
|
||||
catalog: null,
|
||||
db: schema,
|
||||
name: 'orders',
|
||||
kind: 'table',
|
||||
comment: null,
|
||||
estimatedRows: 1,
|
||||
foreignKeys: [],
|
||||
columns: [
|
||||
{
|
||||
name: 'id',
|
||||
nativeType: 'integer',
|
||||
normalizedType: 'integer',
|
||||
dimensionType: 'number',
|
||||
nullable: false,
|
||||
primaryKey: true,
|
||||
comment: null,
|
||||
},
|
||||
],
|
||||
})),
|
||||
};
|
||||
const descriptions = [
|
||||
{
|
||||
table: { catalog: null, db: 'analytics', name: 'orders' },
|
||||
tableDescription: 'Curated analytics orders',
|
||||
columnDescriptions: { id: 'Analytics order id' },
|
||||
},
|
||||
{
|
||||
table: { catalog: null, db: 'staging', name: 'orders' },
|
||||
tableDescription: 'Raw staging orders',
|
||||
columnDescriptions: { id: 'Staging order id' },
|
||||
},
|
||||
];
|
||||
|
||||
const schema = snapshotToKtxEnrichedSchema(multiSchemaSnapshot, new Map(), descriptions);
|
||||
|
||||
const analytics = schema.tables.find((table) => table.id === 'analytics.orders');
|
||||
const staging = schema.tables.find((table) => table.id === 'staging.orders');
|
||||
expect(analytics?.descriptions.ai).toBe('Curated analytics orders');
|
||||
expect(staging?.descriptions.ai).toBe('Raw staging orders');
|
||||
expect(analytics?.columns[0]?.descriptions.ai).toBe('Analytics order id');
|
||||
expect(staging?.columns[0]?.descriptions.ai).toBe('Staging order id');
|
||||
});
|
||||
|
||||
it('maps snapshot foreign keys into formal schema relationships', () => {
|
||||
const source = noDeclaredRelationshipSnapshot();
|
||||
const snapshotWithForeignKey = {
|
||||
|
|
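Reviewer note: the test added in this hunk asserts that descriptions for `analytics.orders` and `staging.orders` stay separate. A rough sketch of why a name-only lookup fails and a full-identity lookup does not is given below. `TableRef`, `DescriptionUpdate`, `tableIdentity`, and `indexByIdentity` are hypothetical helpers written for this note; how the project actually derives its identity key is not shown in the diff.

```typescript
// Hypothetical shapes mirroring the fixtures used in the test above.
interface TableRef {
  catalog: string | null;
  db: string | null;
  name: string;
}

interface DescriptionUpdate {
  table: TableRef;
  tableDescription: string | null;
  columnDescriptions: Record<string, string | null>;
}

// Serialising a fixed-order tuple keeps a null catalog distinct from the
// literal string "null" and avoids delimiter collisions inside names.
const tableIdentity = (ref: TableRef): string =>
  JSON.stringify([ref.catalog, ref.db, ref.name]);

function indexByIdentity(updates: DescriptionUpdate[]): Map<string, DescriptionUpdate> {
  const index = new Map<string, DescriptionUpdate>();
  for (const update of updates) {
    index.set(tableIdentity(update.table), update);
  }
  return index;
}

const updates: DescriptionUpdate[] = [
  {
    table: { catalog: null, db: 'analytics', name: 'orders' },
    tableDescription: 'Curated analytics orders',
    columnDescriptions: { id: 'Analytics order id' },
  },
  {
    table: { catalog: null, db: 'staging', name: 'orders' },
    tableDescription: 'Raw staging orders',
    columnDescriptions: { id: 'Staging order id' },
  },
];

// Keyed by bare name, the second entry overwrites the first.
const byName = new Map(updates.map((u) => [u.table.name, u] as [string, DescriptionUpdate]));

// Keyed by full identity, both survive and resolve independently.
const byIdentity = indexByIdentity(updates);
const staging = byIdentity.get(tableIdentity({ catalog: null, db: 'staging', name: 'orders' }));
const analytics = byIdentity.get(tableIdentity({ catalog: null, db: 'analytics', name: 'orders' }));
```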
@@ -617,8 +686,8 @@ describe('local scan enrichment', () => {
 
     expect(events).toEqual(
       expect.arrayContaining([
-        expect.objectContaining({ message: 'Generating descriptions 1/2 tables', transient: true }),
-        expect.objectContaining({ message: 'Generating descriptions 2/2 tables', transient: true }),
+        expect.objectContaining({ message: 'Generating descriptions 1/2 (customers, 1 cols)', transient: true }),
+        expect.objectContaining({ message: 'Generating descriptions 2/2 (orders, 2 cols)', transient: true }),
         expect.objectContaining({ message: 'Building embeddings 1/1 batches', transient: true }),
         expect.objectContaining({ message: 'Detecting relationships' }),
       ]),
@@ -711,7 +780,7 @@ describe('local scan enrichment', () => {
     expect(embedBatch.mock.calls.map(([texts]) => texts).map((texts) => texts.length)).toEqual([2, 2, 1]);
   });
 
-  it('reuses completed description and embedding stages for the same run id and snapshot hash', async () => {
+  it('reuses completed description and embedding stages across a fresh run id by content identity', async () => {
     const stateStore = memoryEnrichmentStateStore();
     const scanConnector = connector();
     const providers = {
@@ -728,21 +797,25 @@ describe('local scan enrichment', () => {
       providers,
       stateStore,
       syncId: 'sync-resume-1',
-      providerIdentity: { provider: 'fake', embeddingDimensions: 6 },
+      llmIdentity: { model: 'fake', baseUrlConfigured: false },
+      embeddingIdentity: { model: 'fake-embed', dimensions: 6, batchSize: 64 },
     });
 
     const generateObject = vi.spyOn(providers.llmRuntime, 'generateObject');
     const embedBatch = vi.spyOn(providers.embedding, 'embedBatch');
+    // A re-run mints a brand-new runId/syncId (as a real interrupted ingest
+    // would); resume must still hit the cache via (connectionId, stage, inputHash).
     const second = await runLocalScanEnrichment({
       connectionId: 'warehouse',
       mode: 'enriched',
       detectRelationships: true,
       connector: scanConnector,
-      context: { runId: 'scan-run-resume-1' },
+      context: { runId: 'scan-run-resume-2' },
       providers,
       stateStore,
-      syncId: 'sync-resume-1',
-      providerIdentity: { provider: 'fake', embeddingDimensions: 6 },
+      syncId: 'sync-resume-2',
+      llmIdentity: { model: 'fake', baseUrlConfigured: false },
+      embeddingIdentity: { model: 'fake-embed', dimensions: 6, batchSize: 64 },
     });
 
     expect(first.state.completedStages).toEqual(['descriptions', 'embeddings', 'relationships']);
@@ -756,6 +829,159 @@ describe('local scan enrichment', () => {
|
|||
expect(second.relationships).toEqual(first.relationships);
|
||||
});
|
||||
|
||||
it('marks a budget-truncated relationship stage partial, persists it, and re-runs only when the budget is raised', async () => {
|
||||
const executor = new InMemorySqliteExecutor();
|
||||
try {
|
||||
executor.db.exec(`
|
||||
CREATE TABLE accounts (id INTEGER NOT NULL);
|
||||
CREATE TABLE orders (id INTEGER NOT NULL, account_id INTEGER NOT NULL);
|
||||
INSERT INTO accounts (id) VALUES (1), (2);
|
||||
INSERT INTO orders (id, account_id) VALUES (10, 1), (11, 1), (12, 2);
|
||||
`);
|
||||
const scanConnector = {
|
||||
...connector(),
|
||||
driver: 'sqlite' as const,
|
||||
capabilities: createKtxConnectorCapabilities({ readOnlySql: true, columnStats: true }),
|
||||
introspect: vi.fn(async () => noDeclaredRelationshipSnapshot()),
|
||||
executeReadOnly: executor.executeReadOnly.bind(executor),
|
||||
};
|
||||
const stateStore = memoryEnrichmentStateStore();
|
||||
const base = Date.parse('2026-06-01T00:00:00.000Z');
|
||||
let calls = 0;
|
||||
// A clock that jumps a second per read against a 1ms budget trips at the
|
||||
// first table-profile boundary.
|
||||
const advancingNow = () => new Date(base + calls++ * 1000);
|
||||
const tightSettings = {
|
||||
...buildDefaultKtxProjectConfig().scan.relationships,
|
||||
detectionBudgetMs: 1,
|
||||
};
|
||||
|
||||
const first = await runLocalScanEnrichment({
|
||||
connectionId: 'warehouse',
|
||||
mode: 'relationships',
|
||||
detectRelationships: true,
|
||||
connector: scanConnector,
|
||||
context: { runId: 'budget-run-1' },
|
||||
providers: null,
|
||||
stateStore,
|
||||
syncId: 'sync-budget-1',
|
||||
relationshipSettings: tightSettings,
|
||||
now: advancingNow,
|
||||
});
|
||||
|
||||
expect(first.relationshipPartial).toEqual({ reason: 'budget' });
|
||||
expect(first.warnings.map((warning) => warning.code)).toContain('relationship_detection_partial');
|
||||
expect(first.state.completedStages).toContain('relationships');
|
||||
|
||||
// A re-run with a fresh runId resumes the saved partial from cache.
|
||||
const second = await runLocalScanEnrichment({
|
||||
connectionId: 'warehouse',
|
||||
mode: 'relationships',
|
||||
detectRelationships: true,
|
||||
connector: scanConnector,
|
||||
context: { runId: 'budget-run-2' },
|
||||
providers: null,
|
||||
stateStore,
|
||||
syncId: 'sync-budget-2',
|
||||
relationshipSettings: tightSettings,
|
||||
});
|
||||
expect(second.state.resumedStages).toContain('relationships');
|
||||
|
||||
// Raising the budget changes the content identity, forcing a fuller run.
|
||||
const third = await runLocalScanEnrichment({
|
||||
connectionId: 'warehouse',
|
||||
mode: 'relationships',
|
||||
detectRelationships: true,
|
||||
connector: scanConnector,
|
||||
context: { runId: 'budget-run-3' },
|
||||
providers: null,
|
||||
stateStore,
|
||||
syncId: 'sync-budget-3',
|
||||
relationshipSettings: { ...tightSettings, detectionBudgetMs: 600_000 },
|
||||
});
|
||||
expect(third.state.resumedStages).not.toContain('relationships');
|
||||
expect(third.relationshipPartial).toBeNull();
|
||||
} finally {
|
||||
executor.close();
|
||||
}
|
||||
});
|
||||
|
||||
it('checkpoints descriptions and embeddings before the relationship stage queries the database', async () => {
|
||||
const executor = new InMemorySqliteExecutor();
|
||||
try {
|
||||
executor.db.exec(`
|
||||
CREATE TABLE accounts (id INTEGER NOT NULL);
|
||||
CREATE TABLE orders (id INTEGER NOT NULL, account_id INTEGER NOT NULL);
|
||||
INSERT INTO accounts (id) VALUES (1), (2);
|
||||
INSERT INTO orders (id, account_id) VALUES (10, 1), (11, 1), (12, 2);
|
||||
`);
|
||||
const checkpoints: Array<Awaited<ReturnType<typeof runLocalScanEnrichment>>> = [];
|
||||
let sawRelationshipQuery = false;
|
||||
let relationshipQueryRanAfterCheckpoint = true;
|
||||
const scanConnector = {
|
||||
...connector(),
|
||||
driver: 'sqlite' as const,
|
||||
capabilities: createKtxConnectorCapabilities({ readOnlySql: true, columnStats: true }),
|
||||
introspect: vi.fn(async () => noDeclaredRelationshipSnapshot()),
|
||||
executeReadOnly: (input: KtxReadOnlyQueryInput, ctx: KtxScanContext) => {
|
||||
sawRelationshipQuery = true;
|
||||
if (checkpoints.length === 0) {
|
||||
relationshipQueryRanAfterCheckpoint = false;
|
||||
}
|
||||
return executor.executeReadOnly(input, ctx);
|
||||
},
|
||||
};
|
||||
|
||||
const result = await runLocalScanEnrichment({
|
||||
connectionId: 'warehouse',
|
||||
mode: 'enriched',
|
||||
detectRelationships: true,
|
||||
connector: scanConnector,
|
||||
context: { runId: 'checkpoint-order' },
|
||||
providers: {
|
||||
...createDeterministicLocalScanEnrichmentProviders(),
|
||||
embedding: fakeScanEmbedding({ dimensions: 6 }),
|
||||
},
|
||||
onCheckpoint: async (checkpoint) => {
|
||||
checkpoints.push(checkpoint);
|
||||
},
|
||||
});
|
||||
|
||||
expect(checkpoints).toHaveLength(1);
|
||||
const checkpoint = checkpoints[0];
|
||||
if (!checkpoint) {
|
||||
throw new Error('Expected a checkpoint');
|
||||
}
|
||||
expect(checkpoint.summary.tableDescriptions).toBe('completed');
|
||||
expect(checkpoint.summary.embeddings).toBe('completed');
|
||||
expect(checkpoint.descriptionUpdates.length).toBeGreaterThan(0);
|
||||
expect(checkpoint.embeddingUpdates.length).toBeGreaterThan(0);
|
||||
// The relationship-specific outputs are deliberately absent at checkpoint time.
|
||||
expect(checkpoint.relationshipUpdate).toBeNull();
|
||||
expect(checkpoint.relationshipProfile).toBeNull();
|
||||
expect(sawRelationshipQuery).toBe(true);
|
||||
expect(relationshipQueryRanAfterCheckpoint).toBe(true);
|
||||
// The final result still carries the relationship outputs.
|
||||
expect(result.relationshipProfile).not.toBeNull();
|
||||
} finally {
|
||||
executor.close();
|
||||
}
|
||||
});
|
||||
|
||||
it('does not checkpoint when relationship detection is skipped', async () => {
|
||||
const onCheckpoint = vi.fn(async () => {});
|
||||
await runLocalScanEnrichment({
|
||||
connectionId: 'warehouse',
|
||||
mode: 'enriched',
|
||||
connector: connector(),
|
||||
context: { runId: 'no-checkpoint' },
|
||||
providers: createDeterministicLocalScanEnrichmentProviders(),
|
||||
relationshipSettings: { ...buildDefaultKtxProjectConfig().scan.relationships, enabled: false },
|
||||
onCheckpoint,
|
||||
});
|
||||
expect(onCheckpoint).not.toHaveBeenCalled();
|
||||
});
|
||||
|
||||
it('does not reuse completed stages when the snapshot changes', async () => {
|
||||
const stateStore = memoryEnrichmentStateStore();
|
||||
const providers = {
|
||||
|
|
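Reviewer note: the budget test in this hunk injects a clock that advances one second per read against a 1 ms budget, so the run is cut short at the first boundary check and reported as partial. The sketch below shows that pattern in isolation. `runWithinBudget`, `Clock`, and `BudgetedResult` are hypothetical names for this note; where the real detector checks its budget, and what it returns, may differ.

```typescript
type Clock = () => Date;

interface BudgetedResult<T> {
  items: T[];
  partial: { reason: 'budget' } | null;
}

// Runs tasks in order, checking elapsed time at each task boundary.
// The clock is injected so tests can control time deterministically.
function runWithinBudget<T>(
  tasks: Array<() => T>,
  budgetMs: number,
  now: Clock = () => new Date(),
): BudgetedResult<T> {
  const startedAt = now().getTime();
  const items: T[] = [];
  for (const task of tasks) {
    if (now().getTime() - startedAt > budgetMs) {
      // Keep what was finished and flag the result as truncated.
      return { items, partial: { reason: 'budget' } };
    }
    items.push(task());
  }
  return { items, partial: null };
}

const base = Date.parse('2026-06-01T00:00:00.000Z');
let calls = 0;
// Each read of this clock is one second later than the previous one.
const advancingNow: Clock = () => new Date(base + calls++ * 1000);

// With a 1 ms budget the very first boundary check already exceeds it.
const tight = runWithinBudget([() => 'accounts', () => 'orders'], 1, advancingNow);

// With a frozen clock and a generous budget every task completes.
const roomy = runWithinBudget([() => 'accounts', () => 'orders'], 600000, () => new Date(base));
```

Injecting `now` rather than calling `Date.now()` directly is what makes the truncation reproducible without sleeping in the test.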
@@ -773,7 +999,8 @@ describe('local scan enrichment', () => {
       providers,
       stateStore,
       syncId: 'sync-resume-hash',
-      providerIdentity: { provider: 'fake', embeddingDimensions: 6 },
+      llmIdentity: { model: 'fake', baseUrlConfigured: false },
+      embeddingIdentity: { model: 'fake-embed', dimensions: 6, batchSize: 64 },
     });
 
     const firstTable = snapshot.tables[0];
@@ -798,7 +1025,8 @@ describe('local scan enrichment', () => {
       providers,
       stateStore,
       syncId: 'sync-resume-hash',
-      providerIdentity: { provider: 'fake', embeddingDimensions: 6 },
+      llmIdentity: { model: 'fake', baseUrlConfigured: false },
+      embeddingIdentity: { model: 'fake-embed', dimensions: 6, batchSize: 64 },
     });
 
     expect(result.state.resumedStages).toEqual([]);
@@ -868,4 +1096,653 @@ describe('local scan enrichment', () => {
|
|||
}
|
||||
});
|
||||
|
||||
it('merges ai descriptions into the enriched relationship schema', () => {
|
||||
const schema = snapshotToKtxEnrichedSchema(snapshot, new Map(), [
|
||||
{
|
||||
table: { catalog: null, db: 'public', name: 'orders' },
|
||||
tableDescription: 'All customer orders',
|
||||
columnDescriptions: { customer_id: 'FK to the owning customer' },
|
||||
},
|
||||
]);
|
||||
const orders = schema.tables.find((table) => table.ref.name === 'orders');
|
||||
expect(orders?.descriptions).toMatchObject({ db: 'Customer orders', ai: 'All customer orders' });
|
||||
expect(orders?.columns.find((column) => column.name === 'customer_id')?.descriptions).toMatchObject({
|
||||
db: 'Customer id',
|
||||
ai: 'FK to the owning customer',
|
||||
});
|
||||
});
|
||||
|
||||
it('force-reruns a named stage past the completed-row short-circuit and leaves unselected stages untouched', async () => {
|
||||
const stateStore = memoryEnrichmentStateStore();
|
||||
const scanConnector = connector();
|
||||
const providers = {
|
||||
...createDeterministicLocalScanEnrichmentProviders(),
|
||||
embedding: fakeScanEmbedding({ dimensions: 6 }),
|
||||
};
|
||||
const identity = {
|
||||
llmIdentity: { model: 'fake', baseUrlConfigured: false },
|
||||
embeddingIdentity: { model: 'fake-embed', dimensions: 6, batchSize: 64 },
|
||||
};
|
||||
|
||||
await runLocalScanEnrichment({
|
||||
connectionId: 'warehouse',
|
||||
mode: 'enriched',
|
||||
detectRelationships: true,
|
||||
connector: scanConnector,
|
||||
context: { runId: 'force-1' },
|
||||
providers,
|
||||
stateStore,
|
||||
syncId: 'force-s1',
|
||||
...identity,
|
||||
});
|
||||
|
||||
const generateObject = vi.spyOn(providers.llmRuntime, 'generateObject');
|
||||
const embedBatch = vi.spyOn(providers.embedding, 'embedBatch');
|
||||
|
||||
const rerun = await runLocalScanEnrichment({
|
||||
connectionId: 'warehouse',
|
||||
mode: 'enriched',
|
||||
detectRelationships: true,
|
||||
connector: scanConnector,
|
||||
context: { runId: 'force-2' },
|
||||
providers,
|
||||
stateStore,
|
||||
syncId: 'force-s2',
|
||||
stages: ['descriptions'],
|
||||
...identity,
|
||||
});
|
||||
|
||||
// Only descriptions ran, and it recomputed (not resumed) despite a matching
|
||||
// completed row; embeddings + relationships were left untouched.
|
||||
expect(rerun.state.completedStages).toEqual(['descriptions']);
|
||||
expect(rerun.state.resumedStages).toEqual([]);
|
||||
expect(generateObject).toHaveBeenCalled();
|
||||
expect(embedBatch).not.toHaveBeenCalled();
|
||||
});
|
||||
|
||||
it('naming every stage forces a full recompute rather than a no-op resume', async () => {
|
||||
const stateStore = memoryEnrichmentStateStore();
|
||||
const scanConnector = connector();
|
||||
const providers = {
|
||||
...createDeterministicLocalScanEnrichmentProviders(),
|
||||
embedding: fakeScanEmbedding({ dimensions: 6 }),
|
||||
};
|
||||
const identity = {
|
||||
llmIdentity: { model: 'fake', baseUrlConfigured: false },
|
||||
embeddingIdentity: { model: 'fake-embed', dimensions: 6, batchSize: 64 },
|
||||
};
|
||||
|
||||
await runLocalScanEnrichment({
|
||||
connectionId: 'warehouse',
|
||||
mode: 'enriched',
|
||||
detectRelationships: true,
|
||||
connector: scanConnector,
|
||||
context: { runId: 'full-1' },
|
||||
providers,
|
||||
stateStore,
|
||||
syncId: 'full-s1',
|
||||
...identity,
|
||||
});
|
||||
|
||||
const generateObject = vi.spyOn(providers.llmRuntime, 'generateObject');
|
||||
const embedBatch = vi.spyOn(providers.embedding, 'embedBatch');
|
||||
|
||||
const rerun = await runLocalScanEnrichment({
|
||||
connectionId: 'warehouse',
|
||||
mode: 'enriched',
|
||||
detectRelationships: true,
|
||||
connector: scanConnector,
|
||||
context: { runId: 'full-2' },
|
||||
providers,
|
||||
stateStore,
|
||||
syncId: 'full-s2',
|
||||
stages: ['descriptions', 'embeddings', 'relationships'],
|
||||
...identity,
|
||||
});
|
||||
|
||||
expect(rerun.state.resumedStages).toEqual([]);
|
||||
expect(rerun.state.completedStages).toEqual(['descriptions', 'embeddings', 'relationships']);
|
||||
expect(generateObject).toHaveBeenCalled();
|
||||
expect(embedBatch).toHaveBeenCalled();
|
||||
});
|
||||
|
||||
it('isolates per-stage invalidation: changing the embedding identity re-runs only embeddings', async () => {
|
||||
const stateStore = memoryEnrichmentStateStore();
|
||||
const scanConnector = connector();
|
||||
const providers = {
|
||||
...createDeterministicLocalScanEnrichmentProviders(),
|
||||
embedding: fakeScanEmbedding({ dimensions: 6 }),
|
||||
};
|
||||
const llmIdentity = { model: 'fake', baseUrlConfigured: false };
|
||||
|
||||
await runLocalScanEnrichment({
|
||||
connectionId: 'warehouse',
|
||||
mode: 'enriched',
|
||||
detectRelationships: true,
|
||||
connector: scanConnector,
|
||||
context: { runId: 'iso-1' },
|
||||
providers,
|
||||
stateStore,
|
||||
syncId: 'iso-s1',
|
||||
llmIdentity,
|
||||
embeddingIdentity: { model: 'embed-v1', dimensions: 6, batchSize: 64 },
|
||||
});
|
||||
|
||||
const generateObject = vi.spyOn(providers.llmRuntime, 'generateObject');
|
||||
const embedBatch = vi.spyOn(providers.embedding, 'embedBatch');
|
||||
|
||||
const rerun = await runLocalScanEnrichment({
|
||||
connectionId: 'warehouse',
|
||||
mode: 'enriched',
|
||||
detectRelationships: true,
|
||||
connector: scanConnector,
|
||||
context: { runId: 'iso-2' },
|
||||
providers,
|
||||
stateStore,
|
||||
syncId: 'iso-s2',
|
||||
llmIdentity,
|
||||
embeddingIdentity: { model: 'embed-v2', dimensions: 6, batchSize: 64 },
|
||||
});
|
||||
|
||||
// Only the embeddings hash moved: descriptions + relationships resume from
|
||||
// cache, embeddings recompute. No LLM description/proposal calls fire.
|
||||
expect(rerun.state.resumedStages).toEqual(['descriptions', 'relationships']);
|
||||
expect(rerun.state.completedStages).toEqual(['descriptions', 'embeddings', 'relationships']);
|
||||
expect(generateObject).not.toHaveBeenCalled();
|
||||
expect(embedBatch).toHaveBeenCalled();
|
||||
});
|
||||
|
||||
it('warns when a selected stage cannot run because its prerequisite is missing', async () => {
|
||||
const result = await runLocalScanEnrichment({
|
||||
connectionId: 'warehouse',
|
||||
mode: 'enriched',
|
||||
detectRelationships: false,
|
||||
connector: connector(),
|
||||
context: { runId: 'prereq-1' },
|
||||
// No embedding provider configured.
|
||||
providers: createDeterministicLocalScanEnrichmentProviders(),
|
||||
stages: ['embeddings'],
|
||||
llmIdentity: { model: 'fake', baseUrlConfigured: false },
|
||||
});
|
||||
|
||||
expect(result.summary.embeddings).toBe('skipped');
|
||||
expect(result.warnings).toContainEqual(
|
||||
expect.objectContaining({ code: 'enrichment_stage_skipped', metadata: { stage: 'embeddings' } }),
|
||||
);
|
||||
});
|
||||
|
||||
it('feeds on-disk descriptions into the llmProposals prompt on a relationships-only run', async () => {
|
||||
const executor = new InMemorySqliteExecutor();
|
||||
try {
|
||||
executor.db.exec(`
|
||||
CREATE TABLE accounts (id INTEGER NOT NULL);
|
||||
CREATE TABLE orders (id INTEGER NOT NULL, account_id INTEGER NOT NULL);
|
||||
INSERT INTO accounts (id) VALUES (1), (2);
|
||||
INSERT INTO orders (id, account_id) VALUES (10, 1), (11, 1), (12, 2);
|
||||
`);
|
||||
const scanConnector = {
|
||||
...connector(),
|
||||
driver: 'sqlite' as const,
|
||||
capabilities: createKtxConnectorCapabilities({ readOnlySql: true, columnStats: true }),
|
||||
introspect: vi.fn(async () => noDeclaredRelationshipSnapshot()),
|
||||
executeReadOnly: executor.executeReadOnly.bind(executor),
|
||||
};
|
||||
const providers = createDeterministicLocalScanEnrichmentProviders();
|
||||
const generateObject = vi.spyOn(providers.llmRuntime, 'generateObject');
|
||||
const onDiskDescriptions: Array<{
|
||||
table: { catalog: null; db: null; name: string };
|
||||
tableDescription: string | null;
|
||||
columnDescriptions: Record<string, string | null>;
|
||||
}> = [
|
||||
{
|
||||
table: { catalog: null, db: null, name: 'orders' },
|
||||
tableDescription: 'Customer purchase orders',
|
||||
columnDescriptions: { id: 'Order identifier', account_id: 'The owning account reference' },
|
||||
},
|
||||
{
|
||||
table: { catalog: null, db: null, name: 'accounts' },
|
||||
tableDescription: 'Account records',
|
||||
columnDescriptions: { id: 'Account identifier' },
|
||||
},
|
||||
];
|
||||
|
||||
await runLocalScanEnrichment({
|
||||
connectionId: 'warehouse',
|
||||
mode: 'enriched',
|
||||
detectRelationships: true,
|
||||
connector: scanConnector,
|
||||
context: { runId: 'rel-only-hydration' },
|
||||
providers,
|
||||
stages: ['relationships'],
|
||||
llmIdentity: { model: 'fake', baseUrlConfigured: false },
|
||||
loadPriorDescriptions: async () => onDiskDescriptions,
|
||||
});
|
||||
|
||||
// The relationship-proposal prompt (the only generateObject calls on a
|
||||
// relationships-only run) carries the on-disk descriptions, not just names.
|
||||
const prompts = generateObject.mock.calls.map((call) => String((call[0] as { prompt: string }).prompt));
|
||||
expect(prompts.length).toBeGreaterThan(0);
|
||||
expect(prompts.some((prompt) => prompt.includes('The owning account reference'))).toBe(true);
|
||||
} finally {
|
||||
executor.close();
|
||||
}
|
||||
});
|
||||
|
||||
it('resume record still skips already-enriched tables when a forced descriptions rerun re-enters compute', async () => {
|
||||
const stateStore = memoryEnrichmentStateStore();
|
||||
const scanConnector = connector();
|
||||
const providers = createDeterministicLocalScanEnrichmentProviders();
|
||||
const identity = { llmIdentity: { model: 'fake', baseUrlConfigured: false } };
|
||||
const resumeStore = {
|
||||
load: vi.fn(async () => [
|
||||
{
|
||||
table: { catalog: null, db: 'public', name: 'customers' },
|
||||
tableDescription: 'Recovered customers description',
|
||||
columnDescriptions: { id: 'Recovered id' },
|
||||
},
|
||||
]),
|
||||
flush: vi.fn(async () => {}),
|
||||
};
|
||||
|
||||
// Populate a completed descriptions row so a non-forced run would short-circuit.
|
||||
await runLocalScanEnrichment({
|
||||
connectionId: 'warehouse',
|
||||
mode: 'enriched',
|
||||
detectRelationships: false,
|
||||
connector: scanConnector,
|
||||
context: { runId: 'resume-force-1' },
|
||||
providers,
|
||||
stateStore,
|
||||
syncId: 'resume-force-s1',
|
||||
...identity,
|
||||
});
|
||||
|
||||
const generateObject = vi.spyOn(providers.llmRuntime, 'generateObject');
|
||||
const rerun = await runLocalScanEnrichment({
|
||||
connectionId: 'warehouse',
|
||||
mode: 'enriched',
|
||||
detectRelationships: false,
|
||||
connector: scanConnector,
|
||||
context: { runId: 'resume-force-2' },
|
||||
providers,
|
||||
stateStore,
|
||||
syncId: 'resume-force-s2',
|
||||
stages: ['descriptions'],
|
||||
descriptionResumeStore: resumeStore,
|
||||
...identity,
|
||||
});
|
||||
|
||||
// Forced compute re-entered, consulted the resume record, recovered
|
||||
// 'customers', and only re-issued the LLM for the un-recovered 'orders'.
|
||||
expect(resumeStore.load).toHaveBeenCalled();
|
||||
expect(generateObject).toHaveBeenCalledTimes(1);
|
||||
expect(rerun.descriptionUpdates.find((update) => update.table.name === 'customers')?.tableDescription).toBe(
|
||||
'Recovered customers description',
|
||||
);
|
||||
expect(rerun.state.resumedStages).toEqual([]);
|
||||
});
|
||||
|
||||
it('resumes per table identity, re-enriching a same-named table in another schema', async () => {
|
||||
const multiSchemaSnapshot: KtxSchemaSnapshot = {
|
||||
connectionId: 'warehouse',
|
||||
driver: 'postgres',
|
||||
extractedAt: '2026-04-29T12:00:00.000Z',
|
||||
scope: { schemas: ['analytics', 'staging'] },
|
||||
metadata: {},
|
||||
tables: ['analytics', 'staging'].map((schema) => ({
|
||||
catalog: null,
|
||||
db: schema,
|
||||
name: 'orders',
|
||||
kind: 'table',
|
||||
comment: null,
|
||||
estimatedRows: 1,
|
||||
foreignKeys: [],
|
||||
columns: [
|
||||
{
|
||||
name: 'id',
|
||||
nativeType: 'integer',
|
||||
normalizedType: 'integer',
|
||||
dimensionType: 'number',
|
||||
nullable: false,
|
||||
primaryKey: true,
|
||||
comment: null,
|
||||
},
|
||||
],
|
||||
})),
|
||||
};
|
||||
const scanConnector = connector();
|
||||
const providers = createDeterministicLocalScanEnrichmentProviders();
|
||||
const generateObject = vi.spyOn(providers.llmRuntime, 'generateObject');
|
||||
// Only the analytics.orders description was flushed before the interruption.
|
||||
const resumeStore = {
|
||||
load: vi.fn(async () => [
|
||||
{
|
||||
table: { catalog: null, db: 'analytics', name: 'orders' },
|
||||
tableDescription: 'Recovered analytics orders',
|
||||
columnDescriptions: { id: 'Recovered analytics id' },
|
||||
},
|
||||
]),
|
||||
flush: vi.fn(async () => {}),
|
||||
};
|
||||
|
||||
const result = await runLocalScanEnrichment({
|
||||
connectionId: 'warehouse',
|
||||
mode: 'enriched',
|
||||
detectRelationships: false,
|
||||
connector: scanConnector,
|
||||
snapshot: multiSchemaSnapshot,
|
||||
context: { runId: 'resume-identity' },
|
||||
providers,
|
||||
descriptionResumeStore: resumeStore,
|
||||
relationshipSettings: { ...buildDefaultKtxProjectConfig().scan.relationships, enabled: false },
|
||||
});
|
||||
|
||||
// staging.orders is not recovered (different identity), so it is re-enriched
|
||||
// exactly once; analytics.orders keeps its recovered description.
|
||||
expect(generateObject).toHaveBeenCalledTimes(1);
|
||||
const analytics = result.descriptionUpdates.find((update) => update.table.db === 'analytics');
|
||||
const staging = result.descriptionUpdates.find((update) => update.table.db === 'staging');
|
||||
expect(analytics?.tableDescription).toBe('Recovered analytics orders');
|
||||
expect(staging?.tableDescription).not.toBe('Recovered analytics orders');
|
||||
expect(staging?.tableDescription).toBeTruthy();
|
||||
});
|
||||
|
||||
it('flags an unselected stage stale when its inputs changed, names the cascade, and clears after re-running it', async () => {
|
||||
const stateStore = memoryEnrichmentStateStore();
|
||||
const scanConnector = connector();
|
||||
const providers = {
|
||||
...createDeterministicLocalScanEnrichmentProviders(),
|
||||
embedding: fakeScanEmbedding({ dimensions: 6 }),
|
||||
};
|
||||
const llmIdentity = { model: 'fake', baseUrlConfigured: false };
|
||||
const embeddingV1 = { model: 'embed-v1', dimensions: 6, batchSize: 64 };
|
||||
const embeddingV2 = { model: 'embed-v2', dimensions: 6, batchSize: 64 };
|
||||
|
||||
// Full run captures embeddings + relationships keyed on the v1 embedding model.
|
||||
const full = await runLocalScanEnrichment({
|
||||
connectionId: 'warehouse',
|
||||
mode: 'enriched',
|
||||
detectRelationships: true,
|
||||
connector: scanConnector,
|
||||
context: { runId: 'stale-1' },
|
||||
providers,
|
||||
stateStore,
|
||||
syncId: 'stale-s1',
|
||||
llmIdentity,
|
||||
embeddingIdentity: embeddingV1,
|
||||
});
|
||||
// Stand in for the persisted _schema so embeddings-only runs see the same
|
||||
// descriptions the descriptions stage produces (deterministic content).
|
||||
const loadPriorDescriptions = async () => full.descriptionUpdates;
|
||||
|
||||
// The embedding model changed in config, but the operator re-ran only descriptions.
|
||||
const reDescribe = await runLocalScanEnrichment({
|
||||
connectionId: 'warehouse',
|
||||
mode: 'enriched',
|
||||
detectRelationships: true,
|
||||
connector: scanConnector,
|
||||
context: { runId: 'stale-2' },
|
||||
providers,
|
||||
stateStore,
|
||||
syncId: 'stale-s2',
|
||||
stages: ['descriptions'],
|
||||
loadPriorDescriptions,
|
||||
llmIdentity,
|
||||
embeddingIdentity: embeddingV2,
|
||||
});
|
||||
const stale = reDescribe.warnings.filter((warning) => warning.code === 'enrichment_stage_stale');
|
||||
expect(stale.map((warning) => warning.metadata?.stage)).toEqual(['embeddings']);
|
||||
expect(stale[0]?.message).toContain('--stages embeddings');
|
||||
|
||||
// Re-embedding on v2 stores the fresh embeddings hash, clearing the staleness.
|
||||
await runLocalScanEnrichment({
|
||||
connectionId: 'warehouse',
|
||||
mode: 'enriched',
|
||||
detectRelationships: true,
|
||||
connector: scanConnector,
|
||||
context: { runId: 'stale-3' },
|
||||
providers,
|
||||
stateStore,
|
||||
syncId: 'stale-s3',
|
||||
stages: ['embeddings'],
|
||||
loadPriorDescriptions,
|
||||
llmIdentity,
|
||||
embeddingIdentity: embeddingV2,
|
||||
});
|
||||
const afterReembed = await runLocalScanEnrichment({
|
||||
connectionId: 'warehouse',
|
||||
mode: 'enriched',
|
||||
detectRelationships: true,
|
||||
connector: scanConnector,
|
||||
context: { runId: 'stale-4' },
|
||||
providers,
|
||||
stateStore,
|
||||
syncId: 'stale-s4',
|
||||
stages: ['descriptions'],
|
||||
loadPriorDescriptions,
|
||||
llmIdentity,
|
||||
embeddingIdentity: embeddingV2,
|
||||
});
|
||||
expect(afterReembed.warnings.filter((warning) => warning.code === 'enrichment_stage_stale')).toEqual([]);
|
||||
});
|
||||
|
||||
const enrichedFixtureSnapshot = (): KtxSchemaSnapshot => ({
|
||||
connectionId: 'warehouse',
|
||||
driver: 'sqlite',
|
||||
extractedAt: '2026-05-07T00:00:00.000Z',
|
||||
scope: {},
|
||||
metadata: {},
|
||||
tables: [
|
||||
{
|
||||
catalog: null,
|
||||
db: null,
|
||||
name: 'accounts',
|
||||
kind: 'table',
|
||||
comment: 'DB accounts',
|
||||
estimatedRows: 2,
|
||||
foreignKeys: [],
|
||||
columns: [
|
||||
{
|
||||
name: 'id',
|
||||
nativeType: 'INTEGER',
|
||||
normalizedType: 'integer',
|
||||
dimensionType: 'number',
|
||||
nullable: false,
|
||||
primaryKey: false,
|
||||
comment: 'DB accounts id',
|
||||
},
|
||||
],
|
||||
},
|
||||
{
|
||||
catalog: null,
|
||||
db: null,
|
||||
name: 'orders',
|
||||
kind: 'table',
|
||||
comment: 'DB orders',
|
||||
estimatedRows: 3,
|
||||
foreignKeys: [],
|
||||
columns: [
|
||||
{
|
||||
name: 'id',
|
||||
nativeType: 'INTEGER',
|
||||
normalizedType: 'integer',
|
||||
dimensionType: 'number',
|
||||
nullable: false,
|
||||
primaryKey: false,
|
||||
comment: 'DB orders id',
|
||||
},
|
||||
{
|
||||
name: 'account_id',
|
||||
nativeType: 'INTEGER',
|
||||
normalizedType: 'integer',
|
||||
dimensionType: 'number',
|
||||
nullable: false,
|
||||
primaryKey: false,
|
||||
comment: 'DB account ref',
|
||||
},
|
||||
],
|
||||
},
|
||||
],
|
||||
});
|
||||
|
||||
const countKeyOccurrences = (text: string, key: string): number =>
|
||||
(text.match(new RegExp(`\\b${key}:`, 'g')) ?? []).length;
|
||||
|
||||
// Regression (spec 21 defect, 2026-06-24): a --stages subset that omits a stage
|
||||
// must not delete that stage's on-disk artifacts from the written _schema.
|
||||
it('a --stages relationships run preserves on-disk descriptions while adding joins', async () => {
|
||||
const tempDir = await mkdtemp(join(tmpdir(), 'ktx-stage-preserve-rel-'));
|
||||
const executor = new InMemorySqliteExecutor();
|
||||
try {
|
||||
executor.db.exec(`
|
||||
CREATE TABLE accounts (id INTEGER NOT NULL);
|
||||
CREATE TABLE orders (id INTEGER NOT NULL, account_id INTEGER NOT NULL);
|
||||
INSERT INTO accounts (id) VALUES (1), (2);
|
||||
INSERT INTO orders (id, account_id) VALUES (10, 1), (11, 1), (12, 2);
|
||||
`);
|
||||
const project = await initKtxProject({ projectDir: join(tempDir, 'project') });
|
||||
const shardPath = 'semantic-layer/warehouse/_schema/public.yaml';
|
||||
// Enriched fixture: full ai + db descriptions, zero joins.
|
||||
await project.fileStore.writeFile(
|
||||
shardPath,
|
||||
YAML.stringify(
|
||||
{
|
||||
tables: {
|
||||
accounts: {
|
||||
table: 'accounts',
|
||||
descriptions: { ai: 'AI accounts table', db: 'DB accounts' },
|
||||
columns: [{ name: 'id', type: 'number', descriptions: { ai: 'AI accounts id', db: 'DB accounts id' } }],
|
||||
},
|
||||
orders: {
|
||||
table: 'orders',
|
||||
descriptions: { ai: 'AI orders table', db: 'DB orders' },
|
||||
columns: [
|
||||
{ name: 'id', type: 'number', descriptions: { ai: 'AI orders id', db: 'DB orders id' } },
|
||||
{ name: 'account_id', type: 'number', descriptions: { ai: 'AI account ref', db: 'DB account ref' } },
|
||||
],
|
||||
},
|
||||
},
|
||||
},
|
||||
{ indent: 2, lineWidth: 0 },
|
||||
),
|
||||
'ktx',
|
||||
'ktx@example.com',
|
||||
'Seed enriched fixture',
|
||||
);
|
||||
const before = await readFile(join(project.projectDir, shardPath), 'utf-8');
|
||||
const aiBefore = countKeyOccurrences(before, 'ai');
|
||||
const dbBefore = countKeyOccurrences(before, 'db');
|
||||
expect(aiBefore).toBeGreaterThan(0);
|
||||
|
||||
const scanConnector = {
|
||||
...connector(),
|
||||
driver: 'sqlite' as const,
|
||||
capabilities: createKtxConnectorCapabilities({ readOnlySql: true, columnStats: true }),
|
||||
introspect: vi.fn(async () => enrichedFixtureSnapshot()),
|
||||
executeReadOnly: executor.executeReadOnly.bind(executor),
|
||||
};
|
||||
const result = await runLocalScanEnrichment({
|
||||
connectionId: 'warehouse',
|
||||
mode: 'enriched',
|
||||
detectRelationships: true,
|
||||
connector: scanConnector,
|
||||
context: { runId: 'preserve-rel-1' },
|
||||
providers: createDeterministicLocalScanEnrichmentProviders(),
|
||||
stages: ['relationships'],
|
||||
syncId: 'sync-preserve-rel',
|
||||
loadPriorDescriptions: (snap) => loadOnDiskDescriptionUpdates(project, 'warehouse', snap),
|
||||
});
|
||||
await writeLocalScanEnrichmentArtifacts({
|
||||
project,
|
||||
connectionId: 'warehouse',
|
||||
syncId: 'sync-preserve-rel',
|
||||
driver: 'sqlite',
|
||||
enrichment: result,
|
||||
dryRun: false,
|
||||
});
|
||||
|
||||
const after = await readFile(join(project.projectDir, shardPath), 'utf-8');
|
||||
// Every prior ai:/db: description survived the relationships-only run...
|
||||
expect(countKeyOccurrences(after, 'ai')).toBe(aiBefore);
|
||||
expect(countKeyOccurrences(after, 'db')).toBe(dbBefore);
|
||||
expect(after).toContain('AI orders table');
|
||||
expect(after).toContain('AI account ref');
|
||||
// ...and the relationships stage actually added joins (it was 0 before).
|
||||
expect(result.relationships.accepted).toBeGreaterThan(0);
|
||||
const shard = YAML.parse(after) as { tables: Record<string, { joins?: unknown[] }> };
|
||||
expect(Object.values(shard.tables).some((table) => (table.joins ?? []).length > 0)).toBe(true);
|
||||
} finally {
|
||||
executor.close();
|
||||
await rm(tempDir, { recursive: true, force: true });
|
||||
}
|
||||
});
|
||||
|
||||
it('a --stages descriptions run preserves on-disk joins while refreshing descriptions', async () => {
|
||||
const tempDir = await mkdtemp(join(tmpdir(), 'ktx-stage-preserve-desc-'));
|
||||
try {
|
||||
const project = await initKtxProject({ projectDir: join(tempDir, 'project') });
|
||||
const shardPath = 'semantic-layer/warehouse/_schema/public.yaml';
|
||||
// Fixture: an inferred join present, descriptions absent.
|
||||
await project.fileStore.writeFile(
|
||||
shardPath,
|
||||
YAML.stringify(
|
||||
{
|
||||
tables: {
|
||||
accounts: { table: 'accounts', columns: [{ name: 'id', type: 'number' }] },
|
||||
orders: {
|
||||
table: 'orders',
|
||||
columns: [
|
||||
{ name: 'id', type: 'number' },
|
||||
{ name: 'account_id', type: 'number' },
|
||||
],
|
||||
joins: [
|
||||
{ to: 'accounts', on: 'orders.account_id = accounts.id', relationship: 'many_to_one', source: 'inferred' },
|
||||
],
|
||||
},
|
||||
},
|
||||
},
|
||||
{ indent: 2, lineWidth: 0 },
|
||||
),
|
||||
'ktx',
|
||||
'ktx@example.com',
|
||||
'Seed joins fixture',
|
||||
);
|
||||
|
||||
const scanConnector = {
|
||||
...connector(),
|
||||
driver: 'sqlite' as const,
|
||||
introspect: vi.fn(async () => enrichedFixtureSnapshot()),
|
||||
};
|
||||
const result = await runLocalScanEnrichment({
|
||||
connectionId: 'warehouse',
|
||||
mode: 'enriched',
|
||||
detectRelationships: true,
|
||||
connector: scanConnector,
|
||||
context: { runId: 'preserve-desc-1' },
|
||||
providers: createDeterministicLocalScanEnrichmentProviders(),
|
||||
stages: ['descriptions'],
|
||||
syncId: 'sync-preserve-desc',
|
||||
loadPriorDescriptions: (snap) => loadOnDiskDescriptionUpdates(project, 'warehouse', snap),
|
||||
});
|
||||
await writeLocalScanEnrichmentArtifacts({
|
||||
project,
|
||||
connectionId: 'warehouse',
|
||||
syncId: 'sync-preserve-desc',
|
||||
driver: 'sqlite',
|
||||
enrichment: result,
|
||||
dryRun: false,
|
||||
});
|
||||
|
||||
const after = await readFile(join(project.projectDir, shardPath), 'utf-8');
|
||||
const shard = YAML.parse(after) as {
|
||||
tables: Record<string, { joins?: Array<{ to: string; source: string }> }>;
|
||||
};
|
||||
// The inferred join survived the descriptions-only run...
|
||||
expect(shard.tables.orders?.joins?.some((join) => join.to === 'accounts' && join.source === 'inferred')).toBe(true);
|
||||
// ...and the descriptions stage (re)wrote ai descriptions.
|
||||
expect(countKeyOccurrences(after, 'ai')).toBeGreaterThan(0);
|
||||
} finally {
|
||||
await rm(tempDir, { recursive: true, force: true });
|
||||
}
|
||||
});
|
||||
});
@ -96,6 +96,7 @@ function deterministicLlmRuntime(): KtxLlmRuntimePort {
|
|||
generateText: vi.fn(async (input) => `Deterministic description for ${input.prompt.slice(0, 64).trim() || 'data source'}`),
|
||||
generateObject: vi.fn(async () => ({ pkCandidates: [], fkCandidates: [] }) as never),
|
||||
runAgentLoop: vi.fn(),
|
||||
subprocessForkSpec: () => null,
|
||||
};
|
||||
}
|
||||
|
||||
|
|
@ -1672,6 +1673,111 @@ describe('local scan', () => {
|
|||
expect(persistedReport).toContain('embedding service timed out');
|
||||
});
|
||||
|
||||
it('keeps AI descriptions in the queryable _schema when the relationship stage fails after enrichment', async () => {
|
||||
// Durability: the paid descriptions are checkpointed into the queryable
|
||||
// manifest before relationship detection runs, so a relationship-stage
|
||||
// failure degrades to "no joins", never "no descriptions".
|
||||
project.config.scan.enrichment = { mode: 'deterministic' };
|
||||
const connector = {
|
||||
id: 'test:warehouse',
|
||||
driver: 'postgres' as const,
|
||||
capabilities: {
|
||||
structuralIntrospection: true as const,
|
||||
tableSampling: true,
|
||||
columnSampling: true,
|
||||
columnStats: true,
|
||||
readOnlySql: true,
|
||||
nestedAnalysis: false,
|
||||
eventStreamDiscovery: false,
|
||||
formalForeignKeys: false,
|
||||
estimatedRowCounts: true,
|
||||
},
|
||||
...connectorScopeListing,
|
||||
async introspect() {
|
||||
return {
|
||||
connectionId: 'warehouse',
|
||||
driver: 'postgres' as const,
|
||||
extractedAt: '2026-04-29T09:00:00.000Z',
|
||||
scope: { schemas: ['public'] },
|
||||
metadata: {},
|
||||
tables: [
|
||||
{
|
||||
catalog: null,
|
||||
db: 'public',
|
||||
name: 'customers',
|
||||
kind: 'table' as const,
|
||||
comment: 'Customer accounts',
|
||||
estimatedRows: 100,
|
||||
foreignKeys: [],
|
||||
columns: [
|
||||
{
|
||||
name: 'id',
|
||||
nativeType: 'integer',
|
||||
normalizedType: 'integer',
|
||||
dimensionType: 'number' as const,
|
||||
nullable: false,
|
||||
primaryKey: true,
|
||||
comment: 'Customer id',
|
||||
},
|
||||
],
|
||||
},
|
||||
{
|
||||
catalog: null,
|
||||
db: 'public',
|
||||
name: 'orders',
|
||||
kind: 'table' as const,
|
||||
comment: 'Customer orders',
|
||||
estimatedRows: 1000,
|
||||
foreignKeys: [],
|
||||
columns: [
|
||||
{
|
||||
name: 'customer_id',
|
||||
nativeType: 'integer',
|
||||
normalizedType: 'integer',
|
||||
dimensionType: 'number' as const,
|
||||
nullable: false,
|
||||
primaryKey: false,
|
||||
comment: 'Owning customer',
|
||||
},
|
||||
],
|
||||
},
|
||||
],
|
||||
};
|
||||
},
|
||||
async sampleTable() {
|
||||
return { headers: ['id'], rows: [[1]], totalRows: 1 };
|
||||
},
|
||||
async sampleColumn() {
|
||||
return { values: ['1'], nullCount: 0, distinctCount: 1 };
|
||||
},
|
||||
// Profiling succeeds; the coverage probe in the relationship stage throws,
|
||||
// standing in for a relationship-stage interruption after enrichment.
|
||||
async executeReadOnly(input: KtxReadOnlyQueryInput) {
|
||||
return relationshipSqlResult(input, { throwOnCoverage: true });
|
||||
},
|
||||
};
|
||||
|
||||
const result = await runLocalScan({
|
||||
project,
|
||||
adapters: [fetchOnlyAdapter({ snapshot: await connector.introspect() })],
|
||||
connectionId: 'warehouse',
|
||||
mode: 'enriched',
|
||||
detectRelationships: true,
|
||||
connector,
|
||||
jobId: 'scan-checkpoint-durability-1',
|
||||
now: () => new Date('2026-04-29T09:20:00.000Z'),
|
||||
});
|
||||
|
||||
expect(result.report.warnings.map((warning) => warning.code)).toContain('enrichment_failed');
|
||||
|
||||
const manifestRaw = await readFile(
|
||||
join(project.projectDir, 'semantic-layer/warehouse/_schema/public.yaml'),
|
||||
'utf-8',
|
||||
);
|
||||
expect(manifestRaw).toContain('ai: |-');
|
||||
expect(manifestRaw).toContain('Deterministic description');
|
||||
});
|
||||
|
||||
it('resumes completed local enrichment stages when an enriched scan run is retried', async () => {
|
||||
let embeddingAttempts = 0;
|
||||
const connector = {
|
||||
|
|
@ -1928,6 +2034,147 @@ describe('local scan', () => {
|
|||
'raw-sources/warehouse/live-database/2026-04-29-160000-scan-run-sqlserver/scan-report.json',
|
||||
);
|
||||
});
|
||||
|
||||
// Regression (spec 21 defect, 2026-06-24): the structural manifest write that runs
|
||||
// BEFORE enrichment must not let a `--stages` subset delete the prior on-disk
|
||||
// descriptions. This goes through the full runLocalScan path (the unit-level
|
||||
// enrichment test could not catch the structural-pre-write ordering).
|
||||
it('a --stages relationships scan preserves on-disk descriptions while adding joins', async () => {
|
||||
const snapshot: KtxSchemaSnapshot = {
|
||||
connectionId: 'warehouse',
|
||||
driver: 'postgres',
|
||||
extractedAt: '2026-05-07T09:00:00.000Z',
|
||||
scope: {},
|
||||
metadata: {},
|
||||
tables: [
|
||||
{
|
||||
catalog: null,
|
||||
db: null,
|
||||
name: 'accounts',
|
||||
kind: 'table',
|
||||
comment: null,
|
||||
estimatedRows: 2,
|
||||
foreignKeys: [],
|
||||
columns: [
|
||||
{
|
||||
name: 'id',
|
||||
nativeType: 'integer',
|
||||
normalizedType: 'integer',
|
||||
dimensionType: 'number',
|
||||
nullable: false,
|
||||
primaryKey: false,
|
||||
comment: null,
|
||||
},
|
||||
],
|
||||
},
|
||||
{
|
||||
catalog: null,
|
||||
db: null,
|
||||
name: 'orders',
|
||||
kind: 'table',
|
||||
comment: null,
|
||||
estimatedRows: 3,
|
||||
foreignKeys: [],
|
||||
columns: [
|
||||
{
|
||||
name: 'id',
|
||||
nativeType: 'integer',
|
||||
normalizedType: 'integer',
|
||||
dimensionType: 'number',
|
||||
nullable: false,
|
||||
primaryKey: false,
|
||||
comment: null,
|
||||
},
|
||||
{
|
||||
name: 'account_id',
|
||||
nativeType: 'integer',
|
||||
normalizedType: 'integer',
|
||||
dimensionType: 'number',
|
||||
nullable: false,
|
||||
primaryKey: false,
|
||||
comment: null,
|
||||
},
|
||||
],
|
||||
},
|
||||
],
|
||||
};
|
||||
// Enriched fixture already on disk: ai descriptions, zero joins.
|
||||
await project.fileStore.writeFile(
|
||||
'semantic-layer/warehouse/_schema/public.yaml',
|
||||
YAML.stringify(
|
||||
{
|
||||
tables: {
|
||||
accounts: {
|
||||
table: 'accounts',
|
||||
descriptions: { ai: 'AI accounts table' },
|
||||
columns: [{ name: 'id', type: 'number', descriptions: { ai: 'AI accounts id' } }],
|
||||
},
|
||||
orders: {
|
||||
table: 'orders',
|
||||
descriptions: { ai: 'AI orders table' },
|
||||
columns: [
|
||||
{ name: 'id', type: 'number', descriptions: { ai: 'AI orders id' } },
|
||||
{ name: 'account_id', type: 'number', descriptions: { ai: 'AI account ref' } },
|
||||
],
|
||||
},
|
||||
},
|
||||
},
|
||||
{ indent: 2, lineWidth: 0 },
|
||||
),
|
||||
'ktx',
|
||||
'ktx@example.com',
|
||||
'Seed enriched fixture',
|
||||
);
|
||||
const shardPath = 'semantic-layer/warehouse/_schema/public.yaml';
|
||||
const aiBefore = ((await project.fileStore.readFile(shardPath)).content.match(/\bai:/g) ?? []).length;
|
||||
expect(aiBefore).toBe(5);
|
||||
|
||||
const connector: KtxScanConnector = {
|
||||
id: 'test:warehouse',
|
||||
driver: 'postgres',
|
||||
capabilities: {
|
||||
structuralIntrospection: true,
|
||||
tableSampling: false,
|
||||
columnSampling: false,
|
||||
columnStats: true,
|
||||
readOnlySql: true,
|
||||
nestedAnalysis: false,
|
||||
eventStreamDiscovery: false,
|
||||
formalForeignKeys: false,
|
||||
estimatedRowCounts: true,
|
||||
},
|
||||
...connectorScopeListing,
|
||||
introspect: vi.fn(async () => snapshot),
|
||||
async executeReadOnly(input: KtxReadOnlyQueryInput) {
|
||||
return relationshipSqlResult(input);
|
||||
},
|
||||
};
|
||||
|
||||
const result = await runLocalScan({
|
||||
project,
|
||||
adapters: [fetchOnlyAdapter({ snapshot })],
|
||||
connectionId: 'warehouse',
|
||||
mode: 'enriched',
|
||||
detectRelationships: true,
|
||||
stages: ['relationships'],
|
||||
connector,
|
||||
enrichmentProviders: { llmRuntime: deterministicLlmRuntime() },
|
||||
jobId: 'scan-stages-relationships-preserve',
|
||||
now: () => new Date('2026-05-07T09:30:00.000Z'),
|
||||
});
|
||||
|
||||
// The relationships stage actually ran and accepted a join...
|
||||
expect(result.report.relationships.accepted).toBe(1);
|
||||
const after = (await project.fileStore.readFile(shardPath)).content;
|
||||
// ...and every prior ai description survived the structural + enrichment writes.
|
||||
expect((after.match(/\bai:/g) ?? []).length).toBe(aiBefore);
|
||||
expect(after).toContain('AI orders table');
|
||||
expect(after).toContain('AI account ref');
|
||||
const manifest = YAML.parse(after) as {
|
||||
tables: Record<string, { joins?: Array<{ to: string; source: string }> }>;
|
||||
};
|
||||
expect(manifest.tables.orders?.joins?.some((join) => join.to === 'accounts')).toBe(true);
|
||||
});
|
||||
});
|
||||
|
||||
describe('resolveEnabledTables', () => {
|
||||
|
|
|
|||
47
packages/cli/test/context/scan/object-introspection.test.ts
Normal file
|
|
@ -0,0 +1,47 @@
|
|||
import { describe, expect, it } from 'vitest';
|
||||
import { tryIntrospectObject } from '../../../src/context/scan/object-introspection.js';
|
||||
|
||||
describe('tryIntrospectObject', () => {
|
||||
it('returns the read value when introspection succeeds', async () => {
|
||||
await expect(tryIntrospectObject({ object: 'customers' }, () => ({ name: 'customers' }))).resolves.toEqual({
|
||||
ok: true,
|
||||
table: { name: 'customers' },
|
||||
});
|
||||
});
|
||||
|
||||
it('skips with a recoverable warning when the object read throws', async () => {
|
||||
const outcome = await tryIntrospectObject({ object: 'broken_view', db: 'main' }, () => {
|
||||
throw new Error('no such column: ehp.start_date');
|
||||
});
|
||||
|
||||
expect(outcome).toEqual({
|
||||
ok: false,
|
||||
warning: {
|
||||
code: 'object_introspection_failed',
|
||||
message: 'no such column: ehp.start_date',
|
||||
table: 'broken_view',
|
||||
recoverable: true,
|
||||
metadata: { object: 'main.broken_view', db: 'main' },
|
||||
},
|
||||
});
|
||||
});
|
||||
|
||||
it('rethrows native programming faults instead of masking them as object skips', async () => {
|
||||
await expect(
|
||||
tryIntrospectObject({ object: 'customers' }, () => {
|
||||
throw new TypeError('cannot read properties of undefined');
|
||||
}),
|
||||
).rejects.toBeInstanceOf(TypeError);
|
||||
});
|
||||
|
||||
it('builds a fully-qualified object label for warehouse objects', async () => {
|
||||
const outcome = await tryIntrospectObject({ object: 'orders', db: 'sales', catalog: 'warehouse' }, () => {
|
||||
throw new Error('permission denied');
|
||||
});
|
||||
expect(outcome.ok).toBe(false);
|
||||
if (!outcome.ok) {
|
||||
expect(outcome.warning.table).toBe('orders');
|
||||
expect(outcome.warning.metadata).toEqual({ object: 'warehouse.sales.orders', db: 'sales', catalog: 'warehouse' });
|
||||
}
|
||||
});
|
||||
});
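The four cases above pin down the contract of `tryIntrospectObject` fairly tightly, so a minimal sketch of a function that would satisfy them may help when reading the tests. Everything below is an assumption inferred from the expectations only: the names `ObjectRef`, `IntrospectionWarning`, `objectLabel`, `toWarning`, `isProgrammingFault`, and `tryIntrospectObjectSketch` are hypothetical, and the real module in `src/context/scan/object-introspection.js` may be structured differently. In particular, the tests only confirm that `TypeError` is rethrown; treating `ReferenceError` the same way is a guess.

```typescript
// Hypothetical shapes inferred from the test expectations, not the real ktx types.
interface ObjectRef {
  object: string;
  db?: string;
  catalog?: string;
}

interface IntrospectionWarning {
  code: 'object_introspection_failed';
  message: string;
  table: string;
  recoverable: true;
  metadata: Record<string, string>;
}

type IntrospectionOutcome<T> =
  | { ok: true; table: T }
  | { ok: false; warning: IntrospectionWarning };

// Assumption: native programming faults propagate instead of becoming skips.
// Only TypeError is confirmed by the tests; ReferenceError is a guess.
function isProgrammingFault(error: unknown): boolean {
  return error instanceof TypeError || error instanceof ReferenceError;
}

// Joins catalog, db, and object name with dots, omitting missing parts.
function objectLabel(ref: ObjectRef): string {
  return [ref.catalog, ref.db, ref.object]
    .filter((part): part is string => Boolean(part))
    .join('.');
}

// Builds the recoverable warning attached to a skipped object.
function toWarning(ref: ObjectRef, error: unknown): IntrospectionWarning {
  const metadata: Record<string, string> = { object: objectLabel(ref) };
  if (ref.db) metadata.db = ref.db;
  if (ref.catalog) metadata.catalog = ref.catalog;
  return {
    code: 'object_introspection_failed',
    message: error instanceof Error ? error.message : String(error),
    table: ref.object,
    recoverable: true,
    metadata,
  };
}

// Runs one object read; a failure becomes a warning unless it is a programming fault.
async function tryIntrospectObjectSketch<T>(
  ref: ObjectRef,
  read: () => T | Promise<T>,
): Promise<IntrospectionOutcome<T>> {
  try {
    return { ok: true, table: await read() };
  } catch (error) {
    if (isProgrammingFault(error)) throw error;
    return { ok: false, warning: toWarning(ref, error) };
  }
}
```

Under this reading, one broken view costs the scan a single object and a warning, while a bug in the scanner itself still surfaces as a thrown error.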
|
||||
|
|
@ -0,0 +1,72 @@
|
|||
import { describe, expect, it } from 'vitest';
|
||||
import {
|
||||
createKtxRelationshipDetectionBudget,
|
||||
mapWithBudget,
|
||||
} from '../../../src/context/scan/relationship-detection-budget.js';
|
||||
|
||||
describe('relationship detection budget', () => {
|
||||
it('reports no stop while inside the wall-clock budget', () => {
|
||||
let clock = 1000;
|
||||
const budget = createKtxRelationshipDetectionBudget({ budgetMs: 500, now: () => clock });
|
||||
expect(budget.check()).toBeNull();
|
||||
clock = 1400;
|
||||
expect(budget.check()).toBeNull();
|
||||
expect(budget.stopReason()).toBeNull();
|
||||
});
|
||||
|
||||
it('trips on budget exhaustion and records it stickily', () => {
|
||||
let clock = 0;
|
||||
const budget = createKtxRelationshipDetectionBudget({ budgetMs: 100, now: () => clock });
|
||||
clock = 150;
|
||||
expect(budget.check()).toBe('budget');
|
||||
// Even after a notional clock rewind the recorded reason persists.
|
||||
clock = 10;
|
||||
expect(budget.stopReason()).toBe('budget');
|
||||
});
|
||||
|
||||
it('prefers abort over budget when the signal fires', () => {
|
||||
const controller = new AbortController();
|
||||
let clock = 0;
|
||||
const budget = createKtxRelationshipDetectionBudget({
|
||||
budgetMs: 1_000,
|
||||
signal: controller.signal,
|
||||
now: () => clock,
|
||||
});
|
||||
expect(budget.check()).toBeNull();
|
||||
controller.abort();
|
||||
expect(budget.check()).toBe('aborted');
|
||||
expect(budget.stopReason()).toBe('aborted');
|
||||
});
|
||||
|
||||
it('maps every item and stays unmarked when the budget is never exhausted', async () => {
|
||||
const budget = createKtxRelationshipDetectionBudget({ budgetMs: 1_000, now: () => 0 });
|
||||
const { results, processedCount } = await mapWithBudget({
|
||||
inputs: [1, 2, 3, 4],
|
||||
concurrency: 2,
|
||||
budget,
|
||||
mapOne: async (value) => value * 10,
|
||||
});
|
||||
expect(processedCount).toBe(4);
|
||||
expect(results).toEqual([10, 20, 30, 40]);
|
||||
expect(budget.stopReason()).toBeNull();
|
||||
});
|
||||
|
||||
it('stops claiming new items once the budget trips and leaves the rest undefined', async () => {
|
||||
let clock = 0;
|
||||
const budget = createKtxRelationshipDetectionBudget({ budgetMs: 25, now: () => clock });
|
||||
const started: number[] = [];
|
||||
const { results, processedCount } = await mapWithBudget({
|
||||
inputs: [0, 1, 2, 3, 4],
|
||||
concurrency: 1,
|
||||
budget,
|
||||
onStart: (index) => {
|
||||
started.push(index);
|
||||
clock += 10; // each unit advances the clock; the budget elapses partway through
|
||||
},
|
||||
mapOne: async (value) => value,
|
||||
});
|
||||
expect(processedCount).toBeLessThan(5);
|
||||
expect(results.slice(processedCount).every((value) => value === undefined)).toBe(true);
|
||||
expect(budget.stopReason()).toBe('budget');
|
||||
});
|
||||
});
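The budget tests above describe a small state machine: a stop reason is `null` until either the wall-clock budget elapses or the signal aborts, and once recorded it stays recorded. The following is a rough sketch of that behaviour, not the actual `relationship-detection-budget.js` code. `createBudgetSketch` and `mapWhileWithinBudget` are hypothetical names, the mapper is deliberately sequential and synchronous (the tested `mapWithBudget` is async and takes a `concurrency` option), and comparing elapsed time with `>=` rather than `>` is an assumption the tests do not settle.

```typescript
type StopReason = 'budget' | 'aborted';

interface DetectionBudget {
  check(): StopReason | null;
  stopReason(): StopReason | null;
}

// Structural stand-in for AbortSignal so the sketch needs no DOM or Node typings.
interface AbortLike {
  readonly aborted: boolean;
}

function createBudgetSketch(options: {
  budgetMs: number;
  signal?: AbortLike;
  now?: () => number;
}): DetectionBudget {
  const now = options.now ?? Date.now;
  const startedAt = now(); // the clock is read once at creation
  let recorded: StopReason | null = null;
  return {
    check() {
      if (recorded !== null) return recorded; // sticky: later clock reads cannot clear it
      if (options.signal?.aborted) {
        recorded = 'aborted'; // abort is tested before elapsed time
      } else if (now() - startedAt >= options.budgetMs) {
        recorded = 'budget'; // assumption: reaching the limit counts as exhausted
      }
      return recorded;
    },
    stopReason: () => recorded,
  };
}

// Sequential simplification: stop claiming items once the budget trips,
// leaving the remaining result slots undefined.
function mapWhileWithinBudget<T, R>(
  inputs: readonly T[],
  budget: DetectionBudget,
  mapOne: (value: T) => R,
): { results: (R | undefined)[]; processedCount: number } {
  const results: (R | undefined)[] = inputs.map(() => undefined);
  let processedCount = 0;
  for (const value of inputs) {
    if (budget.check() !== null) break;
    results[processedCount] = mapOne(value);
    processedCount += 1;
  }
  return { results, processedCount };
}
```

Checking the budget only at unit boundaries means an in-flight unit always finishes, which matches the tests' expectation of a partial result rather than a thrown error.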
|
||||
|
|
@ -315,6 +315,26 @@ describe('relationship diagnostics artifacts', () => {
|
|||
expect(diagnostics.summary).toEqual({ accepted: 0, review: 0, rejected: 0, skipped: 0 });
|
||||
expect(diagnostics.noAcceptedReason).toBe('no candidate pairs passed type compatibility');
|
||||
expect(diagnostics.candidateCountsBySource).toEqual({});
|
||||
expect(diagnostics.partial).toBe(false);
|
||||
expect(diagnostics.partialReason).toBeNull();
|
||||
});
|
||||
|
||||
it('marks the diagnostics partial with its stop reason when relationship detection was truncated', () => {
|
||||
const artifacts = buildKtxRelationshipArtifacts({ connectionId: 'warehouse' });
|
||||
const diagnostics = buildKtxRelationshipDiagnostics({
|
||||
connectionId: 'warehouse',
|
||||
generatedAt: '2026-05-07T12:00:00.000Z',
|
||||
artifacts,
|
||||
profile: emptyKtxRelationshipProfileArtifact({
|
||||
connectionId: 'warehouse',
|
||||
driver: 'sqlite',
|
||||
reason: 'relationship_profiling_not_run',
|
||||
}),
|
||||
partial: { reason: 'budget' },
|
||||
});
|
||||
|
||||
expect(diagnostics.partial).toBe(true);
|
||||
expect(diagnostics.partialReason).toBe('budget');
|
||||
});
|
||||
|
||||
it('records composite relationship endpoints in relationship artifacts', () => {
|
||||
|
|
|
|||
|
|
@ -224,6 +224,7 @@ function llmRuntime(output: unknown): KtxLlmRuntimePort {
|
|||
generateText: vi.fn(),
|
||||
generateObject: vi.fn(async () => output) as KtxLlmRuntimePort['generateObject'],
|
||||
runAgentLoop: vi.fn(),
|
||||
subprocessForkSpec: () => null,
|
||||
};
|
||||
}
|
||||
|
||||
|
|
@ -338,6 +339,126 @@ describe('production relationship discovery', () => {
|
|||
});
|
||||
});
|
||||
|
||||
it('emits per-table profiling and per-candidate validation progress', async () => {
|
||||
executor = new InMemorySqliteExecutor();
|
||||
executor.db.exec(`
|
||||
CREATE TABLE accounts (id INTEGER NOT NULL, name TEXT NOT NULL);
|
||||
CREATE TABLE orders (id INTEGER NOT NULL, account_id INTEGER NOT NULL);
|
||||
INSERT INTO accounts (id, name) VALUES (1, 'Acme'), (2, 'Globex');
|
||||
INSERT INTO orders (id, account_id) VALUES (10, 1), (11, 1), (12, 2);
|
||||
`);
|
||||
const messages: string[] = [];
|
||||
const progress = {
|
||||
async update(_progress: number, message?: string) {
|
||||
if (message) {
|
||||
messages.push(message);
|
||||
}
|
||||
},
|
||||
startPhase() {
|
||||
return progress;
|
||||
},
|
||||
};
|
||||
|
||||
const result = await discoverKtxRelationships({
|
||||
connectionId: 'warehouse',
|
||||
dialect: getSqlDialectForDriver('sqlite'),
|
||||
connector: connector(executor),
|
||||
schema: snapshotToKtxEnrichedSchema(snapshot()),
|
||||
context: { runId: 'relationship-progress' },
|
||||
settings: relationshipSettings(),
|
||||
progress,
|
||||
});
|
||||
|
||||
expect(result.partial).toBeNull();
|
||||
expect(messages).toContain('Profiling table 1/2');
|
||||
expect(messages).toContain('Profiling table 2/2');
|
||||
expect(messages.some((message) => message.startsWith('Validating candidate '))).toBe(true);
|
||||
});
|
||||
|
||||
it('returns a partial result when the wall-clock budget is exhausted, without throwing', async () => {
|
||||
executor = new InMemorySqliteExecutor();
|
||||
executor.db.exec(`
|
||||
CREATE TABLE accounts (id INTEGER NOT NULL, name TEXT NOT NULL);
|
||||
CREATE TABLE orders (id INTEGER NOT NULL, account_id INTEGER NOT NULL);
|
||||
INSERT INTO accounts (id, name) VALUES (1, 'Acme'), (2, 'Globex');
|
||||
INSERT INTO orders (id, account_id) VALUES (10, 1), (11, 1), (12, 2);
|
||||
`);
|
||||
// A clock that jumps a full second per read against a 1ms budget exhausts
|
||||
// the budget at the very first unit boundary.
|
||||
let calls = 0;
|
||||
const now = () => calls++ * 1000;
|
||||
|
||||
const result = await discoverKtxRelationships({
|
||||
connectionId: 'warehouse',
|
||||
dialect: getSqlDialectForDriver('sqlite'),
|
||||
connector: connector(executor),
|
||||
schema: snapshotToKtxEnrichedSchema(snapshot()),
|
||||
context: { runId: 'relationship-budget' },
|
||||
settings: { ...relationshipSettings(), detectionBudgetMs: 1 },
|
||||
now,
|
||||
});
|
||||
|
||||
expect(result.partial).toEqual({ reason: 'budget' });
|
||||
expect(result.relationships.accepted).toBe(0);
|
||||
});
|
||||
|
||||
it('does not start the LLM relationship proposal once the budget is exhausted', async () => {
|
||||
executor = new InMemorySqliteExecutor();
|
||||
executor.db.exec(`
|
||||
CREATE TABLE accounts (id INTEGER NOT NULL, name TEXT NOT NULL);
|
||||
CREATE TABLE orders (id INTEGER NOT NULL, account_id INTEGER NOT NULL);
|
||||
INSERT INTO accounts (id, name) VALUES (1, 'Acme'), (2, 'Globex');
|
||||
INSERT INTO orders (id, account_id) VALUES (10, 1), (11, 1), (12, 2);
|
||||
`);
|
||||
let calls = 0;
|
||||
const now = () => calls++ * 1000;
|
||||
const generateObject = vi.fn(async () => ({ pkCandidates: [], fkCandidates: [] }));
|
||||
const runtime: KtxLlmRuntimePort = {
|
||||
generateText: vi.fn(),
|
||||
generateObject: generateObject as KtxLlmRuntimePort['generateObject'],
|
||||
runAgentLoop: vi.fn(),
|
||||
subprocessForkSpec: () => null,
|
||||
};
|
||||
|
||||
const result = await discoverKtxRelationships({
|
||||
connectionId: 'warehouse',
|
||||
dialect: getSqlDialectForDriver('sqlite'),
|
||||
connector: connector(executor),
|
||||
schema: snapshotToKtxEnrichedSchema(snapshot()),
|
||||
context: { runId: 'relationship-budget-llm' },
|
||||
settings: { ...relationshipSettings(), detectionBudgetMs: 1 },
|
||||
llmRuntime: runtime,
|
||||
now,
|
||||
});
|
||||
|
||||
expect(result.partial).toEqual({ reason: 'budget' });
|
||||
expect(result.llmRelationshipValidation).toBe('skipped');
|
||||
expect(generateObject).not.toHaveBeenCalled();
|
||||
});
|
||||
|
||||
it('returns a partial result when the scan signal is already aborted', async () => {
|
||||
executor = new InMemorySqliteExecutor();
|
||||
executor.db.exec(`
|
||||
CREATE TABLE accounts (id INTEGER NOT NULL, name TEXT NOT NULL);
|
||||
CREATE TABLE orders (id INTEGER NOT NULL, account_id INTEGER NOT NULL);
|
||||
INSERT INTO accounts (id, name) VALUES (1, 'Acme'), (2, 'Globex');
|
||||
INSERT INTO orders (id, account_id) VALUES (10, 1), (11, 1), (12, 2);
|
||||
`);
|
||||
|
||||
const result = await discoverKtxRelationships({
|
||||
connectionId: 'warehouse',
|
||||
dialect: getSqlDialectForDriver('sqlite'),
|
||||
connector: connector(executor),
|
||||
schema: snapshotToKtxEnrichedSchema(snapshot()),
|
||||
context: { runId: 'relationship-aborted', signal: AbortSignal.abort() },
|
||||
settings: relationshipSettings(),
|
||||
});
|
||||
|
||||
expect(result.partial).toEqual({ reason: 'aborted' });
|
||||
// A stop-before-completion must not be reported as completed statistical validation.
|
||||
expect(result.statisticalValidation).toBe('skipped');
|
||||
});
|
||||
|
||||
it('accepts a profile-driven natural-key relationship without declared metadata', async () => {
|
||||
executor = new InMemorySqliteExecutor();
|
||||
executor.db.exec(`
|
||||
|
|
|
|||
|
|
@ -9,6 +9,7 @@ function llmRuntime(output?: unknown): KtxLlmRuntimePort {
|
|||
generateText: vi.fn(),
|
||||
generateObject: vi.fn(async () => output) as KtxLlmRuntimePort['generateObject'],
|
||||
runAgentLoop: vi.fn(),
|
||||
subprocessForkSpec: () => null,
|
||||
};
|
||||
}
|
||||
|
||||
|
|
@ -202,6 +203,7 @@ describe('relationship LLM proposals', () => {
|
|||
throw new Error('model unavailable');
|
||||
}),
|
||||
runAgentLoop: vi.fn(),
|
||||
subprocessForkSpec: () => null,
|
||||
},
|
||||
});
|
||||
expect(failed).toMatchObject({ candidates: [], llmCalls: 1, summary: 'failed' });
|
||||
|
|
|
|||
|
|
@ -1,5 +1,6 @@
|
|||
import Database from 'better-sqlite3';
|
||||
import { afterEach, describe, expect, it } from 'vitest';
|
||||
import { KtxQueryError } from '../../../src/errors.js';
|
||||
import { getSqlDialectForDriver } from '../../../src/context/connections/dialects.js';
|
||||
import type { KtxEnrichedColumn, KtxEnrichedSchema, KtxEnrichedTable } from '../../../src/context/scan/enrichment-types.js';
|
||||
import { generateKtxRelationshipDiscoveryCandidates } from '../../../src/context/scan/relationship-candidates.js';
|
||||
|
|
@ -139,6 +140,54 @@ describe('relationship validation', () => {
|
|||
expect(validated[0]?.score).toBeGreaterThanOrEqual(0.85);
|
||||
});
|
||||
|
||||
it('sends a candidate to review (not source-fatal) when its validation query times out', async () => {
|
||||
executor = new InMemorySqliteExecutor();
|
||||
executor.db.exec(`
|
||||
CREATE TABLE accounts (id INTEGER, name TEXT);
|
||||
CREATE TABLE users (id INTEGER, account_id INTEGER);
|
||||
CREATE TABLE invoices (id INTEGER, account_id INTEGER);
|
||||
INSERT INTO accounts (id, name) VALUES (1, 'Acme'), (2, 'Globex'), (3, 'Initech');
|
||||
INSERT INTO users (id, account_id) VALUES (10, 1), (11, 2), (12, 3);
|
||||
INSERT INTO invoices (id, account_id) VALUES (20, 1), (21, 2), (22, 999);
|
||||
`);
|
||||
const testSchema = schema();
|
||||
const profiles = await profileKtxRelationshipSchema({
|
||||
connectionId: 'warehouse',
|
||||
driver: 'sqlite',
|
||||
dialect: getSqlDialectForDriver('sqlite'),
|
||||
schema: testSchema,
|
||||
executor,
|
||||
ctx: { runId: 'validate-test' },
|
||||
});
|
||||
const candidates = generateKtxRelationshipDiscoveryCandidates(testSchema).filter(
|
||||
(candidate) => candidate.from.table.name === 'users',
|
||||
);
|
||||
|
||||
const warnings: string[] = [];
|
||||
const timingOutExecutor = {
|
||||
executeReadOnly: () => Promise.reject(new KtxQueryError('query exceeded 30s')),
|
||||
};
|
||||
const validated = await validateKtxRelationshipDiscoveryCandidates({
|
||||
connectionId: 'warehouse',
|
||||
dialect: getSqlDialectForDriver('sqlite'),
|
||||
candidates,
|
||||
profiles,
|
||||
executor: timingOutExecutor,
|
||||
ctx: {
|
||||
runId: 'validate-test',
|
||||
logger: { debug() {}, info() {}, warn: (message) => warnings.push(message), error() {} },
|
||||
},
|
||||
tableCount: testSchema.tables.length,
|
||||
});
|
||||
|
||||
expect(validated).toHaveLength(1);
|
||||
expect(validated[0]).toMatchObject({
|
||||
status: 'review',
|
||||
validation: { reasons: ['validation_query_failed'] },
|
||||
});
|
||||
expect(warnings.some((message) => message.includes('query exceeded 30s'))).toBe(true);
|
||||
});
|
||||
|
||||
it('rejects a candidate with missing parent values and records the deterministic reason', async () => {
|
||||
executor = new InMemorySqliteExecutor();
|
||||
executor.db.exec(`
|
||||
|
|
|
|||
|
|
@@ -6,10 +6,12 @@ import { initKtxProject, type KtxLocalProject } from '../../../src/context/proje
|
|||
import {
|
||||
listLocalKnowledgePageKeys,
|
||||
listLocalKnowledgePages,
|
||||
listReferencedConnectionIds,
|
||||
readLocalKnowledgePage,
|
||||
searchLocalKnowledgePages,
|
||||
writeLocalKnowledgePage,
|
||||
} from '../../../src/context/wiki/local-knowledge.js';
|
||||
import { SqliteKnowledgeIndex } from '../../../src/context/wiki/sqlite-knowledge-index.js';
|
||||
|
||||
class FakeEmbeddingPort {
|
||||
readonly maxBatchSize = 16;
|
||||
|
|
@@ -284,6 +286,203 @@ describe('local knowledge helpers', () => {
|
|||
expect(raw.content).toContain(['fingerprints:', ' - fp_paid_orders'].join('\n'));
|
||||
});
|
||||
|
||||
it('round-trips a connections list through write, read, and list', async () => {
|
||||
await writeLocalKnowledgePage(project, {
|
||||
key: 'orders-sales-db',
|
||||
scope: 'GLOBAL',
|
||||
summary: 'Orders concept for the sales database',
|
||||
content: 'In sales_db, orders are recognized when paid.',
|
||||
connections: ['sales_db'],
|
||||
});
|
||||
|
||||
const raw = await project.fileStore.readFile('wiki/global/orders-sales-db.md');
|
||||
expect(raw.content).toContain(['connections:', ' - sales_db'].join('\n'));
|
||||
|
||||
await expect(readLocalKnowledgePage(project, { key: 'orders-sales-db', userId: 'local' })).resolves.toMatchObject({
|
||||
key: 'orders-sales-db',
|
||||
connections: ['sales_db'],
|
||||
});
|
||||
});
|
||||
|
||||
it('normalizes a single connections string to a list at parse time', async () => {
|
||||
await project.fileStore.writeFile(
|
||||
'wiki/global/single-scoped.md',
|
||||
'---\nsummary: Single connection as scalar\nusage_mode: auto\nconnections: events_db\n---\n\nBody\n',
|
||||
'Test',
|
||||
'test@example.com',
|
||||
'Write scalar connections page',
|
||||
);
|
||||
|
||||
await expect(readLocalKnowledgePage(project, { key: 'single-scoped', userId: 'local' })).resolves.toMatchObject({
|
||||
key: 'single-scoped',
|
||||
connections: ['events_db'],
|
||||
});
|
||||
});
|
||||
|
||||
it('treats an absent connections field as unscoped (empty list)', async () => {
|
||||
await writeLocalKnowledgePage(project, {
|
||||
key: 'fiscal-year',
|
||||
scope: 'GLOBAL',
|
||||
summary: 'Org-wide fiscal year',
|
||||
content: 'Fiscal year starts in February.',
|
||||
});
|
||||
|
||||
await expect(readLocalKnowledgePage(project, { key: 'fiscal-year', userId: 'local' })).resolves.toMatchObject({
|
||||
key: 'fiscal-year',
|
||||
connections: [],
|
||||
});
|
||||
});
|
||||
|
||||
it('scopes search to unscoped pages plus pages listing the requested connection', async () => {
|
||||
await writeLocalKnowledgePage(project, {
|
||||
key: 'orders-sales-db',
|
||||
scope: 'GLOBAL',
|
||||
summary: 'Sales DB orders',
|
||||
content: 'Orders are paid in the sales database.',
|
||||
connections: ['sales_db'],
|
||||
});
|
||||
await writeLocalKnowledgePage(project, {
|
||||
key: 'orders-events-db',
|
||||
scope: 'GLOBAL',
|
||||
summary: 'Events DB orders',
|
||||
content: 'Orders are paid in the events database.',
|
||||
connections: ['events_db'],
|
||||
});
|
||||
await writeLocalKnowledgePage(project, {
|
||||
key: 'orders-global',
|
||||
scope: 'GLOBAL',
|
||||
summary: 'Org-wide orders note',
|
||||
content: 'Orders are paid everywhere in the org.',
|
||||
});
|
||||
|
||||
const scoped = await searchLocalKnowledgePages(project, {
|
||||
query: 'orders paid',
|
||||
userId: 'local',
|
||||
connectionId: 'sales_db',
|
||||
});
|
||||
const keys = scoped.map((result) => result.key).sort();
|
||||
expect(keys).toEqual(['orders-global', 'orders-sales-db']);
|
||||
expect(keys).not.toContain('orders-events-db');
|
||||
|
||||
const unfiltered = await searchLocalKnowledgePages(project, { query: 'orders paid', userId: 'local' });
|
||||
expect(unfiltered.map((result) => result.key).sort()).toEqual([
|
||||
'orders-events-db',
|
||||
'orders-global',
|
||||
'orders-sales-db',
|
||||
]);
|
||||
});
|
||||
|
||||
it('keeps other-connection pages and embeddings in the sqlite index after a scoped search', async () => {
|
||||
const embedding = new FakeEmbeddingPort();
|
||||
await writeLocalKnowledgePage(project, {
|
||||
key: 'orders-sales-db',
|
||||
scope: 'GLOBAL',
|
||||
summary: 'Sales DB orders',
|
||||
content: 'Orders are paid in the sales database.',
|
||||
connections: ['sales_db'],
|
||||
});
|
||||
await writeLocalKnowledgePage(project, {
|
||||
key: 'orders-events-db',
|
||||
scope: 'GLOBAL',
|
||||
summary: 'Events DB orders',
|
||||
content: 'Orders are paid in the events database.',
|
||||
connections: ['events_db'],
|
||||
});
|
||||
|
||||
const scoped = await searchLocalKnowledgePages(project, {
|
||||
query: 'orders paid',
|
||||
userId: 'local',
|
||||
connectionId: 'sales_db',
|
||||
embeddingService: embedding,
|
||||
});
|
||||
expect(scoped.map((result) => result.key)).toEqual(['orders-sales-db']);
|
||||
|
||||
// A connection-scoped search must not prune the other connection's page (or
|
||||
// its cached embedding) from the shared persistent index.
|
||||
const index = new SqliteKnowledgeIndex({ dbPath: join(project.projectDir, '.ktx', 'db.sqlite') });
|
||||
const indexed = index.getExistingPages();
|
||||
expect([...indexed.keys()].sort()).toEqual([
|
||||
'wiki/global/orders-events-db.md',
|
||||
'wiki/global/orders-sales-db.md',
|
||||
]);
|
||||
expect(indexed.get('wiki/global/orders-events-db.md')?.embedding).not.toBeNull();
|
||||
});
|
||||
|
||||
it('filters search per connection across lexical and token lanes when embeddings are disabled', async () => {
|
||||
await writeLocalKnowledgePage(project, {
|
||||
key: 'rfm-events-db',
|
||||
scope: 'GLOBAL',
|
||||
summary: 'RFM definition for events_db',
|
||||
content: 'RFM segmentation rules for the events database.',
|
||||
connections: ['events_db'],
|
||||
});
|
||||
await writeLocalKnowledgePage(project, {
|
||||
key: 'rfm-sales-db',
|
||||
scope: 'GLOBAL',
|
||||
summary: 'RFM definition for sales_db',
|
||||
content: 'RFM segmentation rules for the sales database.',
|
||||
connections: ['sales_db'],
|
||||
});
|
||||
|
||||
const lexical = await searchLocalKnowledgePages(project, {
|
||||
query: 'rfm segmentation',
|
||||
userId: 'local',
|
||||
connectionId: 'events_db',
|
||||
});
|
||||
expect(lexical.map((result) => result.key)).toEqual(['rfm-events-db']);
|
||||
|
||||
const token = await searchLocalKnowledgePages(project, {
|
||||
query: 'segmentation---',
|
||||
userId: 'local',
|
||||
connectionId: 'events_db',
|
||||
});
|
||||
expect(token.map((result) => result.key)).toEqual(['rfm-events-db']);
|
||||
});
|
||||
|
||||
it('filters list output by connection while keeping unscoped pages', async () => {
|
||||
await writeLocalKnowledgePage(project, {
|
||||
key: 'orders-sales-db',
|
||||
scope: 'GLOBAL',
|
||||
summary: 'Sales DB orders',
|
||||
content: 'Sales orders.',
|
||||
connections: ['sales_db'],
|
||||
});
|
||||
await writeLocalKnowledgePage(project, {
|
||||
key: 'orders-events-db',
|
||||
scope: 'GLOBAL',
|
||||
summary: 'Events DB orders',
|
||||
content: 'Events orders.',
|
||||
connections: ['events_db'],
|
||||
});
|
||||
await writeLocalKnowledgePage(project, {
|
||||
key: 'orders-global',
|
||||
scope: 'GLOBAL',
|
||||
summary: 'Org-wide orders',
|
||||
content: 'Global orders.',
|
||||
});
|
||||
|
||||
const scoped = await listLocalKnowledgePages(project, { userId: 'local', connectionId: 'sales_db' });
|
||||
expect(scoped.map((page) => page.key).sort()).toEqual(['orders-global', 'orders-sales-db']);
|
||||
});
|
||||
|
||||
it('keeps a page referencing an unconfigured connection searchable and readable', async () => {
|
||||
await writeLocalKnowledgePage(project, {
|
||||
key: 'rfm-removed-db',
|
||||
scope: 'GLOBAL',
|
||||
summary: 'RFM for a since-removed database',
|
||||
content: 'RFM rules.',
|
||||
connections: ['removed_db'],
|
||||
});
|
||||
|
||||
await expect(readLocalKnowledgePage(project, { key: 'rfm-removed-db', userId: 'local' })).resolves.toMatchObject({
|
||||
key: 'rfm-removed-db',
|
||||
connections: ['removed_db'],
|
||||
});
|
||||
const search = await searchLocalKnowledgePages(project, { query: 'rfm rules', userId: 'local' });
|
||||
expect(search.map((result) => result.key)).toContain('rfm-removed-db');
|
||||
await expect(listReferencedConnectionIds(project, { userId: 'local' })).resolves.toEqual(['removed_db']);
|
||||
});
|
||||
|
||||
it('falls back to Markdown scanning when the config does not select sqlite-fts5', async () => {
|
||||
project.config.storage.search = 'postgres-hybrid';
|
||||
await writeLocalKnowledgePage(project, {
|
||||
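The scoping rule exercised by the tests above (a scoped query returns unscoped pages plus pages that list the requested connection, and a scalar `connections` value is normalized to a list) can be sketched as follows. This is a minimal illustration, not the project's implementation: `WikiPage`, `normalizeConnections`, and `pagesInScope` are hypothetical names introduced here.

```typescript
// Hypothetical page shape for illustration; the real frontmatter type may differ.
interface WikiPage {
  key: string;
  connections?: string | string[];
}

// Normalize a scalar or list to a list; absent or empty means "unscoped".
function normalizeConnections(value: string | string[] | undefined): string[] {
  if (value === undefined) return [];
  const list = Array.isArray(value) ? value : [value];
  return list.map((id) => id.trim()).filter((id) => id.length > 0);
}

// Build the candidate pool first, so any later ranking or limit only sees in-scope pages.
function pagesInScope(pages: WikiPage[], connectionId?: string): WikiPage[] {
  if (connectionId === undefined) return pages; // no filter requested
  return pages.filter((page) => {
    const connections = normalizeConnections(page.connections);
    return connections.length === 0 || connections.includes(connectionId);
  });
}

const samplePages: WikiPage[] = [
  { key: 'orders-sales-db', connections: ['sales_db'] },
  { key: 'orders-events-db', connections: 'events_db' }, // scalar form
  { key: 'orders-global' }, // unscoped
];

const scopedKeys = pagesInScope(samplePages, 'sales_db')
  .map((page) => page.key)
  .sort();
console.log(scopedKeys);
```

Filtering before ranking, rather than after, is what lets a lone in-scope page survive when many out-of-scope pages would otherwise fill the candidate pool.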
|
|
|
|||
|
|
@@ -142,6 +142,49 @@ describe('SqliteKnowledgeIndex', () => {
|
|||
]);
|
||||
});
|
||||
|
||||
it('restricts lexical candidates to the allowlist', () => {
|
||||
const index = new SqliteKnowledgeIndex({ dbPath });
|
||||
index.sync([
|
||||
page({ path: 'wiki/global/revenue.md', key: 'revenue' }),
|
||||
page({ path: 'wiki/global/support.md', key: 'support', content: 'Orders are paid by the support team.' }),
|
||||
]);
|
||||
|
||||
expect(
|
||||
index
|
||||
.searchLexicalCandidates({ queryText: 'paid', limit: 10, allowedPaths: ['wiki/global/support.md'] })
|
||||
.map((row) => row.path),
|
||||
).toEqual(['wiki/global/support.md']);
|
||||
});
|
||||
|
||||
it('applies the allowlist before the semantic limit so an in-scope match survives', () => {
|
||||
const index = new SqliteKnowledgeIndex({ dbPath });
|
||||
index.sync([
|
||||
page({ path: 'wiki/global/noise-a.md', key: 'noise-a', embedding: [1, 0] }),
|
||||
page({ path: 'wiki/global/noise-b.md', key: 'noise-b', embedding: [1, 0] }),
|
||||
page({ path: 'wiki/global/target.md', key: 'target', embedding: [1, 0] }),
|
||||
]);
|
||||
|
||||
// All three tie on similarity; a limit of 1 over the full corpus drops the target.
|
||||
expect(index.searchSemanticCandidates({ queryEmbedding: [1, 0], limit: 1 }).map((row) => row.path)).toEqual([
|
||||
'wiki/global/noise-a.md',
|
||||
]);
|
||||
|
||||
// Scoped to the target, the limit applies after the allowlist, so it survives.
|
||||
expect(
|
||||
index
|
||||
.searchSemanticCandidates({ queryEmbedding: [1, 0], limit: 1, allowedPaths: ['wiki/global/target.md'] })
|
||||
.map((row) => row.path),
|
||||
).toEqual(['wiki/global/target.md']);
|
||||
});
|
||||
|
||||
it('treats an empty allowlist as no page in scope', () => {
|
||||
const index = new SqliteKnowledgeIndex({ dbPath });
|
||||
index.sync([page({ embedding: [1, 0] })]);
|
||||
|
||||
expect(index.searchLexicalCandidates({ queryText: 'paid order', limit: 10, allowedPaths: [] })).toEqual([]);
|
||||
expect(index.searchSemanticCandidates({ queryEmbedding: [1, 0], limit: 10, allowedPaths: [] })).toEqual([]);
|
||||
});
|
||||
|
||||
it('returns an empty result for blank or punctuation-only queries', () => {
|
||||
const index = new SqliteKnowledgeIndex({ dbPath });
|
||||
index.rebuild([page()]);
|
||||
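The allowlist tests above depend on the order of operations: the allowlist must narrow the corpus before the `limit` is applied. A minimal in-memory sketch of that ordering, assuming cosine similarity and a path tie-break (both are assumptions of this sketch, as are the names `IndexedPage` and `semanticCandidates`; the real index is SQLite-backed):

```typescript
// Hypothetical in-memory stand-in for an indexed page.
interface IndexedPage {
  path: string;
  embedding: number[];
}

function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < Math.min(a.length, b.length); i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

function semanticCandidates(
  pages: IndexedPage[],
  queryEmbedding: number[],
  limit: number,
  allowedPaths?: string[],
): string[] {
  // undefined means "no allowlist"; an empty array means "nothing is in scope".
  const allowed = allowedPaths === undefined ? null : new Set(allowedPaths);
  return pages
    .filter((page) => allowed === null || allowed.has(page.path)) // 1. scope
    .map((page) => ({ path: page.path, score: cosine(page.embedding, queryEmbedding) }))
    .sort((l, r) => r.score - l.score || l.path.localeCompare(r.path)) // 2. rank
    .slice(0, limit) // 3. limit
    .map((row) => row.path);
}

const corpus: IndexedPage[] = [
  { path: 'wiki/global/noise-a.md', embedding: [1, 0] },
  { path: 'wiki/global/noise-b.md', embedding: [1, 0] },
  { path: 'wiki/global/target.md', embedding: [1, 0] },
];

const unscopedTop = semanticCandidates(corpus, [1, 0], 1);
const scopedTop = semanticCandidates(corpus, [1, 0], 1, ['wiki/global/target.md']);
const emptyScope = semanticCandidates(corpus, [1, 0], 10, []);
console.log(unscopedTop, scopedTop, emptyScope);
```

Swapping steps 1 and 3 would reproduce the failure the test guards against: with all three pages tied, a limit of 1 keeps only the first path and the target is dropped before the allowlist is consulted.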
|
|
|
|||
|
|
@@ -263,6 +263,108 @@ describe('WikiWriteTool', () => {
|
|||
});
|
||||
});
|
||||
|
||||
it('sets connections on a new page and normalizes a single string to a list', async () => {
|
||||
const { tool, wikiService } = makeTool();
|
||||
|
||||
await tool.call(
|
||||
{ key: 'orders-sales-db', summary: 'Sales orders', content: '# Orders', connections: 'sales_db' } as any,
|
||||
baseContext,
|
||||
);
|
||||
|
||||
expect(wikiService.writePage.mock.calls[0][3]).toMatchObject({ connections: ['sales_db'] });
|
||||
});
|
||||
|
||||
it('applies REPLACE semantics for connections on update', async () => {
|
||||
const existing = {
|
||||
pageKey: 'orders',
|
||||
frontmatter: { summary: 'Orders', usage_mode: 'auto' as const, sort_order: 0, connections: ['sales_db'] },
|
||||
content: 'body',
|
||||
};
|
||||
// omit ⇒ keep existing connections
|
||||
{
|
||||
const { tool, wikiService } = makeTool({ wikiService: { readPage: vi.fn().mockResolvedValue(existing) } });
|
||||
await tool.call({ key: 'orders', summary: 'Orders', content: 'new body' } as any, baseContext);
|
||||
expect(wikiService.writePage.mock.calls[0][3]).toMatchObject({ connections: ['sales_db'] });
|
||||
}
|
||||
// [] ⇒ clear to unscoped
|
||||
{
|
||||
const { tool, wikiService } = makeTool({ wikiService: { readPage: vi.fn().mockResolvedValue(existing) } });
|
||||
await tool.call({ key: 'orders', summary: 'Orders', content: 'new body', connections: [] } as any, baseContext);
|
||||
expect(wikiService.writePage.mock.calls[0][3]).toMatchObject({ connections: [] });
|
||||
}
|
||||
// [ids] ⇒ set (broaden within overlap is allowed)
|
||||
{
|
||||
const { tool, wikiService } = makeTool({ wikiService: { readPage: vi.fn().mockResolvedValue(existing) } });
|
||||
await tool.call(
|
||||
{ key: 'orders', summary: 'Orders', content: 'new body', connections: ['sales_db', 'events_db'] } as any,
|
||||
baseContext,
|
||||
);
|
||||
expect(wikiService.writePage.mock.calls[0][3]).toMatchObject({ connections: ['sales_db', 'events_db'] });
|
||||
}
|
||||
});
|
||||
|
||||
it('blocks a connection-scoped write whose key collides with a disjoint-connection page', async () => {
|
||||
const { tool, wikiService } = makeTool({
|
||||
wikiService: {
|
||||
readPage: vi.fn().mockResolvedValue({
|
||||
pageKey: 'orders',
|
||||
frontmatter: { summary: 'Events orders', usage_mode: 'auto', sort_order: 0, connections: ['events_db'] },
|
||||
content: 'events body',
|
||||
}),
|
||||
},
|
||||
});
|
||||
|
||||
const result = await tool.call(
|
||||
{ key: 'orders', summary: 'Sales orders', content: 'sales body', connections: ['sales_db'] } as any,
|
||||
baseContext,
|
||||
);
|
||||
|
||||
expect(result.structured).toEqual({ success: false, key: 'orders' });
|
||||
expect(result.markdown).toContain('already exists scoped to a different connection');
|
||||
expect(result.markdown).toContain('orders_sales_db');
|
||||
expect(wikiService.writePage).not.toHaveBeenCalled();
|
||||
});
|
||||
|
||||
it('allows narrowing a connection-scoped page within its own scope', async () => {
|
||||
const { tool, wikiService } = makeTool({
|
||||
wikiService: {
|
||||
readPage: vi.fn().mockResolvedValue({
|
||||
pageKey: 'orders',
|
||||
frontmatter: { summary: 'Orders', usage_mode: 'auto', sort_order: 0, connections: ['sales_db', 'events_db'] },
|
||||
content: 'body',
|
||||
}),
|
||||
},
|
||||
});
|
||||
|
||||
const result = await tool.call(
|
||||
{ key: 'orders', summary: 'Orders', content: 'body', connections: ['sales_db'] } as any,
|
||||
baseContext,
|
||||
);
|
||||
|
||||
expect(result.structured).toMatchObject({ success: true, action: 'updated' });
|
||||
expect(wikiService.writePage.mock.calls[0][3]).toMatchObject({ connections: ['sales_db'] });
|
||||
});
|
||||
|
||||
it('allows scoping a previously unscoped page (existing connections empty)', async () => {
|
||||
const { tool, wikiService } = makeTool({
|
||||
wikiService: {
|
||||
readPage: vi.fn().mockResolvedValue({
|
||||
pageKey: 'orders',
|
||||
frontmatter: { summary: 'Orders', usage_mode: 'auto', sort_order: 0 },
|
||||
content: 'body',
|
||||
}),
|
||||
},
|
||||
});
|
||||
|
||||
const result = await tool.call(
|
||||
{ key: 'orders', summary: 'Orders', content: 'body', connections: ['sales_db'] } as any,
|
||||
baseContext,
|
||||
);
|
||||
|
||||
expect(result.structured).toMatchObject({ success: true, action: 'updated' });
|
||||
expect(wikiService.writePage.mock.calls[0][3]).toMatchObject({ connections: ['sales_db'] });
|
||||
});
|
||||
|
||||
it('rejects frontmatter refs that target missing wiki pages', async () => {
|
||||
const { tool, wikiService } = makeTool({
|
||||
wikiService: {
|
||||
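The WikiWriteTool tests above describe two rules: REPLACE semantics for `connections` (omit keeps, `[]` clears, a list sets) and a guard that rejects a scoped write when the existing page is scoped to a disjoint set of connections. One way to express both as a single pure function is sketched below. `resolveConnections` and `WriteDecision` are hypothetical names, and the rejection text only echoes the substring the test asserts on.

```typescript
type ConnectionsInput = string | string[] | undefined;

type WriteDecision =
  | { ok: true; connections: string[] }
  | { ok: false; reason: string };

function toList(value: ConnectionsInput): string[] | undefined {
  if (value === undefined) return undefined;
  return Array.isArray(value) ? value : [value];
}

// existing: connections on the page already stored under this key (undefined if none).
// requested: the connections argument of the incoming write.
function resolveConnections(existing: string[] | undefined, requested: ConnectionsInput): WriteDecision {
  const current = existing ?? [];
  const next = toList(requested);

  // Omitted argument keeps whatever the page already has.
  if (next === undefined) return { ok: true, connections: current };

  // Data-loss guard: both sides are scoped and they share no connection.
  const disjoint =
    current.length > 0 && next.length > 0 && !next.some((id) => current.includes(id));
  if (disjoint) {
    return { ok: false, reason: 'page already exists scoped to a different connection' };
  }

  // Otherwise the new value replaces the old one ([] clears to unscoped).
  return { ok: true, connections: next };
}

const kept = resolveConnections(['sales_db'], undefined);
const cleared = resolveConnections(['sales_db'], []);
const broadened = resolveConnections(['sales_db'], ['sales_db', 'events_db']);
const blocked = resolveConnections(['events_db'], ['sales_db']);
const narrowed = resolveConnections(['sales_db', 'events_db'], 'sales_db');
const newlyScoped = resolveConnections(undefined, ['sales_db']);
console.log(kept, cleared, broadened, blocked, narrowed, newlyScoped);
```

Keeping the decision separate from the write makes it easy to assert, as the test does, that the write is never attempted when the guard fires.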
|
|
|
|||
|
|
@@ -989,6 +989,33 @@ describe('runKtxCli', () => {
|
|||
expect(testIo.stderr()).toMatch(/--text\/--file does not accept a positional connection id/);
|
||||
});
|
||||
|
||||
it('threads --verbatim into the text ingest args', async () => {
|
||||
const textIngest = vi.fn(async () => 0);
|
||||
const testIo = makeIo();
|
||||
|
||||
await expect(
|
||||
runKtxCli(['--project-dir', tempDir, 'ingest', '--file', 'doc.md', '--verbatim', '--json'], testIo.io, {
|
||||
textIngest,
|
||||
}),
|
||||
).resolves.toBe(0);
|
||||
|
||||
expect(textIngest).toHaveBeenCalledWith(expect.objectContaining({ files: ['doc.md'], verbatim: true }), testIo.io);
|
||||
});
|
||||
|
||||
it('rejects --verbatim without --text or --file', async () => {
|
||||
const textIngest = vi.fn(async () => 0);
|
||||
const publicIngest = vi.fn(async () => 0);
|
||||
const testIo = makeIo();
|
||||
|
||||
await expect(
|
||||
runKtxCli(['--project-dir', tempDir, 'ingest', '--verbatim'], testIo.io, { textIngest, publicIngest }),
|
||||
).resolves.toBe(1);
|
||||
|
||||
expect(textIngest).not.toHaveBeenCalled();
|
||||
expect(publicIngest).not.toHaveBeenCalled();
|
||||
expect(testIo.stderr()).toMatch(/requires --text or --file/);
|
||||
});
|
||||
|
||||
it('treats bare ingest as ingest --all', async () => {
|
||||
const publicIngest = vi.fn().mockResolvedValue(0);
|
||||
const testIo = makeIo();
|
||||
|
|
|
|||
|
|
@@ -3,8 +3,9 @@ import { tmpdir } from 'node:os';
|
|||
import { join } from 'node:path';
|
||||
import { stripVTControlCharacters } from 'node:util';
|
||||
import { initKtxProject, loadKtxProject } from '../src/context/project/project.js';
|
||||
import { serializeKtxProjectConfig } from '../src/context/project/config.js';
|
||||
import type { KtxEmbeddingPort } from '../src/context/core/embedding.js';
|
||||
-import { writeLocalKnowledgePage } from '../src/context/wiki/local-knowledge.js';
+import { searchLocalKnowledgePages, writeLocalKnowledgePage } from '../src/context/wiki/local-knowledge.js';
|
||||
import { afterEach, beforeEach, describe, expect, it, vi } from 'vitest';
|
||||
import { runKtxKnowledge } from '../src/knowledge.js';
|
||||
|
||||
|
|
@@ -98,6 +99,118 @@ describe('runKtxKnowledge', () => {
|
|||
expect(searchIo.stdout()).toContain('metrics-revenue');
|
||||
});
|
||||
|
||||
it('scopes wiki list/search by --connection and rejects unknown ids', async () => {
|
||||
const projectDir = join(tempDir, 'connection-project');
|
||||
await initKtxProject({ projectDir });
|
||||
const project = await loadKtxProject({ projectDir });
|
||||
project.config.connections.sales_db = { driver: 'sqlite', url: 'file:sales.db' };
|
||||
project.config.connections.events_db = { driver: 'sqlite', url: 'file:events.db' };
|
||||
await project.fileStore.writeFile(
|
||||
'ktx.yaml',
|
||||
serializeKtxProjectConfig(project.config),
|
||||
'ktx',
|
||||
'ktx@example.com',
|
||||
'configure connections',
|
||||
);
|
||||
await writeLocalKnowledgePage(project, {
|
||||
key: 'orders-sales',
|
||||
scope: 'GLOBAL',
|
||||
summary: 'Sales orders',
|
||||
content: 'Orders are paid in sales.',
|
||||
connections: ['sales_db'],
|
||||
});
|
||||
await writeLocalKnowledgePage(project, {
|
||||
key: 'orders-events',
|
||||
scope: 'GLOBAL',
|
||||
summary: 'Events orders',
|
||||
content: 'Orders are paid in events.',
|
||||
connections: ['events_db'],
|
||||
});
|
||||
await writeLocalKnowledgePage(project, {
|
||||
key: 'orders-global',
|
||||
scope: 'GLOBAL',
|
||||
summary: 'Org-wide orders',
|
||||
content: 'Orders are paid everywhere.',
|
||||
});
|
||||
|
||||
const listIo = makeIo();
|
||||
await expect(
|
||||
runKtxKnowledge(
|
||||
{ command: 'list', projectDir, userId: 'local', connectionId: 'sales_db', cliVersion: '0.0.0-test' },
|
||||
listIo.io,
|
||||
),
|
||||
).resolves.toBe(0);
|
||||
expect(listIo.stdout()).toContain('orders-sales');
|
||||
expect(listIo.stdout()).toContain('orders-global');
|
||||
expect(listIo.stdout()).not.toContain('orders-events');
|
||||
|
||||
const searchIo = makeIo();
|
||||
await expect(
|
||||
runKtxKnowledge(
|
||||
{
|
||||
command: 'search',
|
||||
projectDir,
|
||||
query: 'orders paid',
|
||||
userId: 'local',
|
||||
connectionId: 'events_db',
|
||||
cliVersion: '0.0.0-test',
|
||||
},
|
||||
searchIo.io,
|
||||
),
|
||||
).resolves.toBe(0);
|
||||
expect(searchIo.stdout()).toContain('orders-events');
|
||||
expect(searchIo.stdout()).toContain('orders-global');
|
||||
expect(searchIo.stdout()).not.toContain('orders-sales');
|
||||
|
||||
const badIo = makeIo();
|
||||
await expect(
|
||||
runKtxKnowledge(
|
||||
{ command: 'search', projectDir, query: 'orders', userId: 'local', connectionId: 'warehouse', cliVersion: '0.0.0-test' },
|
||||
badIo.io,
|
||||
),
|
||||
).resolves.toBe(1);
|
||||
expect(badIo.stderr()).toContain('Unknown connection "warehouse". Configured connections: events_db, sales_db.');
|
||||
});
|
||||
|
||||
it('keeps a connection-scoped page that ranks below the lane candidate pool limit', async () => {
|
||||
const projectDir = join(tempDir, 'scoped-pool-project');
|
||||
await initKtxProject({ projectDir });
|
||||
const project = await loadKtxProject({ projectDir });
|
||||
|
||||
// The lane candidate pool floor is 25; seed >25 other-connection pages so the
|
||||
// single target-connection page only survives if scope is applied before the
|
||||
// lane limit, not after.
|
||||
for (let i = 0; i < 30; i++) {
|
||||
await writeLocalKnowledgePage(project, {
|
||||
key: `noise-${String(i).padStart(2, '0')}`,
|
||||
scope: 'GLOBAL',
|
||||
summary: 'Revenue',
|
||||
content: 'Revenue is paid order value.',
|
||||
connections: ['noise_db'],
|
||||
});
|
||||
}
|
||||
// Path sorts after every noise page, so a slice-before-filter lane drops it.
|
||||
await writeLocalKnowledgePage(project, {
|
||||
key: 'zzz-target',
|
||||
scope: 'GLOBAL',
|
||||
summary: 'Revenue',
|
||||
content: 'Revenue is paid order value.',
|
||||
connections: ['target_db'],
|
||||
});
|
||||
|
||||
// "arr" matches the target only semantically (FakeEmbeddingPort), never by
|
||||
// literal token, so the token lane cannot mask a dropped semantic hit.
|
||||
const results = await searchLocalKnowledgePages(project, {
|
||||
query: 'arr',
|
||||
userId: 'local',
|
||||
connectionId: 'target_db',
|
||||
embeddingService: new FakeEmbeddingPort(),
|
||||
limit: 5,
|
||||
});
|
||||
|
||||
expect(results.map((result) => result.key)).toContain('zzz-target');
|
||||
});
|
||||
|
||||
it('reads a wiki page as raw markdown with frontmatter', async () => {
|
||||
const projectDir = join(tempDir, 'read-project');
|
||||
await initKtxProject({ projectDir });
|
||||
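The `--connection` test above expects an unknown id to fail with a message that lists the configured connections. A minimal sketch of such a check, assuming the configured ids are the keys of the `connections` map and are reported in sorted order (the ordering is inferred from the asserted message; `assertKnownConnection` is a hypothetical name, not the project's shared helper):

```typescript
// Hypothetical validator; throws when the id is not a configured connection.
function assertKnownConnection(connectionId: string, configured: Record<string, unknown>): void {
  const ids = Object.keys(configured).sort();
  if (!ids.includes(connectionId)) {
    throw new Error(`Unknown connection "${connectionId}". Configured connections: ${ids.join(', ')}.`);
  }
}

const configuredConnections = {
  sales_db: { driver: 'sqlite', url: 'file:sales.db' },
  events_db: { driver: 'sqlite', url: 'file:events.db' },
};

let unknownMessage = '';
try {
  assertKnownConnection('warehouse', configuredConnections);
} catch (error) {
  unknownMessage = (error as Error).message;
}
console.log(unknownMessage);
```

Validating once at the argument boundary keeps the search and list paths free of their own ad hoc checks.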
|
|
|
|||
|
|
@@ -69,7 +69,7 @@ describe('createKtxCliScanConnector', () => {
|
|||
' driver: bigquery',
|
||||
' dataset_id: analytics',
|
||||
' max_bytes_billed: "987654321"',
|
||||
-' job_timeout_ms: 30000',
+' query_timeout_ms: 30000',
|
||||
'',
|
||||
].join('\n'),
|
||||
'utf-8',
|
||||
|
|
@@ -85,7 +85,7 @@ describe('createKtxCliScanConnector', () => {
|
|||
connectionId: 'warehouse',
|
||||
connection: expect.objectContaining({
|
||||
max_bytes_billed: '987654321',
|
||||
-job_timeout_ms: 30000,
+query_timeout_ms: 30000,
|
||||
}),
|
||||
}),
|
||||
]);
|
||||
|
|
|
|||
|
|
@@ -194,6 +194,32 @@ function createTestMcpServer() {
|
|||
};
|
||||
}
|
||||
|
||||
function capturingIo() {
|
||||
let buf = '';
|
||||
return {
|
||||
io: { stdout: { write() {} }, stderr: { write(chunk: string) { buf += chunk; } } },
|
||||
text: () => buf,
|
||||
json: () =>
|
||||
buf
|
||||
.split('\n')
|
||||
.filter((line) => line.trim().startsWith('{'))
|
||||
.map((line) => JSON.parse(line) as Record<string, unknown>),
|
||||
};
|
||||
}
|
||||
|
||||
function initializeBody() {
|
||||
return {
|
||||
jsonrpc: '2.0' as const,
|
||||
id: 1,
|
||||
method: 'initialize',
|
||||
params: {
|
||||
protocolVersion: '2025-06-18',
|
||||
capabilities: {},
|
||||
clientInfo: { name: 'vitest', version: '0.0.0' },
|
||||
},
|
||||
};
|
||||
}
|
||||
|
||||
describe('runKtxMcpHttpServer', () => {
|
||||
it('serves /health with project metadata', async () => {
|
||||
const handle = await runKtxMcpHttpServer({
|
||||
|
|
@@ -208,11 +234,14 @@ describe('runKtxMcpHttpServer', () => {
|
|||
const port = (handle.server.address() as AddressInfo).port;
|
||||
const response = await get(port, '/health');
|
||||
expect(response.status).toBe(200);
|
||||
-expect(JSON.parse(response.body)).toEqual({
+const body = JSON.parse(response.body);
+expect(body).toMatchObject({
|
||||
status: 'ok',
|
||||
projectDir: '/tmp/ktx-project',
|
||||
port,
|
||||
});
|
||||
expect(typeof body.uptimeMs).toBe('number');
|
||||
expect(body.uptimeMs).toBeGreaterThanOrEqual(0);
|
||||
} finally {
|
||||
await handle.close();
|
||||
}
|
||||
|
|
@@ -271,4 +300,55 @@ describe('runKtxMcpHttpServer', () => {
|
|||
await handle.close();
|
||||
}
|
||||
});
|
||||
|
||||
it('logs session open and close with the session id', async () => {
|
||||
const cap = capturingIo();
|
||||
const handle = await runKtxMcpHttpServer({
|
||||
projectDir: '/tmp/ktx-project',
|
||||
host: '127.0.0.1',
|
||||
port: 0,
|
||||
allowedHosts: [],
|
||||
allowedOrigins: [],
|
||||
createMcpServer: createTestMcpServer(),
|
||||
io: cap.io,
|
||||
});
|
||||
let sessionId: string | undefined;
|
||||
try {
|
||||
const port = (handle.server.address() as AddressInfo).port;
|
||||
const response = await postJson(port, '/mcp', initializeBody());
|
||||
sessionId = response.headers['mcp-session-id'] as string;
|
||||
expect(sessionId).toBeTruthy();
|
||||
} finally {
|
||||
await handle.close();
|
||||
}
|
||||
|
||||
const lines = cap.json();
|
||||
expect(lines.find((line) => line.msg === 'session.open')?.sessionId).toBe(sessionId);
|
||||
expect(lines.some((line) => line.msg === 'session.close' && line.sessionId === sessionId)).toBe(true);
|
||||
});
|
||||
|
||||
it('never writes the bearer token to the log (headers are not logged)', async () => {
|
||||
const cap = capturingIo();
|
||||
const token = 'super-secret-token-value'; // pragma: allowlist secret
|
||||
const handle = await runKtxMcpHttpServer({
|
||||
projectDir: '/tmp/ktx-project',
|
||||
host: '127.0.0.1',
|
||||
port: 0,
|
||||
token,
|
||||
allowedHosts: [],
|
||||
allowedOrigins: [],
|
||||
createMcpServer: createTestMcpServer(),
|
||||
io: cap.io,
|
||||
});
|
||||
try {
|
||||
const port = (handle.server.address() as AddressInfo).port;
|
||||
const response = await postJson(port, '/mcp', initializeBody(), { authorization: `Bearer ${token}` });
|
||||
expect(response.status).toBe(200);
|
||||
} finally {
|
||||
await handle.close();
|
||||
}
|
||||
|
||||
expect(cap.json().some((line) => line.msg === 'session.open')).toBe(true);
|
||||
expect(cap.text()).not.toContain(token);
|
||||
});
|
||||
});
|
||||
|
|
|
|||
|
|
@@ -147,14 +147,21 @@ describe('createKtxMcpServerFactory', () => {
|
|||
);
|
||||
|
||||
expect(factory()).toEqual({ kind: 'mcp-server' });
|
||||
-expect(createDefaultKtxMcpServer).toHaveBeenCalledWith(
-expect.objectContaining({
-contextTools: expect.objectContaining({
-context_tool: { name: 'context_tool' },
-memoryIngest: mocks.memoryIngest,
-}),
-}),
-);
+// memoryIngest is wrapped to validate an explicit connectionId before delegating,
+// so it is no longer the raw service object — assert it delegates instead.
+const contextTools = (vi.mocked(createDefaultKtxMcpServer).mock.calls[0]![0].contextTools ?? {}) as Record<
+string,
+unknown
+>;
+expect(contextTools.context_tool).toEqual({ name: 'context_tool' });
+const memoryIngestPort = contextTools.memoryIngest as
+| { ingest: (input: unknown) => unknown; status: (runId: string) => unknown }
+| undefined;
+expect(memoryIngestPort).toBeDefined();
+await memoryIngestPort?.ingest({ userId: 'local', chatId: 'c', userMessage: 'm', assistantMessage: 'a' });
+expect(mocks.memoryIngest.ingest).toHaveBeenCalled();
+await memoryIngestPort?.status('run-1');
+expect(mocks.memoryIngest.status).toHaveBeenCalledWith('run-1');
|
||||
});
|
||||
|
||||
it('uses null embedding ports when no configured provider is available', async () => {
|
||||
|
|
|
|||
53
packages/cli/test/mcp-stdio-server.test.ts
Normal file
|
|
@@ -0,0 +1,53 @@
|
|||
import { PassThrough } from 'node:stream';
|
||||
import { McpServer } from '@modelcontextprotocol/sdk/server/mcp.js';
|
||||
import { describe, expect, it } from 'vitest';
|
||||
import { runKtxMcpStdioServer } from '../src/mcp-stdio-server.js';
|
||||
|
||||
function capturingIo() {
|
||||
let buf = '';
|
||||
return {
|
||||
io: { stdout: { write() {} }, stderr: { write(chunk: string) { buf += chunk; } } },
|
||||
json: () =>
|
||||
buf
|
||||
.split('\n')
|
||||
.filter((line) => line.trim().startsWith('{'))
|
||||
.map((line) => JSON.parse(line) as Record<string, unknown>),
|
||||
};
|
||||
}
|
||||
|
||||
function createTestMcpServer() {
|
||||
return () => {
|
||||
const server = new McpServer({ name: 'ktx-test', version: '0.0.0-test' });
|
||||
server.registerTool('ping', { inputSchema: {} }, async () => ({
|
||||
content: [{ type: 'text', text: 'pong' }],
|
||||
}));
|
||||
return server;
|
||||
};
|
||||
}
|
||||
|
||||
describe('runKtxMcpStdioServer logging', () => {
|
||||
it('routes a transport error through the logger as transport.error and marks the session open', async () => {
|
||||
const cap = capturingIo();
|
||||
const stdin = new PassThrough();
|
||||
const stdout = new PassThrough();
|
||||
|
||||
const run = runKtxMcpStdioServer({
|
||||
projectDir: '/tmp/ktx-project',
|
||||
createMcpServer: createTestMcpServer(),
|
||||
io: cap.io,
|
||||
stdin,
|
||||
stdout,
|
||||
});
|
||||
|
||||
// A malformed JSON-RPC line makes the SDK stdio transport surface onerror.
|
||||
stdin.write('this is not json-rpc\n');
|
||||
|
||||
await expect(run).rejects.toBeDefined();
|
||||
|
||||
const lines = cap.json();
|
||||
expect(lines.some((line) => line.msg === 'session.open')).toBe(true);
|
||||
const transportError = lines.find((line) => line.msg === 'transport.error');
|
||||
expect(transportError).toBeDefined();
|
||||
expect(transportError?.err).toBeDefined();
|
||||
});
|
||||
});
|
||||
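The `capturingIo` helper in this test assumes the logger writes one JSON object per line to stderr. A minimal standalone sketch of that parsing step follows; the `parseJsonLogLines` name and the sample log lines are hypothetical and are not taken from ktx.

```typescript
// Hypothetical helper: parse newline-delimited JSON log output, skipping non-JSON lines.
type LogRecord = Record<string, unknown>;

function parseJsonLogLines(buffer: string): LogRecord[] {
  return buffer
    .split('\n')
    // Keep only lines that look like a JSON object; plain-text noise is ignored.
    .filter((line) => line.trim().startsWith('{'))
    .map((line) => JSON.parse(line) as LogRecord);
}

// Sample stderr contents, made up for illustration.
const sampleStderr = [
  '{"level":"info","msg":"session.open"}',
  'plain text warning that is not JSON',
  '{"level":"error","msg":"transport.error","err":{"message":"bad frame"}}',
  '',
].join('\n');

const logRecords = parseJsonLogLines(sampleStderr);
const sawSessionOpen = logRecords.some((record) => record['msg'] === 'session.open');
const transportErrorRecord = logRecords.find((record) => record['msg'] === 'transport.error');
console.log(logRecords.length, sawSessionOpen, transportErrorRecord?.['err'] !== undefined);
```

Filtering on a leading `{` keeps the parser tolerant of stray plain-text output, at the cost of silently dropping it.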
146
packages/cli/test/skills/analytics-skill-content.test.ts
Normal file
@@ -0,0 +1,146 @@
import { readFileSync } from 'node:fs';
import { fileURLToPath } from 'node:url';
import { describe, expect, it } from 'vitest';
import { SkillsRegistryService } from '../../src/context/skills/skills-registry.service.js';

const skillPath = fileURLToPath(new URL('../../src/skills/analytics/SKILL.md', import.meta.url));
const skill = readFileSync(skillPath, 'utf-8');

describe('analytics SKILL.md SQL craft', () => {
  it('keeps the frontmatter parseable as ktx-analytics', () => {
    const service = new SkillsRegistryService({ skillsDir: '/tmp' });
    expect(service.parseFrontmatter(skill).name).toBe('ktx-analytics');
  });

  it('groups the craft under the five sub-headings', () => {
    expect(skill).toContain('<sql_craft>');
    expect(skill).toContain('</sql_craft>');
    expect(skill).toContain('**Schema discovery before writing SQL**');
    expect(skill).toContain('**Composition**');
    expect(skill).toContain('**Ordering & aggregation determinism**');
    expect(skill).toContain('**Numeric precision**');
    expect(skill).toContain('**Answer completeness / interpretation**');
  });

  it('represents every craft behavior', () => {
    const phrases = [
      'Sample before you compose', // inspect representative rows
      'Cast to the real type before comparing', // string-vs-number compares
      'Build incrementally', // one CTE at a time
      'Avoid fan-out joins', // grain / pre-aggregate
      'the danger is cumulative', // multi-hop fan-out generalization
      'Verify the grain holds across each join', // affirmative grain-verification habit
      'Make the ordering deterministic', // window tie-breaker
      'Filter after the window, not before', // window-then-filter
      'Round only at the end', // precision + truncation
      'Macro vs micro average', // AVG(group) vs SUM/SUM
      'Top / highest / most / lowest', // winning row(s) only
      'For each X / per X / by X', // one row per X
      'Complete the panel', // full-domain spine for "each/every/all" panels
      'Default by additivity', // COALESCE 0 for additive, NULL otherwise
      'Keep the inputs to a derived value', // inputs alongside ratio
      'Project BOTH identity and label', // entity identifier
      'Diagnose empty results', // relax filters one at a time
      'Cumulative / running total', // explicit unbounded-preceding frame (spec 11)
      'Rolling window over calendar time', // calendar range, not row count (spec 11)
      'minimum periods', // emit NULL until the window is full (spec 11)
      'Period-over-period', // LAG + guarded growth ratio (spec 11)
      'Parse text-encoded numerics before doing math on them', // detect text-encoded numbers (spec 12)
      'Strip, scale, and cast in one early CTE', // parse/scale early (spec 12)
      'Confirm the parse covered every value', // failure-detecting cast coverage (spec 12)
      'Answer every requested output', // multi-part/multi-output umbrella over identity+inputs (spec 14)
      'Final completeness check', // re-read the question, confirm the projection covers all four facets (spec 14)
      "Don't over-project", // match the request exactly, no padding columns (spec 14)
    ];
    for (const phrase of phrases) {
      expect(skill).toContain(phrase);
    }
  });

  it('ships six dialect-agnostic worked examples: window-then-filter, multi-hop fan-out, panel-completeness spine, cumulative running total, text-encoded-numeric parse-and-scale, multi-part output completeness', () => {
    const sqlFences = skill.match(/```sql/g) ?? [];
    expect(sqlFences).toHaveLength(6);
    // window-then-filter (spec 07)
    expect(skill).toContain('WITH ranked AS');
    expect(skill).toContain('ROW_NUMBER() OVER');
    expect(skill).toContain('WHERE seq = 1');
    // multi-hop fan-out, pre-aggregated right side + count-only escape hatch (spec 09)
    expect(skill).toContain('WITH returned_orders AS');
    expect(skill).toContain('COUNT(DISTINCT o.order_id)');
    // panel completeness: distinct-dimension spine -> LEFT JOIN -> COALESCE (spec 10)
    expect(skill).toContain('SELECT DISTINCT region_id FROM regions');
    expect(skill).toContain('LEFT JOIN');
    expect(skill).toMatch(/COALESCE\(/);
    // cumulative running total: explicit unbounded-preceding frame + complete tie-breaker (spec 11)
    expect(skill).toContain('ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW');
    expect(skill).toContain('ORDER BY txn_date, txn_id');
    // text-encoded numeric: strip with chained REPLACE -> CASE suffix scale -> CAST (spec 12)
    expect(skill).toContain('WITH parsed AS');
    expect(skill).toContain('REPLACE(');
    expect(skill).toMatch(/AS DECIMAL\(/);
    expect(skill).toContain("LIKE '%K' THEN 1000");
    // multi-part output completeness: a column per clause + entity identity, at grain (spec 14)
    expect(skill).toContain('region_monthly');
    expect(skill).toContain('MAX(rm.monthly_orders)');
    expect(skill).toContain('MIN(rm.monthly_orders)');
    expect(skill).toContain('MAX(rm.monthly_orders) - MIN(rm.monthly_orders)');
    expect(skill).toContain('r.region_id, r.region_name');
  });

  it('leaves the existing interactive guidance intact', () => {
    expect(skill).toContain('<workflow>');
    expect(skill).toContain('<rules>');
    expect(skill).toContain('<examples>');
    expect(skill).toContain('Always run `discover_data` before writing SQL.');
    expect(skill).toContain('Treat a `dictionary_search` miss as non-authoritative.');
    expect(skill).toContain('ARR is reported in cents');
  });

  it('points to the dialect-notes tool without inlining dialect syntax (spec 08)', () => {
    // Engine-specific syntax lives behind the sql_dialect_notes MCP tool; the flat
    // skill only names the tool (the dialect-clean assertion above still holds).
    expect(skill).toContain('sql_dialect_notes');
  });

  it('stays dialect-agnostic and free of any benchmark/grader reference', () => {
    const banned = [
      /\bQUALIFY\b/i,
      /strftime/i,
      /julianday/i,
      /generate_series/i, // postgres-only series generator — belongs in dialect notes, not the skill
      /GENERATE_DATE_ARRAY/i, // bigquery-only series generator — belongs in dialect notes, not the skill
      /\bRANGE\b[\s\S]{0,40}\bINTERVAL\b/i, // inline dialect range-interval frame — belongs in dialect notes, not the skill
      /\bSAFE_CAST\b/i, // bigquery failure-detecting cast — belongs in dialect notes, not the skill
      /\bTRY_CAST\b/i, // snowflake/tsql failure-detecting cast — belongs in dialect notes, not the skill
      /\bTRY_TO_NUMBER\b/i, // snowflake failure-detecting cast — belongs in dialect notes, not the skill
      /\bREGEXP_REPLACE\b/i, // dialect regex strip — the portable strip is chained REPLACE
      /toFloat64OrNull/i, // clickhouse failure-detecting cast — belongs in dialect notes, not the skill
      /\bGLOB\b/i, // sqlite numeric-pattern guard — belongs in dialect notes, not the skill
      /\bspider\b/i,
      /\bbenchmark\b/i,
      /\bgold\b/i,
      /\bgrader\b/i,
    ];
    for (const pattern of banned) {
      expect(skill).not.toMatch(pattern);
    }
    // no BigQuery/Snowflake-style backtick-quoted three-part FQTN
    expect(skill).not.toMatch(/`[A-Za-z_]\w*\.[A-Za-z_]\w*\.[A-Za-z_]\w*`/);
  });

  it('never anchors relative time to the data maximum date', () => {
    // Phrase-level guard (not a raw MAX() grep — MAX() is a legitimate aggregate):
    // no single line ties "recent"/"past N <unit>" to a MAX(...) over the data.
    const relativeTime = /(recent|past\s+\w+\s+(day|week|month|year)s?)/i;
    const maxCall = /\bMAX\s*\(/i;
    for (const line of skill.split('\n')) {
      if (maxCall.test(line)) {
        expect(line).not.toMatch(relativeTime);
      }
    }
  });

  it('stays comfortably within the skill size budget', () => {
    expect(skill.split('\n').length).toBeLessThan(500);
  });
});
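The relative-time test works per line: a line is a problem only when it contains both a `MAX(` call and relative-time wording. A small sketch of that guard as a reusable function is shown here; the two regular expressions are copied from the test, while `findRelativeTimeAnchors` and the sample text are hypothetical.

```typescript
// Regexes copied from the test; neither uses the `g` flag, so `.test` keeps no state between calls.
const relativeTimePattern = /(recent|past\s+\w+\s+(day|week|month|year)s?)/i;
const maxCallPattern = /\bMAX\s*\(/i;

// Hypothetical guard: return the lines that pair a MAX(...) call with relative-time wording.
function findRelativeTimeAnchors(text: string): string[] {
  return text
    .split('\n')
    .filter((line) => maxCallPattern.test(line) && relativeTimePattern.test(line));
}

// Made-up sample text, not taken from SKILL.md.
const sampleSkillText = [
  'SELECT MAX(rm.monthly_orders) AS peak_orders FROM region_monthly rm',
  'For the past 30 days, filter relative to MAX(order_date)',
  'Ask what "recent" means before filtering.',
].join('\n');

const offendingLines = findRelativeTimeAnchors(sampleSkillText);
console.log(offendingLines);
```

Only the second sample line is reported: the first uses `MAX()` as an ordinary aggregate, and the third mentions "recent" without any `MAX()` call.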
@@ -10,6 +10,11 @@ import {
  buildProjectStatus,
  renderProjectStatus,
} from '../src/status-project.js';
import { initKtxProject, loadKtxProject } from '../src/context/project/project.js';
import { serializeKtxProjectConfig } from '../src/context/project/config.js';
import { writeLocalKnowledgePage } from '../src/context/wiki/local-knowledge.js';

const stubClaudeCodeAuthProbeForFileBacked = async () => ({ ok: true as const });

function projectWithConfig(config: KtxProjectConfig): KtxLocalProject {
  return {
@@ -646,8 +651,8 @@ describe('buildLocalStatsStatus', () => {
     expect(stats.unavailable).toBeUndefined();
     expect(stats.ingest.totalCompletedRuns).toBe(3);
     expect(stats.ingest.perConnection).toEqual([
-      { connectionId: 'analytics', adapter: 'live-database', lastCompletedAt: '2026-05-10T10:00:00Z' },
-      { connectionId: 'docs', adapter: 'notion', lastCompletedAt: '2026-05-01T10:00:00Z' },
+      { connectionId: 'analytics', adapter: 'live-database', lastCompletedAt: '2026-05-10T10:00:00Z', skippedObjects: [] },
+      { connectionId: 'docs', adapter: 'notion', lastCompletedAt: '2026-05-01T10:00:00Z', skippedObjects: [] },
     ]);
     expect(stats.wikiPages).toEqual([
       { scope: 'GLOBAL', count: 2, embeddedCount: 1 },
@@ -691,6 +696,47 @@ describe('buildLocalStatsStatus', () => {
    expect(stats.wikiPages).toEqual([]);
    expect(stats.semanticLayer).toEqual([]);
  });

  it('surfaces skipped objects from the latest report body', async () => {
    await mkdir(join(tempDir, '.ktx'), { recursive: true });
    const dbPath = join(tempDir, '.ktx', 'db.sqlite');
    const db = new Database(dbPath);
    const body = JSON.stringify({
      fetch: {
        status: 'partial',
        retryRecommended: false,
        warnings: [],
        skipped: [
          { rawPath: '', entityType: 'database_object', entityId: 'emp_hire_periods_with_name', severity: 'warning', statusCode: null, message: 'no such column: ehp.start_date', retryRecommended: false },
        ],
      },
    });
    db.exec(`
      CREATE TABLE local_ingest_reports (
        run_id TEXT PRIMARY KEY,
        adapter TEXT NOT NULL,
        connection_id TEXT NOT NULL,
        status TEXT NOT NULL,
        completed_at TEXT NOT NULL,
        raw_content_hashes_json TEXT NOT NULL,
        body_json TEXT NOT NULL
      );
    `);
    db.prepare(
      `INSERT INTO local_ingest_reports VALUES ('r1', 'live-database', 'oracle_sql', 'done', '2026-06-13T10:00:00Z', '{}', ?)`,
    ).run(body);
    db.close();

    const stats = await buildLocalStatsStatus(projectIn(tempDir));
    expect(stats.ingest.perConnection).toEqual([
      {
        connectionId: 'oracle_sql',
        adapter: 'live-database',
        lastCompletedAt: '2026-06-13T10:00:00Z',
        skippedObjects: [{ name: 'emp_hire_periods_with_name', reason: 'no such column: ehp.start_date' }],
      },
    ]);
  });
});

describe('renderProjectStatus Local data', () => {
@@ -701,7 +747,12 @@ describe('renderProjectStatus Local data', () => {
       ingest: {
         totalCompletedRuns: 3,
         perConnection: [
-          { connectionId: 'analytics', adapter: 'live-database', lastCompletedAt: new Date(Date.now() - 60 * 60 * 1000).toISOString() },
+          {
+            connectionId: 'analytics',
+            adapter: 'live-database',
+            lastCompletedAt: new Date(Date.now() - 60 * 60 * 1000).toISOString(),
+            skippedObjects: [],
+          },
         ],
       },
       wikiPages: [
@@ -727,6 +778,7 @@ describe('renderProjectStatus Local data', () => {
    expect(rendered).toContain('Wiki');
    expect(rendered).not.toContain('Knowledge');
    expect(rendered).toContain('3 completed runs');
    expect(rendered).not.toContain('skipped —');
    expect(rendered).toContain('GLOBAL=2 (2 embedded)');
    expect(rendered).toContain('PROJECT=1 (0 embedded)');
    expect(rendered).toContain('12 sources (10 embedded) · 200 dictionary values');
@@ -736,6 +788,29 @@ describe('renderProjectStatus Local data', () => {
    expect(rendered).not.toMatch(/semantic-layer=\d+ yaml/);
  });

  it('renders a per-connection skipped-objects line when the latest ingest skipped objects', async () => {
    const project = projectWithConfig(baseProjectConfig());
    const status = await buildProjectStatus(project, { claudeCodeAuthProbe: stubClaudeCodeAuthProbe });
    status.localStats = {
      ingest: {
        totalCompletedRuns: 1,
        perConnection: [
          {
            connectionId: 'oracle_sql',
            adapter: 'live-database',
            lastCompletedAt: new Date(Date.now() - 60 * 60 * 1000).toISOString(),
            skippedObjects: [{ name: 'emp_hire_periods_with_name', reason: 'no such column: ehp.start_date' }],
          },
        ],
      },
      wikiPages: [],
      semanticLayer: [],
      projectDir: { dbSqliteBytes: 4096, ktxCacheBytes: 0, rawSources: { fileCount: 0, bytes: 0 } },
    };
    const rendered = renderProjectStatus(status, { useColor: false });
    expect(rendered).toContain('1 object skipped — emp_hire_periods_with_name: no such column: ehp.start_date');
  });

  it('renders unavailable note when DB is missing', async () => {
    const project = projectWithConfig(baseProjectConfig());
    const status = await buildProjectStatus(project, { claudeCodeAuthProbe: stubClaudeCodeAuthProbe });
@@ -755,3 +830,67 @@ describe('renderProjectStatus Local data', () => {
    expect(rendered).toContain('no .ktx/db.sqlite yet');
  });
});

describe('buildProjectStatus connection-scoped wiki warnings', () => {
  let tempDir: string;

  beforeEach(async () => {
    tempDir = await mkdtemp(join(tmpdir(), 'ktx-status-connections-'));
  });

  afterEach(async () => {
    await rm(tempDir, { recursive: true, force: true });
  });

  async function projectWithConnections(ids: string[]): Promise<KtxLocalProject> {
    const projectDir = join(tempDir, 'project');
    await initKtxProject({ projectDir });
    const project = await loadKtxProject({ projectDir });
    project.config.llm = { ...project.config.llm, provider: { backend: 'claude-code' }, models: { default: 'sonnet' } };
    for (const id of ids) {
      project.config.connections[id] = { driver: 'sqlite', url: `file:${id}.db` };
    }
    await project.fileStore.writeFile(
      'ktx.yaml',
      serializeKtxProjectConfig(project.config),
      'ktx',
      'ktx@example.com',
      'configure connections',
    );
    return loadKtxProject({ projectDir });
  }

  it('warns when a wiki page references a connection id absent from ktx.yaml', async () => {
    const project = await projectWithConnections(['sales_db']);
    await writeLocalKnowledgePage(project, {
      key: 'orders-removed',
      scope: 'GLOBAL',
      summary: 'Orders for a removed db',
      content: 'Orders.',
      connections: ['removed_db'],
    });

    const status = await buildProjectStatus(project, { claudeCodeAuthProbe: stubClaudeCodeAuthProbeForFileBacked });
    expect(status.warnings).toEqual(
      expect.arrayContaining([
        expect.objectContaining({
          message: expect.stringContaining('reference connection id(s) not in ktx.yaml: removed_db'),
        }),
      ]),
    );
  });

  it('does not warn when all referenced connection ids are configured', async () => {
    const project = await projectWithConnections(['sales_db']);
    await writeLocalKnowledgePage(project, {
      key: 'orders-sales',
      scope: 'GLOBAL',
      summary: 'Sales orders',
      content: 'Orders.',
      connections: ['sales_db'],
    });

    const status = await buildProjectStatus(project, { claudeCodeAuthProbe: stubClaudeCodeAuthProbeForFileBacked });
    expect(status.warnings.some((warning) => warning.message.includes('not in ktx.yaml'))).toBe(false);
  });
});
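The skipped-objects test stores a report body whose `fetch.skipped` entries carry `entityId` and `message`, and expects `{ name, reason }` pairs back. One way that mapping could look is sketched below. `extractSkippedObjects` and its tolerant handling of malformed bodies are assumptions for illustration, not the code under test.

```typescript
// Fields of a skipped entry that the mapping needs (the fixture carries more).
interface SkippedEntry {
  entityId: string;
  message: string;
}

interface SkippedObject {
  name: string;
  reason: string;
}

// Hypothetical helper: pull { name, reason } pairs out of a stored report body.
function extractSkippedObjects(bodyJson: string): SkippedObject[] {
  let parsed: unknown;
  try {
    parsed = JSON.parse(bodyJson);
  } catch {
    return []; // Assumption: an unreadable body is treated as "nothing skipped".
  }
  const skipped = (parsed as { fetch?: { skipped?: unknown } } | null)?.fetch?.skipped;
  if (!Array.isArray(skipped)) return [];
  return skipped
    .filter(
      (entry): entry is SkippedEntry =>
        typeof entry === 'object' &&
        entry !== null &&
        typeof (entry as SkippedEntry).entityId === 'string' &&
        typeof (entry as SkippedEntry).message === 'string',
    )
    .map((entry) => ({ name: entry.entityId, reason: entry.message }));
}

const reportBody = JSON.stringify({
  fetch: {
    status: 'partial',
    skipped: [{ entityId: 'emp_hire_periods_with_name', message: 'no such column: ehp.start_date' }],
  },
});
const skippedObjects = extractSkippedObjects(reportBody);
console.log(skippedObjects);
```

Returning an empty list for missing or malformed input matches the `skippedObjects: []` default that the updated fixtures expect for connections with nothing skipped.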
@@ -61,6 +61,7 @@ describe('buildProjectStackSnapshotFields', () => {
          profileSampleRows: 10000,
          profileConcurrency: 4,
          validationConcurrency: 4,
          detectionBudgetMs: 600000,
        },
      },
      storage: {
@@ -2,6 +2,26 @@ import { describe, expect, it, vi } from 'vitest';
import type { MemoryIngestStatus } from '../src/context/memory/memory-runs.js';
import type { KtxLocalProject } from '../src/context/project/project.js';
import { runKtxTextIngest, type TextMemoryIngestPort } from '../src/text-ingest.js';
import type { VerbatimIngestItem, VerbatimIngestorPort } from '../src/verbatim-ingest.js';

function fakeVerbatim(
  options: { calls?: VerbatimIngestItem[]; throwOn?: (item: VerbatimIngestItem) => boolean } = {},
): VerbatimIngestorPort {
  return {
    ingest: vi.fn(async (item: VerbatimIngestItem) => {
      options.calls?.push(item);
      if (options.throwOn?.(item)) {
        throw new Error(`verbatim write failed for ${item.origin.kind}`);
      }
      return {
        pageKey: item.origin.kind === 'file' && item.origin.path ? 'haversine' : 'page',
        outcome: 'written' as const,
        connections: item.connectionId ? [item.connectionId] : [],
        commitHash: null,
      };
    }),
  };
}

function makeIo(options: { isTTY?: boolean } = {}) {
  let stdout = '';
@@ -336,4 +356,102 @@ describe('runKtxTextIngest', () => {
    ).resolves.toBe(1);
    expect(emptyIo.stderr()).toContain('Text item "text-1" is empty');
  });

  it('routes verbatim file items to the verbatim ingestor instead of the memory agent', async () => {
    const io = makeIo();
    const calls: VerbatimIngestItem[] = [];
    const verbatim = fakeVerbatim({ calls });
    const createMemoryIngest = vi.fn(() => fakeIngest());

    await expect(
      runKtxTextIngest(
        {
          projectDir: '/tmp/project',
          texts: [],
          files: ['/tmp/docs/haversine.md'],
          userId: 'local-cli',
          json: true,
          failFast: false,
          verbatim: true,
        },
        io.io,
        {
          loadProject: vi.fn(async () => fakeProject()),
          createMemoryIngest,
          createVerbatimIngestor: vi.fn(() => verbatim),
          readFile: vi.fn(async (path) => `file:${path}`),
          now: () => 1,
        },
      ),
    ).resolves.toBe(0);

    expect(createMemoryIngest).not.toHaveBeenCalled();
    expect(verbatim.ingest).toHaveBeenCalledTimes(1);
    expect(calls[0]?.origin).toEqual({ kind: 'file', path: '/tmp/docs/haversine.md' });
    expect(calls[0]?.content).toBe('file:/tmp/docs/haversine.md');
    expect(JSON.parse(io.stdout())).toMatchObject({
      status: 'done',
      results: [{ status: 'done', captured: { wiki: ['haversine'] } }],
    });
  });

  it('routes verbatim inline text with a text origin and forwards the connection id', async () => {
    const io = makeIo();
    const calls: VerbatimIngestItem[] = [];
    const verbatim = fakeVerbatim({ calls });

    await expect(
      runKtxTextIngest(
        {
          projectDir: '/tmp/project',
          texts: ['# Title\n\nbody'],
          files: [],
          connectionId: 'db1',
          userId: 'local-cli',
          json: true,
          failFast: false,
          verbatim: true,
        },
        io.io,
        {
          loadProject: vi.fn(async () => fakeProject()),
          createVerbatimIngestor: vi.fn(() => verbatim),
          now: () => 1,
        },
      ),
    ).resolves.toBe(0);

    expect(calls[0]?.origin).toEqual({ kind: 'text' });
    expect(calls[0]?.content).toBe('# Title\n\nbody');
    expect(calls[0]?.connectionId).toBe('db1');
  });

  it('fails the run when a verbatim item throws and honors fail-fast', async () => {
    const io = makeIo();
    const calls: VerbatimIngestItem[] = [];
    const verbatim = fakeVerbatim({ calls, throwOn: () => true });

    await expect(
      runKtxTextIngest(
        {
          projectDir: '/tmp/project',
          texts: [],
          files: ['/tmp/a.md', '/tmp/b.md'],
          userId: 'local-cli',
          json: true,
          failFast: true,
          verbatim: true,
        },
        io.io,
        {
          loadProject: vi.fn(async () => fakeProject()),
          createVerbatimIngestor: vi.fn(() => verbatim),
          readFile: vi.fn(async (path) => `file:${path}`),
          now: () => 1,
        },
      ),
    ).resolves.toBe(1);

    expect(verbatim.ingest).toHaveBeenCalledTimes(1);
  });
});
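The last test above pins down the fail-fast contract: with two files and an ingestor that always throws, the run resolves to exit code 1 and the ingestor is called once. The loop below is a rough sketch of that contract under stated assumptions; `runItems`, `IngestItem`, and `ItemResult` are hypothetical names, not the shapes used by `runKtxTextIngest`.

```typescript
interface IngestItem {
  id: string;
  content: string;
}

interface ItemResult {
  id: string;
  status: 'done' | 'error';
  error?: string;
}

// Hypothetical runner: ingest items in order, optionally stopping at the first failure.
async function runItems(
  items: IngestItem[],
  ingest: (item: IngestItem) => Promise<void>,
  failFast: boolean,
): Promise<{ exitCode: 0 | 1; results: ItemResult[] }> {
  const results: ItemResult[] = [];
  for (const item of items) {
    try {
      await ingest(item);
      results.push({ id: item.id, status: 'done' });
    } catch (error) {
      const message = error instanceof Error ? error.message : String(error);
      results.push({ id: item.id, status: 'error', error: message });
      if (failFast) break; // Remaining items are never attempted.
    }
  }
  const failed = results.some((result) => result.status === 'error');
  return { exitCode: failed ? 1 : 0, results };
}

// Two items, an ingestor that always throws, fail-fast enabled.
const attemptedIds: string[] = [];
const failFastRun = runItems(
  [
    { id: 'a', content: 'A' },
    { id: 'b', content: 'B' },
  ],
  async (item) => {
    attemptedIds.push(item.id);
    throw new Error(`write failed for ${item.id}`);
  },
  true,
);
failFastRun.then((outcome) => console.log(outcome.exitCode, attemptedIds));
```

Without `failFast`, the same loop would attempt every item and still report a non-zero exit code, which keeps the error visible while letting unaffected items through.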
375
packages/cli/test/verbatim-ingest.test.ts
Normal file
@@ -0,0 +1,375 @@
import { createHash } from 'node:crypto';
import { mkdtemp, readFile, rm } from 'node:fs/promises';
import { tmpdir } from 'node:os';
import { join } from 'node:path';
import { afterEach, beforeEach, describe, expect, it } from 'vitest';
import type { KtxEmbeddingPort } from '../src/context/core/embedding.js';
import type { KtxLlmRuntimePort } from '../src/context/llm/runtime-port.js';
import { initKtxProject, loadKtxProject, type KtxLocalProject } from '../src/context/project/project.js';
import { readLocalKnowledgePage, searchLocalKnowledgePages } from '../src/context/wiki/local-knowledge.js';
import {
  buildVerbatimFrontmatter,
  createLocalProjectVerbatimIngestor,
  deriveDegradedSummary,
  deriveVerbatimPageKey,
  splitInputDocument,
} from '../src/verbatim-ingest.js';

describe('splitInputDocument', () => {
  it('splits leading YAML frontmatter from the body', () => {
    const result = splitInputDocument('---\nsummary: In doc\neffective_date: 2024-01-01\n---\n\nBody here\n');
    expect(result.frontmatter).toEqual({ summary: 'In doc', effective_date: '2024-01-01' });
    expect(result.body).toBe('Body here');
  });

  it('treats a document without frontmatter as an empty-frontmatter body', () => {
    const result = splitInputDocument('# Title\n\ncontent\n');
    expect(result.frontmatter).toEqual({});
    expect(result.body).toBe('# Title\n\ncontent');
  });
});

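The two `splitInputDocument` tests describe a simple contract: a leading `---` block becomes the frontmatter object, the rest becomes the trimmed body, and a document without such a block gets an empty frontmatter. A deliberately limited sketch of that contract follows. `splitDocumentSketch` is a hypothetical stand-in that only understands flat `key: value` lines and keeps every value as a string; the real function presumably uses a proper YAML parser.

```typescript
interface SplitDocument {
  frontmatter: Record<string, string>;
  body: string;
}

// Hypothetical splitter: flat `key: value` lines only, values stay strings (not real YAML).
function splitDocumentSketch(input: string): SplitDocument {
  const match = /^---\r?\n([\s\S]*?)\r?\n---\r?\n?/.exec(input);
  if (!match) {
    return { frontmatter: {}, body: input.trim() };
  }
  const frontmatter: Record<string, string> = {};
  for (const line of (match[1] ?? '').split(/\r?\n/)) {
    const separator = line.indexOf(':');
    if (separator <= 0) continue; // Skip blank or malformed lines.
    frontmatter[line.slice(0, separator).trim()] = line.slice(separator + 1).trim();
  }
  return { frontmatter, body: input.slice(match[0].length).trim() };
}

const withFrontmatter = splitDocumentSketch(
  '---\nsummary: In doc\neffective_date: 2024-01-01\n---\n\nBody here\n',
);
const withoutFrontmatter = splitDocumentSketch('# Title\n\ncontent\n');
console.log(withFrontmatter, withoutFrontmatter);
```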
describe('deriveVerbatimPageKey', () => {
  it('derives a file key from the basename without extension', () => {
    expect(deriveVerbatimPageKey({ kind: 'file', path: '/docs/haversine-formula.md' }, 'irrelevant')).toBe(
      'haversine-formula',
    );
  });

  it('slugifies a messy file basename', () => {
    expect(deriveVerbatimPageKey({ kind: 'file', path: '/docs/RFM Buckets.md' }, 'irrelevant')).toBe('RFM-Buckets');
  });

  it('derives an inline-text key from a leading Markdown heading', () => {
    expect(deriveVerbatimPageKey({ kind: 'text' }, '# Haversine Formula\n\ndetails')).toBe('Haversine-Formula');
  });

  it('rejects inline text with no leading heading', () => {
    expect(() => deriveVerbatimPageKey({ kind: 'text' }, 'no heading here')).toThrow(/heading|--file/);
  });

  it('derives a stdin key from a leading heading like inline text', () => {
    expect(deriveVerbatimPageKey({ kind: 'stdin' }, '## RFM Buckets\n\nrows')).toBe('RFM-Buckets');
  });
});

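These tests fix the key rules: files use the basename without its extension, inline text and stdin use the leading Markdown heading, whitespace becomes a hyphen, and case is preserved. The sketch below reproduces those outcomes under assumptions of its own. `derivePageKeySketch`, the `Origin` type, and the exact slug character set are hypothetical, since the tests only show spaces being replaced.

```typescript
// Simplified origin type for the sketch; the real type may differ.
type Origin = { kind: 'file'; path: string } | { kind: 'text' } | { kind: 'stdin' };

// Assumed slug rule: runs of characters outside [A-Za-z0-9_-] collapse to one hyphen; case is kept.
function slugify(value: string): string {
  return value
    .trim()
    .replace(/[^A-Za-z0-9_-]+/g, '-')
    .replace(/^-+|-+$/g, '');
}

function derivePageKeySketch(origin: Origin, content: string): string {
  if (origin.kind === 'file') {
    const basename = origin.path.split(/[\\/]/).pop() ?? '';
    const stem = basename.replace(/\.[^.]+$/, ''); // Drop the final extension.
    return slugify(stem);
  }
  // Inline text and stdin: the first non-empty line must be a Markdown heading.
  const firstLine = content.split('\n').find((line) => line.trim() !== '') ?? '';
  const heading = /^#{1,6}\s+(.+)$/.exec(firstLine.trim());
  if (!heading) {
    throw new Error('No leading Markdown heading; add one or pass the document with --file');
  }
  return slugify(heading[1] ?? '');
}

const fileKey = derivePageKeySketch({ kind: 'file', path: '/docs/RFM Buckets.md' }, 'irrelevant');
const textKey = derivePageKeySketch({ kind: 'text' }, '# Haversine Formula\n\ndetails');
console.log(fileKey, textKey);
```

Requiring a heading for inline text avoids inventing a key from arbitrary prose, which matters because page keys share one flat, globally unique namespace.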
describe('deriveDegradedSummary', () => {
  it('uses the leading heading text when present', () => {
    expect(deriveDegradedSummary('# Haversine Formula\n\nThe formula computes distance.')).toBe('Haversine Formula');
  });

  it('falls back to the first non-empty sentence when there is no heading', () => {
    expect(deriveDegradedSummary('The haversine formula computes great-circle distance. More text.')).toBe(
      'The haversine formula computes great-circle distance.',
    );
  });
});

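When no LLM backend is available, the summary has to come from the document itself: the heading text if there is one, otherwise the first sentence. A small sketch of that fallback is given here. `degradedSummarySketch` is hypothetical, and its sentence rule (a terminator followed by whitespace or end of line) is an assumption that the tests do not spell out.

```typescript
// Hypothetical fallback summary: heading text first, otherwise the first sentence.
function degradedSummarySketch(body: string): string {
  const lines = body
    .split('\n')
    .map((line) => line.trim())
    .filter((line) => line !== '');
  const first = lines[0] ?? '';
  const heading = /^#{1,6}\s+(.+)$/.exec(first);
  if (heading) return (heading[1] ?? '').trim();
  // A terminator counts only when followed by whitespace or end of line, so "0.5" does not split.
  const sentence = /^.*?[.!?](?=\s|$)/.exec(first);
  return sentence ? sentence[0] : first;
}

const fromHeading = degradedSummarySketch('# Haversine Formula\n\nThe formula computes distance.');
const fromSentence = degradedSummarySketch(
  'The haversine formula computes great-circle distance. More text.',
);
console.log(fromHeading, '|', fromSentence);
```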
describe('buildVerbatimFrontmatter', () => {
  it('gap-fills absent fields with generated metadata and defaults usage_mode to auto', () => {
    const fm = buildVerbatimFrontmatter({
      inputFrontmatter: {},
      summary: 'generated summary',
      tags: ['finance'],
      slRefs: ['orders'],
    });
    expect(fm.summary).toBe('generated summary');
    expect(fm.tags).toEqual(['finance']);
    expect(fm.sl_refs).toEqual(['orders']);
    expect(fm.usage_mode).toBe('auto');
  });

  it('preserves an explicit input summary instead of the generated one', () => {
    const fm = buildVerbatimFrontmatter({
      inputFrontmatter: { summary: 'authoritative summary' },
      summary: 'generated summary',
      tags: ['x'],
      slRefs: [],
    });
    expect(fm.summary).toBe('authoritative summary');
  });

  it('passes through unknown frontmatter fields verbatim', () => {
    const fm = buildVerbatimFrontmatter({
      inputFrontmatter: { effective_date: '2024-01-01', version: 3, owner: 'data-team' },
      summary: 'generated summary',
      tags: [],
      slRefs: [],
    });
    expect(fm.effective_date).toBe('2024-01-01');
    expect(fm.version).toBe(3);
    expect(fm.owner).toBe('data-team');
  });

  it('keeps an explicit usage_mode', () => {
    const fm = buildVerbatimFrontmatter({
      inputFrontmatter: { usage_mode: 'always' },
      summary: 'generated summary',
      tags: [],
      slRefs: [],
    });
    expect(fm.usage_mode).toBe('always');
  });

  it('sets connections from the flag when the input declares none', () => {
    const fm = buildVerbatimFrontmatter({
      inputFrontmatter: {},
      summary: 's',
      tags: [],
      slRefs: [],
      connectionId: 'db1',
    });
    expect(fm.connections).toEqual(['db1']);
  });

  it('keeps input connections when the flag matches', () => {
    const fm = buildVerbatimFrontmatter({
      inputFrontmatter: { connections: ['db1'] },
      summary: 's',
      tags: [],
      slRefs: [],
      connectionId: 'db1',
    });
    expect(fm.connections).toEqual(['db1']);
  });

  it('keeps input connections when no flag is given', () => {
    const fm = buildVerbatimFrontmatter({
      inputFrontmatter: { connections: ['db2'] },
      summary: 's',
      tags: [],
      slRefs: [],
    });
    expect(fm.connections).toEqual(['db2']);
  });

  it('errors when input connections differ from the flag', () => {
    expect(() =>
      buildVerbatimFrontmatter({
        inputFrontmatter: { connections: ['db2'] },
        summary: 's',
        tags: [],
        slRefs: [],
        connectionId: 'db1',
      }),
    ).toThrow(/connection/i);
  });
});

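Taken together, the `buildVerbatimFrontmatter` tests describe a gap-fill merge: generated metadata supplies defaults, anything the document states explicitly wins, unknown fields pass through, and a connection flag that disagrees with the document is an error. The sketch below is one way to satisfy those cases. `buildFrontmatterSketch` is hypothetical, and its strict rule for multi-connection documents (the flag must equal the single declared connection) is an assumption the tests do not cover.

```typescript
type Frontmatter = Record<string, unknown>;

interface BuildInput {
  inputFrontmatter: Frontmatter;
  summary: string;
  tags: string[];
  slRefs: string[];
  connectionId?: string;
}

// Hypothetical merge: generated metadata fills gaps, explicit input fields always win.
function buildFrontmatterSketch(input: BuildInput): Frontmatter {
  const declared = input.inputFrontmatter['connections'];
  // `connections` may be a single string or a list.
  const declaredList =
    typeof declared === 'string' ? [declared] : Array.isArray(declared) ? declared.map(String) : [];

  if (input.connectionId !== undefined && declaredList.length > 0) {
    const matches = declaredList.length === 1 && declaredList[0] === input.connectionId;
    if (!matches) {
      throw new Error(
        `Document declares connection(s) ${declaredList.join(', ')} but the flag says ${input.connectionId}`,
      );
    }
  }

  const result: Frontmatter = {
    summary: input.summary,
    tags: input.tags,
    sl_refs: input.slRefs,
    usage_mode: 'auto',
    ...input.inputFrontmatter, // Explicit and unknown input fields override the defaults.
  };
  if (declaredList.length > 0) {
    result['connections'] = declaredList;
  } else if (input.connectionId !== undefined) {
    result['connections'] = [input.connectionId];
  }
  return result;
}

const merged = buildFrontmatterSketch({
  inputFrontmatter: { summary: 'authoritative summary', version: 3 },
  summary: 'generated summary',
  tags: ['finance'],
  slRefs: [],
  connectionId: 'db1',
});
console.log(merged);
```

Spreading the input frontmatter last is what makes explicit values and unknown fields survive; raising on a connection mismatch, as the final test requires, avoids silently rescoping an authoritative document.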
class FakeEmbeddingPort implements KtxEmbeddingPort {
  readonly maxBatchSize = 16;

  async computeEmbedding(text: string): Promise<number[]> {
    return /haversine|distance|geospatial|sphere|proximity|great-circle/i.test(text) ? [1, 0] : [0, 1];
  }

  async computeEmbeddingsBulk(texts: string[]): Promise<number[][]> {
    return Promise.all(texts.map((text) => this.computeEmbedding(text)));
  }
}

function fakeLlmRuntime(metadata: { summary: string; tags: string[]; sl_refs: string[] }): KtxLlmRuntimePort {
  return {
    async generateText() {
      throw new Error('generateText is not used by verbatim ingest');
    },
    async generateObject(input) {
      return input.schema.parse(metadata);
    },
    async runAgentLoop() {
      throw new Error('runAgentLoop is not used by verbatim ingest');
    },
    subprocessForkSpec() {
      return null;
    },
  };
}

function throwingLlmRuntime(): KtxLlmRuntimePort {
  return {
    async generateText() {
      throw new Error('generateText is not used by verbatim ingest');
    },
    async generateObject() {
      throw new Error('rate limit exceeded');
    },
    async runAgentLoop() {
      throw new Error('runAgentLoop is not used by verbatim ingest');
    },
    subprocessForkSpec() {
      return null;
    },
  };
}

describe('LocalVerbatimIngestor', () => {
  let projectDir: string;
  let project: KtxLocalProject;

  beforeEach(async () => {
    projectDir = await mkdtemp(join(tmpdir(), 'ktx-verbatim-'));
    await initKtxProject({ projectDir });
    project = await loadKtxProject({ projectDir });
  });

  afterEach(async () => {
    await rm(projectDir, { recursive: true, force: true });
  });

  it('stores the document body byte-for-byte (after trim)', async () => {
    const body = '# Haversine Formula\n\nUse R = 6371 km. The DRS threshold = 0.5 and bucket boundary is [30, 60).';
    const ingestor = createLocalProjectVerbatimIngestor(project, { llmRuntime: null });
    const result = await ingestor.ingest({ origin: { kind: 'file', path: '/docs/haversine-formula.md' }, content: body });

    expect(result.pageKey).toBe('haversine-formula');
    expect(result.outcome).toBe('written');
    const page = await readLocalKnowledgePage(project, { key: 'haversine-formula' });
    expect(page?.content).toBe(body.trim());
    expect(createHash('sha256').update(page!.content).digest('hex')).toBe(
      createHash('sha256').update(body.trim()).digest('hex'),
    );
  });

it('stores a document larger than the LLM clip limit in full', async () => {
|
||||
const body = `# Big Doc\n\n${'x'.repeat(60_000)}`;
|
||||
const ingestor = createLocalProjectVerbatimIngestor(project, { llmRuntime: null });
|
||||
await ingestor.ingest({ origin: { kind: 'file', path: '/docs/big-doc.md' }, content: body });
|
||||
|
||||
const page = await readLocalKnowledgePage(project, { key: 'big-doc' });
|
||||
expect(page!.content.length).toBeGreaterThanOrEqual(body.trim().length);
|
||||
});
|
||||
|
||||
  it('is idempotent when re-ingesting the same document', async () => {
    const body = '# Doc\n\nstable body content';
    const item = { origin: { kind: 'file' as const, path: '/docs/doc.md' }, content: body };
    const ingestor = createLocalProjectVerbatimIngestor(project, { llmRuntime: null });

    const first = await ingestor.ingest(item);
    expect(first.outcome).toBe('written');
    const second = await ingestor.ingest(item);
    expect(second.outcome).toBe('unchanged');

    const page = await readLocalKnowledgePage(project, { key: 'doc' });
    expect(page?.content).toBe(body.trim());
  });

  it('hard-errors on a different body at the same key without modifying the existing page', async () => {
    const ingestor = createLocalProjectVerbatimIngestor(project, { llmRuntime: null });
    await ingestor.ingest({ origin: { kind: 'file', path: '/docs/doc.md' }, content: '# Doc\n\nfirst body' });

    await expect(
      ingestor.ingest({ origin: { kind: 'file', path: '/docs/doc.md' }, content: '# Doc\n\nsecond body' }),
    ).rejects.toThrow(/doc/);

    const page = await readLocalKnowledgePage(project, { key: 'doc' });
    expect(page?.content).toContain('first body');
    expect(page?.content).not.toContain('second body');
  });

  it('passes through unknown frontmatter and never overwrites an explicit summary', async () => {
    const content =
      '---\nsummary: Authoritative summary\neffective_date: 2024-01-01\n---\n\n# Metric Spec\n\nbody text';
    const ingestor = createLocalProjectVerbatimIngestor(project, { llmRuntime: null });
    await ingestor.ingest({ origin: { kind: 'file', path: '/docs/metric-spec.md' }, content });

    const page = await readLocalKnowledgePage(project, { key: 'metric-spec' });
    expect(page?.summary).toBe('Authoritative summary');
    const raw = await readFile(join(projectDir, 'wiki/global/metric-spec.md'), 'utf-8');
    expect(raw).toContain('effective_date: 2024-01-01');
  });

  it('derives a degraded summary and empty tags with no LLM backend', async () => {
    const body = '# RFM Buckets\n\nRecency 1-30 days is bucket A.';
    const ingestor = createLocalProjectVerbatimIngestor(project, { llmRuntime: null });
    await ingestor.ingest({ origin: { kind: 'file', path: '/docs/rfm-buckets.md' }, content: body });

    const page = await readLocalKnowledgePage(project, { key: 'rfm-buckets' });
    expect(page?.summary).toBe('RFM Buckets');
    expect(page?.tags).toEqual([]);
    expect(page?.slRefs).toEqual([]);
  });

  it('scopes the page to a configured connection via the flag', async () => {
    project.config.connections = { db1: { driver: 'sqlite' } };
    const ingestor = createLocalProjectVerbatimIngestor(project, { llmRuntime: null });
    await ingestor.ingest({
      origin: { kind: 'file', path: '/docs/scoped.md' },
      content: '# Scoped\n\nbody',
      connectionId: 'db1',
    });

    const page = await readLocalKnowledgePage(project, { key: 'scoped' });
    expect(page?.connections).toEqual(['db1']);
  });

  it('rejects an unknown connection id and lists the configured ids', async () => {
    const ingestor = createLocalProjectVerbatimIngestor(project, { llmRuntime: null });
    await expect(
      ingestor.ingest({ origin: { kind: 'file', path: '/docs/x.md' }, content: '# X\n\nbody', connectionId: 'nope' }),
    ).rejects.toThrow(/Configured connections/);
  });

  it('errors when the flag connection disagrees with frontmatter connections', async () => {
    project.config.connections = { db1: { driver: 'sqlite' } };
    const content = '---\nconnections:\n - db2\n---\n\n# Amb\n\nbody';
    const ingestor = createLocalProjectVerbatimIngestor(project, { llmRuntime: null });
    await expect(
      ingestor.ingest({ origin: { kind: 'file', path: '/docs/amb.md' }, content, connectionId: 'db1' }),
    ).rejects.toThrow(/connection/i);
  });

  it('errors on inline text without a leading heading', async () => {
    const ingestor = createLocalProjectVerbatimIngestor(project, { llmRuntime: null });
    await expect(ingestor.ingest({ origin: { kind: 'text' }, content: 'no heading here' })).rejects.toThrow(
      /heading|--file/,
    );
  });

  it('uses LLM-generated metadata to gap-fill absent fields', async () => {
    const runtime = fakeLlmRuntime({ summary: 'LLM summary', tags: ['t1'], sl_refs: ['orders'] });
    const ingestor = createLocalProjectVerbatimIngestor(project, { llmRuntime: runtime });
    await ingestor.ingest({ origin: { kind: 'file', path: '/docs/llm-doc.md' }, content: '# LLM Doc\n\nabout orders' });

    const page = await readLocalKnowledgePage(project, { key: 'llm-doc' });
    expect(page?.summary).toBe('LLM summary');
    expect(page?.tags).toEqual(['t1']);
    expect(page?.slRefs).toEqual(['orders']);
  });

  it('fails the item on LLM error and writes no page when a backend is configured', async () => {
    const ingestor = createLocalProjectVerbatimIngestor(project, { llmRuntime: throwingLlmRuntime() });
    await expect(
      ingestor.ingest({ origin: { kind: 'file', path: '/docs/fail-doc.md' }, content: '# Fail Doc\n\nbody' }),
    ).rejects.toThrow();

    const page = await readLocalKnowledgePage(project, { key: 'fail-doc' });
    expect(page).toBeNull();
  });

  it('is findable by a body phrase via the lexical lane', async () => {
    const ingestor = createLocalProjectVerbatimIngestor(project, { llmRuntime: null });
    await ingestor.ingest({
      origin: { kind: 'file', path: '/docs/overtake.md' },
      content: '# Overtake Rule\n\nThe overtake rule grants DRS within one second.',
    });

    const results = await searchLocalKnowledgePages(project, { query: 'overtake rule grants DRS' });
    expect(results.some((result) => result.key === 'overtake')).toBe(true);
  });

  it('is findable by a topic paraphrase via the semantic lane when embeddings are enabled', async () => {
    const ingestor = createLocalProjectVerbatimIngestor(project, { llmRuntime: null });
    await ingestor.ingest({
      origin: { kind: 'file', path: '/docs/haversine.md' },
      content: '# Haversine\n\nThe haversine formula computes great-circle distance.',
    });

    const results = await searchLocalKnowledgePages(project, {
      query: 'geospatial proximity',
      embeddingService: new FakeEmbeddingPort(),
    });
    const match = results.find((result) => result.key === 'haversine');
    expect(match).toBeDefined();
    expect(match?.matchReasons).toContain('semantic');
  });
});