mirror of
https://github.com/Kaelio/ktx.git
synced 2026-07-01 08:59:39 +02:00
feat: ktx batch — scan resilience, analytics SQL craft, connector hardening (#312)
* docs: add spider2-specs handoff directory for benchmark-driven feature specs
* feat(cli): connection-scoped wiki pages
Add an optional `connections` frontmatter field so database-specific wiki
knowledge can be scoped to a connection without polluting searches about other
databases, while page keys stay a flat, globally-unique namespace.
- connections: single string or list; absent/empty ⇒ unscoped (applies to all)
- wiki_search (MCP) and `ktx wiki --connection` return unscoped ∪ matching
pages, filtered at the disk-load seam so all three search lanes draw their
candidate pool from the already-scoped set (not a post-filter)
- wiki_write accepts connections with REPLACE semantics and rejects a
connection-scoped write whose key collides with a disjoint-connection page
(data-loss guard; hard error, no silent clobber)
- explicit connection-id args (wiki_search, memory_ingest, ktx wiki) are
validated against ktx.yaml via a shared assertConfiguredConnectionId, which
also closes the prior gap where memory_ingest's connectionId was unvalidated;
persisted ids absent from config warn (not fail) in `ktx status`
- prompt guidance in the wiki_capture skill and external-ingest prompt; the
session connectionId is surfaced to the memory agent and ingest work units
Implements spider2-specs/specs/01-connection-scoped-wiki.md; intake draft moved
to spider2-specs/done/.
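The scoping rule above (unscoped ∪ matching, applied to the candidate pool before any search lane runs) can be sketched as follows. This is a minimal illustration, not the shipped code: the `WikiPage` shape and the `pageConnections` / `scopePages` helpers are hypothetical names; only the `connections` frontmatter field and its string-or-list semantics come from the commit message.

```typescript
// Hypothetical page shape; only the `connections` field is taken from the commit text.
interface WikiPage {
  key: string;
  connections?: string | string[];
}

// Normalize the frontmatter field: a single string or a list.
// Absent or empty means the page is unscoped and applies to every connection.
function pageConnections(page: WikiPage): string[] {
  const raw = page.connections;
  if (raw === undefined) return [];
  const list = Array.isArray(raw) ? raw : [raw];
  return list.filter((id) => id.length > 0);
}

// Build the candidate pool: unscoped pages plus pages scoped to the requested connection.
// Assumption of this sketch: with no connectionId, no scoping is applied at all.
function scopePages(pages: WikiPage[], connectionId?: string): WikiPage[] {
  if (connectionId === undefined) return pages;
  return pages.filter((page) => {
    const scoped = pageConnections(page);
    return scoped.length === 0 || scoped.some((id) => id === connectionId);
  });
}

const pages: WikiPage[] = [
  { key: "glossary" },
  { key: "orders-notes", connections: "warehouse_a" },
  { key: "billing-notes", connections: ["warehouse_b", "warehouse_c"] },
  { key: "empty-scope", connections: [] },
];

// Keys visible to a search scoped to warehouse_a:
// "glossary", "orders-notes", "empty-scope"
const visible = scopePages(pages, "warehouse_a").map((page) => page.key);
```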
* docs(spider2-specs): add specs/ refinement stage and composite-key join spec
Describe the todo/ → specs/ → done/ pipeline in the README (refined specs are
the durable artifact; intake drafts move to done/ on ship) and add a
MEDIUM-priority spec for multi-column composite-key join detection found during
the first sqlite smoke test.
* feat(cli): add --verbatim ingest mode for authoritative documents
Store each --text/--file document body unchanged as a GLOBAL wiki page
instead of routing it through the memory agent, which may rewrite,
condense, or re-title it. The LLM derives only metadata (summary, tags,
sl_refs) and only for frontmatter fields the document does not already
set; the stored body is written by code and never edited.
- Deterministic page key: files derive it from the filename, inline
text from its leading Markdown heading (headless inline text is
rejected — pass it as --file instead).
- Idempotent: re-running the same body is a no-op; a different body at
the same key fails loudly rather than overwriting.
- Works with llm.provider.backend: none, deriving a degraded summary
from the heading or first sentence.
- Existing frontmatter (including unmodeled fields like effective_date)
passes through untouched; --connection-id scopes the page.
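A rough sketch of the deterministic-key and idempotency rules described above. The slug format, the `VerbatimSource` type, and the in-memory `Map` store are assumptions made for illustration; the commit does not spell out how the key is normalized or where pages are stored.

```typescript
// Hypothetical input shape for a --file or --text document.
type VerbatimSource =
  | { kind: "file"; path: string; body: string }
  | { kind: "text"; body: string };

// Assumed normalization: lowercase, non-alphanumeric runs collapsed to "-".
function slugify(raw: string): string {
  return raw.trim().toLowerCase().replace(/[^a-z0-9]+/g, "-").replace(/^-+|-+$/g, "");
}

// Files derive the key from the filename; inline text from its leading Markdown heading.
function derivePageKey(source: VerbatimSource): string {
  if (source.kind === "file") {
    const base = source.path.split(/[\\/]/).pop() ?? "";
    return slugify(base.replace(/\.[^.]+$/, ""));
  }
  const firstLine = source.body.split(/\r?\n/).find((line) => line.trim().length > 0) ?? "";
  const heading = /^#{1,6}\s+(.+)$/.exec(firstLine.trim());
  if (heading === null || heading[1] === undefined) {
    throw new Error("inline text has no leading Markdown heading; pass it as --file instead");
  }
  return slugify(heading[1]);
}

type WriteOutcome = "created" | "unchanged";

// Idempotent write: same body is a no-op, a different body at the same key fails loudly.
function writeVerbatim(store: Map<string, string>, source: VerbatimSource): WriteOutcome {
  const key = derivePageKey(source);
  const existing = store.get(key);
  if (existing === undefined) {
    store.set(key, source.body); // body stored unchanged, never rewritten
    return "created";
  }
  if (existing === source.body) return "unchanged";
  throw new Error(`page "${key}" already exists with a different body`);
}
```

Failing on a body mismatch rather than overwriting keeps the stored document authoritative: a changed source has to be handled deliberately.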
* feat(cli): SQL-authoring craft and per-dialect notes tool for the analytics skill
Spec 07: add a dialect-agnostic <sql_craft> block to the ktx-analytics skill (schema discovery, composition, window-function correctness, numeric precision, answer completeness) with one worked window-then-filter example. Workflow steps gain pointers into it; existing guidance is unchanged.
Spec 08: add a read-only sql_dialect_notes MCP tool returning a connection's engine SQL conventions (FQTN form, identifier quoting/case, date/time, top-N idiom, JSON access), resolved through the existing sqlAnalysisDialectForDriver path. Notes are per-dialect markdown files under context/sql-analysis/dialects, served by the tool and copied to dist (package-internal, never installed). Non-SQL connections return a clear KtxExpectedError. The flat skill gains a one-line pointer to the tool.
Both spider2-specs intake drafts move to done/ with implementation notes.
* feat(cli): tolerate objects that fail introspection during scan
Isolate per-object introspection failures so one broken or inaccessible object no longer zeroes out a connection's whole semantic layer: the sqlite and bigquery connectors introspect each object defensively (tryIntrospectObject), the live-database adapter records a scan outcome and fetch report, and enabled_tables accepts catalog.db.name, db.name, or bare names with a clear no-match error. Includes matching ktx-daemon introspection changes, docs, and tests.
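The isolation idea can be sketched like this. `introspectAll`, `ScanOutcome`, `TableRef`, and `matchesEnabledEntry` are hypothetical names used only for illustration (the commit names `tryIntrospectObject` but does not show its signature), and the suffix-matching rule for `enabled_tables` is an assumption about how the three accepted forms could be compared.

```typescript
// Hypothetical outcome record: what was introspected and what failed, with reasons.
interface ScanOutcome<T> {
  objects: T[];
  failures: { name: string; reason: string }[];
}

// Introspect every object defensively so one broken object cannot fail the whole scan.
function introspectAll<T>(names: string[], introspect: (name: string) => T): ScanOutcome<T> {
  const outcome: ScanOutcome<T> = { objects: [], failures: [] };
  for (const name of names) {
    try {
      outcome.objects.push(introspect(name));
    } catch (error) {
      const reason = error instanceof Error ? error.message : String(error);
      outcome.failures.push({ name, reason });
    }
  }
  return outcome;
}

interface TableRef {
  catalog: string;
  db: string;
  name: string;
}

// Assumed matching rule: an enabled_tables entry matches when its dotted segments
// equal the trailing segments of catalog.db.name (so "name", "db.name", or the full form).
function matchesEnabledEntry(table: TableRef, entry: string): boolean {
  const parts = entry.split(".");
  const full = [table.catalog, table.db, table.name];
  if (parts.length > full.length) return false;
  const tail = full.slice(full.length - parts.length);
  return parts.every((part, index) => part === tail[index]);
}
```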
* docs(spider2-specs): add 06-scan-tolerate-broken-objects spec
* feat(cli): generalize analytics fan-out rule to multi-hop join chains
The ktx-analytics skill's fan-out rule only reliably caught single-hop
inflation; agents still silently fanned out on multi-hop chains where the
offending one-to-many join sits several hops below the SUM/COUNT and is easy
to miss.
Rewrite the Composition rule so the danger reads as cumulative across the whole
chain (pre-aggregate per measure-owning table), add an affirmative
grain-verification habit (default: pre-aggregate to grain; escape hatch:
COUNT(DISTINCT key) for pure counts only; SUM/AVG of a fanned-out measure must
pre-aggregate), and add one generic wrong-vs-right worked example. Content-only
and dialect-agnostic; no new tool, flag, or config.
Implements spider2-specs/specs/09 and annotates spec 07's one-example
constraint as superseded.
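To make the fan-out mechanism concrete, here is a small in-memory illustration (plain TypeScript rather than SQL, with made-up `orders` and `shipments` rows). It is not the worked example from the skill; it only shows why summing a measure after a one-to-many join inflates it and why pre-aggregating per measure-owning table does not.

```typescript
interface Order { orderId: number; customer: string; amount: number }
interface Shipment { orderId: number; trackingId: string }

// Illustrative data: order 1 has three shipments, order 2 has one.
const orders: Order[] = [
  { orderId: 1, customer: "acme", amount: 100 },
  { orderId: 2, customer: "acme", amount: 50 },
];
const shipments: Shipment[] = [
  { orderId: 1, trackingId: "t1" },
  { orderId: 1, trackingId: "t2" },
  { orderId: 1, trackingId: "t3" },
  { orderId: 2, trackingId: "t4" },
];

// Wrong: join orders to shipments first, then SUM(amount).
// Each order's amount is repeated once per shipment: 100 * 3 + 50 * 1 = 350.
let fannedOutRevenue = 0;
for (const order of orders) {
  for (const shipment of shipments) {
    if (shipment.orderId === order.orderId) fannedOutRevenue += order.amount;
  }
}

// Right: pre-aggregate shipments to the order grain, then combine one row per order.
const shipmentsPerOrder = new Map<number, number>();
for (const shipment of shipments) {
  shipmentsPerOrder.set(shipment.orderId, (shipmentsPerOrder.get(shipment.orderId) ?? 0) + 1);
}
let revenue = 0;
let shipmentCount = 0;
for (const order of orders) {
  revenue += order.amount; // 150
  shipmentCount += shipmentsPerOrder.get(order.orderId) ?? 0; // 4
}
```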
* feat(cli): add panel-completeness, time-series window, and text-encoded numeric SQL craft
Extend the analytics skill's <sql_craft> with three correctness habits and
route the dialect-specific halves through sql_dialect_notes:
- Panel completeness (spec 10): full-domain spine -> LEFT JOIN -> COALESCE for
"each/every/all/per" questions, defaulted by measure additivity.
- Time-series windows (spec 11): explicit cumulative frames, calendar-range
rolling windows with minimum-periods guards, and period-over-period via LAG.
- Text-encoded numerics (spec 12): sample distinct values, strip/scale/cast in
one early CTE, and confirm coverage with a failure-detecting cast.
Add per-dialect Series, Rolling window, and Safe cast notes to all seven
dialect files so the skill stays dialect-agnostic while the engine-specific
syntax lives in sql_dialect_notes. Tests updated and passing (19).
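The panel-completeness habit (spine, then LEFT JOIN, then COALESCE) can be illustrated in memory as below. The month labels and counts are made up, and the zero default assumes an additive measure such as a count; this mirrors the SQL pattern rather than reproducing the skill's text.

```typescript
// Full-domain spine: every month the question asks about, whether or not facts exist.
const monthSpine: string[] = ["2024-01", "2024-02", "2024-03"];

// Aggregated facts: February has no rows at all.
const ordersByMonth = new Map<string, number>([
  ["2024-01", 4],
  ["2024-03", 2],
]);

// LEFT JOIN the facts onto the spine and COALESCE missing months to 0.
const panel = monthSpine.map((month) => ({
  month,
  orders: ordersByMonth.get(month) ?? 0,
}));

// A plain GROUP BY over the facts would return only two rows and silently omit February.
const groupedOnly = Array.from(ordersByMonth.keys());
```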
* docs(spider2-specs): add specs 10-12 for analytics SQL-craft additions
Refined specs and completion records for the panel-completeness spine (10),
time-series window recipes (11), and text-encoded numeric parsing (12)
implemented in the preceding commit.
* docs(spider2-specs): add backlog intake drafts 13-14
- 13: canonical authoritative-source measures
- 14: output-completeness final check
* skill(analytics): spec 14 output-completeness + iter1 (active column planning)
Bundles two changes (entangled in SKILL.md; future spider2 iterations land as
separate commits):
- spec 14 (output-completeness): multi-part "answer every requested output" rule
+ a "Final completeness check" in workflow Step 6 and <sql_craft>; analytics
skill-content test updated; intake draft -> done/, refined spec added.
- iter1 experiment: spec 14's passive end-check did not change behavior on the
benchmark's output-completeness failures, so (a) the Plan step now writes the
exact output-column list UP FRONT as a contract the final SELECT must match,
and (b) "expose identity" -> "project BOTH the entity id and its name" (covers
both omission directions). All generic craft.
Driven by the Spider 2.0-Lite failure analysis (incomplete output was the
largest failure bucket); benchmark only as motivation.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* skill(analytics): iter2 — deterministic order in string/array aggregation
GROUP_CONCAT/string_agg/array_agg element order is undefined without an explicit
ORDER BY; also note SQLite's default text sort is binary/case-sensitive (uppercase
before lowercase) vs case-insensitive (COLLATE NOCASE). Generic SQLite craft.
Spider 2.0-Lite motivation: an ordered-ingredient-list question failed only on the
within-string element order (right elements, wrong order); benchmark as motivation only.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
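The binary versus case-insensitive ordering difference is easy to reproduce outside SQLite. The sketch below uses JavaScript string sorting as an analogy only: the default `sort()` compares UTF-16 code units (uppercase ASCII first, similar in effect to a binary collation), and the lowercase-folding comparator stands in for `COLLATE NOCASE`. The ingredient names are made up.

```typescript
const ingredients: string[] = ["cherry", "apple", "Banana"];

// Code-unit order: uppercase ASCII letters sort before lowercase ones.
const binaryOrder = [...ingredients].sort().join(", ");
// "Banana, apple, cherry"

// Case-insensitive order: compare after folding to lowercase.
const noCaseOrder = [...ingredients]
  .sort((a, b) => {
    const left = a.toLowerCase();
    const right = b.toLowerCase();
    return left < right ? -1 : left > right ? 1 : 0;
  })
  .join(", ");
// "apple, Banana, cherry"
```

Either order can be the right one; the point is that an aggregated string needs an explicit, deliberately chosen ordering to be deterministic.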
* feat(mcp): structured, leveled logging for the MCP server
Add one synchronous pino logger per MCP server process, written through the
io.stderr sink: plain JSON when stderr is not a TTY, colorized pino-pretty
(sync, in-process) when it is. Every tool call logs tool.start with its raw
params BEFORE the handler runs and tool.end after (info / warn past
KTX_MCP_SLOW_TOOL_MS / error), correlated by callId plus sessionId, so a
runaway sql_execution leaves a recoverable start line with its exact SQL and
no matching end. HTTP logs session.open/close and wires the previously-dead
transport.onerror to transport.error; stdio routes its transport error
through the logger. Level via KTX_MCP_LOG_LEVEL (default info). Existing
mcp_request_completed telemetry and registerParsedTool are unchanged; no
worker/async transport and no redaction in v1 (logs are local-only).
Implements spider2-specs/specs/15-mcp-server-structured-logging.md and moves
the intake draft to done/.
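The start-before-handler / end-after pattern might look roughly like the sketch below. It deliberately avoids pino so it stays self-contained: `LogLine`, `loggedToolCall`, and the injected `log` and `now` functions are hypothetical, and only the event names (`tool.start`, `tool.end`), the `callId` correlation, and the info / warn / error levels come from the commit message.

```typescript
type Level = "info" | "warn" | "error";

// Hypothetical log record; field names other than the event names are assumptions.
interface LogLine {
  level: Level;
  event: "tool.start" | "tool.end";
  callId: string;
  tool: string;
  params?: unknown;
  durationMs?: number;
  error?: string;
}

interface CallOptions {
  callId: string;
  tool: string;
  params: unknown;
  slowMs: number; // threshold past which tool.end is logged at warn
  now: () => number; // injected clock, so the sketch is testable
}

async function loggedToolCall<T>(
  log: (line: LogLine) => void,
  options: CallOptions,
  handler: () => Promise<T>,
): Promise<T> {
  const { callId, tool, params, slowMs, now } = options;
  const startedAt = now();
  // Logged BEFORE the handler runs: a hung call still leaves its params on record.
  log({ level: "info", event: "tool.start", callId, tool, params });
  try {
    const result = await handler();
    const durationMs = now() - startedAt;
    log({ level: durationMs > slowMs ? "warn" : "info", event: "tool.end", callId, tool, durationMs });
    return result;
  } catch (error) {
    const message = error instanceof Error ? error.message : String(error);
    log({ level: "error", event: "tool.end", callId, tool, durationMs: now() - startedAt, error: message });
    throw error;
  }
}
```

A start line with no matching end line for the same `callId` is then the signature of a call that never returned.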
* feat(mcp): report uptimeMs in MCP server /health
The /health endpoint now includes uptimeMs (monotonic elapsed time since
the server started), mirroring the Python daemon's uptime_ms telemetry
field.
* feat(cli): bound read-query execution with a per-connection deadline
Enforce one shared query deadline (default 30s, overridable per connection via
query_timeout_ms) on every executeReadOnly path, so an accidentally-expensive
LLM-authored query returns a fast "query exceeded Ns" KtxQueryError instead of
hanging the MCP server.
- New shared contract context/connections/query-deadline.ts
(resolveQueryDeadlineMs, queryDeadlineExceededError); query_timeout_ms added to
the shared warehouse schema; BigQuery's job_timeout_ms removed.
- SQLite runs the read query in a short-lived forked child process and enforces
the deadline with SIGKILL. worker_threads + terminate() was tried first but
cannot interrupt a synchronous better-sqlite3 scan (the native loop never
yields); SIGKILL reclaims the process in ~2ms and keeps the event loop free.
- Remote connectors apply a real server-side statement timeout and re-wrap their
own timeout signal as KtxQueryError: Postgres statement_timeout/57014, MySQL
max_execution_time/3024, Snowflake STATEMENT_TIMEOUT_IN_SECONDS/604, ClickHouse
max_execution_time + aligned request_timeout/159, SQL Server requestTimeout/
ETIMEOUT, BigQuery jobTimeoutMs.
- Relationship validation skips a candidate to review on a deadline timeout
instead of aborting the pass; the deadline surfaces through the existing MCP
pino logger as a matched tool.start/tool.end(error) pair (no new logging code).
Also fixes a pre-existing, unrelated invalid cast in mcp-server-factory.test.ts
that was breaking tsc -p tsconfig.test.json.
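A minimal sketch of the shared deadline contract, assuming signatures the commit does not show. `resolveQueryDeadlineMs`, `queryDeadlineExceededError`, `KtxQueryError`, and `query_timeout_ms` are names from the commit message; their parameters, the fallback for invalid values, and the exact message format are assumptions.

```typescript
// Default from the commit message: 30s unless the connection overrides it.
const DEFAULT_QUERY_DEADLINE_MS = 30000;

// Stand-in for the project's error type; the real class may carry more fields.
class KtxQueryError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "KtxQueryError";
  }
}

// Assumed config shape: only query_timeout_ms matters for this sketch.
interface ConnectionConfig {
  query_timeout_ms?: number;
}

// Use the per-connection override when it is a positive finite number, else the default.
function resolveQueryDeadlineMs(connection: ConnectionConfig): number {
  const configured = connection.query_timeout_ms;
  if (typeof configured === "number" && isFinite(configured) && configured > 0) {
    return configured;
  }
  return DEFAULT_QUERY_DEADLINE_MS;
}

// One shared error shape so every connector reports a timeout the same way.
function queryDeadlineExceededError(deadlineMs: number): KtxQueryError {
  return new KtxQueryError(`query exceeded ${deadlineMs / 1000}s`);
}
```

Each connector would then translate its own timeout signal (for example a Postgres `57014` error) into the result of `queryDeadlineExceededError`, so callers see one error regardless of engine.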
* docs(spider2-specs): mark spec 16 (bounded query execution) done
Append Implementation notes to the refined spec (what shipped, where, and the
worker-thread -> child-process+SIGKILL deviation with its evidence) and move the
intake draft from todo/ to done/.
* skill(analytics): iter3 — measure-as-amount, inter-event gap, top-per-metric career
Three generic interpretation rules: a named business measure (sales/revenue/spend)
means its amount not a row count; "inter-event duration/gap" is LAG/LEAD time-between
events not a magnitude column; "highest across several achievements" aggregates per
metric over the whole history. All three demonstrably FIRE (verified on local008/003/152
SQL). local008 flips to correct (mechanism-aligned). 003/152 still fail on a different
axis (source-column / grouping). Generic craft; benchmark only as motivation.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* skill(analytics): spine-for-extreme-selection + aggregate-over-selected-set
Two generic answer-completeness refinements:
- Selecting the extreme group (lowest/highest count over a period/category
domain) must rank over the COMPLETE spine, not only groups with fact rows —
an empty period is a genuine 0 and often the true minimum.
- An aggregate scoped to a per-entity selected set ('avg revenue per actor in
those top-3 films') is computed ACROSS that set, distinct from the per-item
value; project both.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* skill(analytics): iter2 — sharpen extreme-selection spine + top-N ranking-measure
- spine-for-extreme: concrete cue that a zero-row period never appears in a
GROUP BY of the facts; generate the full calendar, LEFT JOIN, COALESCE, then rank.
- aggregate-over-selected-set: top-N selection ranks by the named ranking measure
(the item's own revenue), independent of the per-item share that feeds the aggregate.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* skill(analytics): iter3 — comparison-between-two-extremes is one wide row
Distinguishes a cross-item comparison ('the difference between the highest and
lowest month' -> single wide row, both extremes side by side + the comparison
column) from 'report a metric for each group' (-> stays long). Generic, question-
derived; targets the wide-vs-long shape gap without affecting per-group long output.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* skill(analytics): iter4 — anchor a period bucket to the named lifecycle event
When a record carries multiple lifecycle timestamps (created/placed, approved,
shipped, delivered, completed, settled) and the question counts/measures records
in a named *completed state* by period ("delivered orders by month", "shipped
items per week"), bucket the period by that named event's own timestamp, not the
record-creation timestamp; the state value is the qualifying filter, the matching
timestamp is the time anchor. Wording priority is explicit — purchased/placed/
created/submitted/ordered keep the start-event timestamp — and a non-temporal
state filter (counts by customer/city/seller with no period) introduces no anchor.
Generic analytics craft: counting completed-state records by their creation date
silently answers "records that later reached that state, grouped by when they
started" instead of the question asked. Surfaced via the spider2-autofix loop;
FAIR_PRODUCT (adversary-screened, restatable from question wording + schema/
semantic-layer lifecycle descriptions, no gold dependency).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* skill(analytics): iter5 — canonicalize observed URL-path variants before page-level analysis
When a question groups/filters/sequences web pages by a path/url column, sample
its distinct values; if the data itself shows /route and /route/ variants for the
same page context, canonicalize in an early CTE (preserve / as root, strip trailing
slashes from non-root paths, map an observed empty path to / only when the column is
a URL path with blank root-page events) and use the canonical path everywhere above.
Explicitly forbids inventing aliases the data doesn't show: no merging different
route names, no stripping query/fragment/host/scheme, no lowercasing, and no
canonicalization when the question asks for raw URL/path or slash-vs-no-slash diffs.
Generic web-analytics craft: raw request logs routinely store the same user-visible
page with and without a trailing slash, so grouping raw labels silently splits one
page into several. Surfaced via the spider2-autofix loop (Codex runner, round r2);
FAIR_PRODUCT (adversary-screened, restatable from URL-path semantics + page-grain
question wording + solver-observed distinct values, no gold dependency). The rule
fired mechanism-aligned on both targets; flipped local330 (landing/exit page counts),
local331 residual is a separate sequence-semantics axis beyond canonicalization.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
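The canonicalization rules listed above translate into a small pure function. This TypeScript version is only an illustration of the rule set (the skill itself expresses it as an early SQL CTE); `canonicalizePath` and its `mapEmptyToRoot` option are hypothetical names.

```typescript
interface CanonicalizeOptions {
  // Only enable when the column is a URL path with observed blank root-page events.
  mapEmptyToRoot: boolean;
}

function canonicalizePath(path: string, options: CanonicalizeOptions): string {
  if (path === "") return options.mapEmptyToRoot ? "/" : path;
  if (path === "/") return "/"; // preserve the root as-is
  // Strip trailing slashes from non-root paths; nothing else is touched
  // (no lowercasing, no query/fragment/host/scheme stripping).
  const stripped = path.replace(/\/+$/, "");
  return stripped === "" ? "/" : stripped;
}
```

For example, `/route/` and `/route` collapse to the same label, while `/Route` and `/route?x=1` stay distinct because the data never showed them to be the same page.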
* skill(analytics): iter6 — coverage over a selected group is a set-membership aggregate
When a question first selects a group of entities ("the top 5 actors", "these
products") and then asks what count/share/percentage of a DIFFERENT subject domain
relates to *these* selected entities ("what % of customers rented films featuring
these actors"), the subject set is the UNION across the whole group: count DISTINCT
subject ids once across the selected entities and return one collective value at the
subject-domain grain — not one row per selected entity (which double-counts subjects
related to more than one entity and answers a different question). Narrowly guarded:
emit one row per entity only when the wording says "for each / per / by / list" or
asks for each entity's own metric ("top 5 players and their batting averages").
The collective-coverage cousin of the existing per-entity selected-set rule. Generic
analytics craft (per-entity metric vs set-level coverage). Surfaced via the
spider2-autofix loop (Codex runner, round r3); FAIR_PRODUCT (adversary-screened,
restatable from wording alone, no gold dependency). Flipped local195 mechanism-aligned
(union COUNT(DISTINCT customer)/total, one scalar); 0 regression across 5 passing
per-entity top-N guards (local023/024/029/212/221 stayed long).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
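The difference between set-level coverage and summed per-entity counts shows up even on a handful of rows. The rental data below is invented for illustration; the logic mirrors a `COUNT(DISTINCT customer)` over the union of the selected actors divided by the total customer count.

```typescript
interface Rental { customer: string; actor: string }

// Illustrative rows: c1 rented films featuring both selected actors.
const rentals: Rental[] = [
  { customer: "c1", actor: "A" },
  { customer: "c1", actor: "B" },
  { customer: "c2", actor: "B" },
  { customer: "c3", actor: "C" },
];
const allCustomers: string[] = ["c1", "c2", "c3", "c4"];
const selectedActors = new Set<string>(["A", "B"]);

// Collective coverage: each customer is counted once across the whole selected group.
const coveredCustomers = new Set<string>();
// Per-entity counts added up: double-counts customers related to more than one actor.
let summedPerActor = 0;
for (const rental of rentals) {
  if (selectedActors.has(rental.actor)) {
    coveredCustomers.add(rental.customer);
    summedPerActor += 1;
  }
}

const coverageShare = coveredCustomers.size / allCustomers.length; // 2 / 4 = 0.5
```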
* skill(analytics): label-only joins must LEFT JOIN — incomplete dims silently drop fact rows
Mirror of the existing fan-out rule for the DROP direction: an inner JOIN to a
dimension table used only to attach a display attribute silently discards every
fact row whose key has no parent when the dimension is incomplete (trimmed
catalogs, late-arriving / SCD-gap rows), shrinking counts/sums and the universe
over which shares/averages/medians are computed. Guidance: LEFT JOIN pure
enrichment; inner-join a dimension only when intended as a filter; key the
aggregate/GROUP BY on the fact column, not the dimension column.
Spider2 autofix round 'joindim': flips complex_oracle local050 (FAIL->PASS,
official scorer) — solver dropped the gratuitous products inner-join and
recovered the exact gold. local060/063 also adopt LEFT JOIN (rule fires) but
remain gold-convention-blocked. Guards local061/067 held.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* docs(spider2-specs): add todo/17 — lifecycle-event metrics (semantic-layer)
Draft intake spec surfaced by the spider2-autofix loop (round r1): the model-layer
form of the shipped iter4 lifecycle-date-anchoring skill rule — infer per-state
lifecycle-event metrics (e.g. delivered_orders with defaultTimeDimension = the
delivery timestamp) during enrichment so the correct time anchor is the default for
any consumer, not only an agent that loaded the skill. Generic; FAIR_PRODUCT.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* fix(connectors): accept leading underscore in connection/identifier ids
The safe-identifier validator regex /^[a-zA-Z0-9][a-zA-Z0-9_-]*$/ allowed an
underscore everywhere except the first character, so a connection id / database
name that legitimately starts with '_' (valid in Snowflake, e.g. _1000_GENOMES)
could never be ingested or queried. Allow a leading underscore across all 16
duplicated validators (connection ids, source ids, page/wiki keys, warehouse-
verification tool schemas). Path-safety is unaffected — '.' and '/' remain
excluded, and assertSafePathToken still blocks traversal.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
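The behavioral change can be checked directly against the regex quoted above. The relaxed pattern below is an assumption about what "allow a leading underscore" looks like; the commit gives only the previous pattern.

```typescript
// Previous validator, quoted in the commit message.
const previousSafeId = /^[a-zA-Z0-9][a-zA-Z0-9_-]*$/;

// Assumed relaxed form: "_" is also accepted as the first character.
const relaxedSafeId = /^[a-zA-Z0-9_][a-zA-Z0-9_-]*$/;

const snowflakeDb = "_1000_GENOMES";
const rejectedBefore = !previousSafeId.test(snowflakeDb); // true
const acceptedNow = relaxedSafeId.test(snowflakeDb); // true

// Path-like tokens stay rejected because "." and "/" are still excluded.
const traversalBlocked = !relaxedSafeId.test("../etc") && !relaxedSafeId.test("a/b");
```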
* feat(analytics): generic geospatial query guidance
Add a Snowflake ST_* dialect note (ST_MAKEPOINT lon-first, ST_DWITHIN/ST_CONTAINS/
ST_WITHIN/ST_INTERSECTS, bbox->polygon via ST_MAKEPOLYGON/ST_MAKELINE) and a
dialect-agnostic 'Spatial predicates' recipe in the analytics skill (resolve the
entity geometry, build an area-of-interest polygon, test with the engine's
containment/proximity/overlap predicate; mind lon/lat argument order). Steers the
solver off hand-rolled lat/lon BETWEEN boxes toward correct, index-assisted
geospatial predicates.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* feat(analytics): parse code/dependency text by language grammar
Add two generic <sql_craft> rules: (1) parse imported/required/loaded packages by
the language or manifest format (Java import keep-package-path allowing underscores/
mixed-case; Python import/from + alias stripping; R library/require; .ipynb parse
JSON cell source before language rules; JSON manifests flatten the dependency object
keys), stripping comments/prose and splitting multi-import lines; (2) on a
de-duplicated table with a documented copy/occurrence count, choose COUNT(*) vs the
weight column from the population the question names, not silently. Steers off one
broad regex that drops valid identifiers and matches prose.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* feat(analytics): source filters/dates/measures from the owning fact grain
Add a <sql_craft> rule for joined fact tables at different grains (parent order
vs child line item): read each predicate, calendar bucket, and measure from the
table whose grain the question names, not whichever is in scope post-join. An
order-grain filter ("orders that are Complete", "the order's creation date")
must come from the parent even though the child carries its own status/created_at;
line price/cost come from the child. Mirror at metric grain: don't combine a
parent-grain count with child rows (num_of_item * SUM(line_price) per line) —
aggregate each measure at its own grain before combining.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* feat(analytics): collapse multi-valued classes to one representative per entity before counting/concentration
When an entity carries a multi-valued classification array (IPC/CPC codes, tags)
and the methodology counts entities-per-class or a concentration/diversity metric
(HHI, originality, share), pick ONE representative per entity first (the array's
main/primary/first flag, else a defined fallback like most-frequent), then
aggregate; and use COUNT(DISTINCT entity) when the denominator is defined as a
count of entities. Unnesting the array otherwise multiplies an entity's weight by
its code count, inflating per-class frequencies and skewing the ranking/score.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* feat(connectors): introspect BigQuery datasets hosted in foreign projects
A dataset_ids/dataset_id entry may now be written `project.dataset` to
introspect a dataset hosted in another project while query jobs still bill to
credentials.project_id. Entries are parsed once at the config boundary into
canonical {project, dataset} pairs; introspection, primary-key discovery,
testConnection, getTableRowCount, and listTables (grouped per project) all
resolve in the dataset's own project, and scanned tables are labeled with that
project so sampling, distinct-value, and read queries resolve. Bare entries are
unchanged.
Implements spider2-specs/specs/18-bigquery-cross-project-datasets.md.
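Parsing entries once at the config boundary could look like the sketch below. `DatasetRef` and `parseDatasetEntry` are hypothetical names, and the error handling is an assumption; the sketch also does not handle project ids that themselves contain a dot.

```typescript
// Canonical pair produced once at the config boundary.
interface DatasetRef {
  project: string;
  dataset: string;
}

// "dataset" resolves in the billing project; "project.dataset" resolves in the named project.
function parseDatasetEntry(entry: string, billingProject: string): DatasetRef {
  const [first, second, ...rest] = entry.trim().split(".");
  if (!first || rest.length > 0) {
    throw new Error(`invalid dataset entry "${entry}": expected "dataset" or "project.dataset"`);
  }
  if (second === undefined) {
    return { project: billingProject, dataset: first }; // bare entry, unchanged behavior
  }
  if (second === "") {
    throw new Error(`invalid dataset entry "${entry}": missing dataset after the project`);
  }
  return { project: first, dataset: second };
}
```

Downstream code then works only with `{project, dataset}` pairs, so introspection can target the dataset's own project while query jobs keep billing to `credentials.project_id`.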
* feat(scan): durable, resumable, bounded relationship detection during enrichment
Move the enrichment persistence boundary to the cost boundary and bound the
open-ended relationship stage (spec 19).
- Checkpoint descriptions + embeddings into the queryable `_schema` manifest
(and the raw enrichment artifacts) before relationship detection runs, via a
new `onCheckpoint` hook + `writeLocalScanEnrichmentCheckpoint`. An interrupted,
budget-truncated, or failed relationship stage now degrades to "no joins",
never "no descriptions".
- Resume the enrichment cache by content identity: re-key the SQLite stage store
on `(connection_id, stage, input_hash)` so a re-run with a fresh runId resumes
finished descriptions/embeddings instead of re-paying for LLM work. The
disposable cache recreates its table if the on-disk key shape differs.
- Make the relationship stage observable and bounded: a sticky wall-clock budget
(`scan.relationships.detectionBudgetMs`, default 600000 ms) + per-unit progress
+ honored `ctx.signal`, threaded through profiling, validation, and composite
detection. On exhaustion/abort it stops scheduling, finalizes, and returns a
partial result instead of throwing or hanging.
- Mark a budget/abort-truncated result partial (diagnostics `partial`/`partialReason`
+ recoverable `relationship_detection_partial` warning). A graceful partial saves
as a completed stage and resumes cheaply; raising the budget changes inputHash
and forces a fresh, fuller run. A process killed mid-stage saves nothing.
Document `detectionBudgetMs` in the ktx.yaml reference. Append implementation
notes to specs/19 and move the intake draft to done/.
Also carries the in-tree per-table enrichment LLM timeout work it builds on
(`description-generation.ts` + the `enrichment_timeout` warning code), which is
intertwined in `local-enrichment.ts`/`types.ts` and cannot be split into a
separately-building commit.
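The "stop scheduling, finalize, return a partial result" behavior can be sketched as a budgeted loop. Everything here is illustrative: `detectWithBudget`, the `isAborted` callback (standing in for `ctx.signal`), the injected clock, and the reason strings are assumptions; only the `partial` / `partialReason` field names and the idea of a wall-clock budget come from the commit message.

```typescript
interface DetectionResult<T> {
  results: T[];
  partial: boolean;
  partialReason?: "budget_exhausted" | "aborted"; // assumed reason values
}

interface BudgetOptions {
  budgetMs: number;
  now: () => number; // injected clock for testability
  isAborted?: () => boolean; // stand-in for checking an abort signal
}

function detectWithBudget<I, T>(
  units: I[],
  detectUnit: (unit: I) => T,
  options: BudgetOptions,
): DetectionResult<T> {
  const startedAt = options.now();
  const results: T[] = [];
  for (const unit of units) {
    // Check before scheduling each unit; once a stop condition holds, nothing new starts.
    if (options.isAborted !== undefined && options.isAborted()) {
      return { results, partial: true, partialReason: "aborted" };
    }
    if (options.now() - startedAt >= options.budgetMs) {
      return { results, partial: true, partialReason: "budget_exhausted" };
    }
    results.push(detectUnit(unit));
  }
  return { results, partial: false };
}
```

Returning the work done so far, instead of throwing, is what lets a truncated stage be saved as completed-but-partial and resumed cheaply.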
* feat(scan): bound + retry the per-table enrichment LLM call
The batched table-description call had no retry (sampleTable retried 3x, this did
not), so a single transient backend error (e.g. an overloaded/burst rejection when
many tables enrich concurrently) silently nulled a whole table's descriptions —
observed dropping ~70% of a db's tables during a bad window despite ample quota.
- Wrap generateObject in retryAsync (3 attempts + backoff; KTX_ENRICH_LLM_ATTEMPTS).
- Fresh per-attempt timeout (KTX_ENRICH_LLM_TIMEOUT_MS, default 120s) still bounds a
wedged wide table; a timeout is surfaced as KtxAbortedError so it is NOT retried
(one wedge stays one timeout, not 3x).
- Granular per-table progress + start/done/retry/timeout logging.
Composes with spec 19 (its non-goal #1): spec 19 makes completed descriptions durable;
this makes more of them complete.
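The retry policy described above (retry transient errors, never retry a timeout) can be sketched as follows. The `retryAsync` signature here is hypothetical, as is the `KtxAbortedError` stand-in class and the injected `backoff` function; the commit names `retryAsync` and `KtxAbortedError` but does not show their definitions.

```typescript
// Stand-in for the project's aborted/timeout error; the real class may differ.
class KtxAbortedError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "KtxAbortedError";
  }
}

function isAbortedError(error: unknown): boolean {
  return error instanceof Error && error.name === "KtxAbortedError";
}

// Hypothetical signature: run fn up to `attempts` times, waiting between failures.
async function retryAsync<T>(
  attempts: number,
  fn: (attempt: number) => Promise<T>,
  backoff: (attempt: number) => Promise<void>,
): Promise<T> {
  let lastError: unknown = new Error("retryAsync called with no attempts");
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      return await fn(attempt);
    } catch (error) {
      // A timeout surfaces as an aborted error and is rethrown immediately:
      // one wedged table costs one timeout, not attempts x timeout.
      if (isAbortedError(error)) throw error;
      lastError = error;
      if (attempt < attempts) await backoff(attempt);
    }
  }
  throw lastError;
}
```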
* feat(scan): survive a hung LLM enrichment backend and resume descriptions
Two compounding failure modes on the per-table description-enrichment path (spec 20):
Enforced per-table timeout for subprocess backends. The runtime declares whether it owns an SDK subprocess (subprocessForkSpec on KtxLlmRuntimePort); codex/claude-code calls run behind a ktx-owned detached child that is tree-killed (SIGKILL of the process group on POSIX, taskkill /T on Windows) on the deadline or ctx.signal, reaping the wedged model grandchild. HTTP backends keep native fetch abort. Default stays 120s, one-wedge-one-timeout.
Incremental, resumable descriptions persistence. generateDescriptions flushes enriched tables per batch to an inputHash-tagged durable record (at a stable, non-syncId path) plus only the changed manifest shards, skips already-enriched tables on resume, and never lets one table's failure discard the stage (a skipped table costs one missing description, not the whole stage's output).
Spec 20 refined + intake draft moved to done/.
* feat(scan): selective enrichment stages (--stages) + per-stage cache keys
Split the single coarse enrichment cache key into per-stage hashes
(descriptions <- snapshot + LLM identity; embeddings <- snapshot + embedding
identity + description digest; relationships <- snapshot + relationship settings
+ LLM identity), so changing one stage's inputs invalidates only that stage and
never throws away the expensive per-table descriptions on an unrelated edit.
Add `ktx ingest --stages <list>` to force-re-run a chosen subset on an
already-ingested connection: a named stage bypasses the completed-stage
short-circuit while the per-table descriptions resume record still skips
already-enriched tables, and unselected stages are left untouched on disk. Feed
embeddings + relationships their description context from the on-disk _schema
when descriptions do not run this invocation, and carry descriptions into the
llmProposals evidence packet (closing a latent gap on the full-run path too).
Surface an enrichment_stage_stale warning when an unselected stage's inputs have
drifted, rather than silently cascading the work.
Implements spider2-specs/specs/21-selective-enrichment-stages.md.
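The per-stage key split can be illustrated with plain strings. In this sketch the key is just the JSON of each stage's inputs (a real implementation would presumably hash it), and `StageInputs`, `stageKeys`, and `staleStages` are hypothetical names; the input sets per stage follow the commit message.

```typescript
type Stage = "descriptions" | "embeddings" | "relationships";

// Hypothetical, string-valued stand-ins for the real inputs.
interface StageInputs {
  snapshot: string;
  llmIdentity: string;
  embeddingIdentity: string;
  descriptionDigest: string;
  relationshipSettings: string;
}

// Each stage's key depends only on the inputs that stage actually consumes.
function stageKeys(inputs: StageInputs): Record<Stage, string> {
  return {
    descriptions: JSON.stringify([inputs.snapshot, inputs.llmIdentity]),
    embeddings: JSON.stringify([inputs.snapshot, inputs.embeddingIdentity, inputs.descriptionDigest]),
    relationships: JSON.stringify([inputs.snapshot, inputs.relationshipSettings, inputs.llmIdentity]),
  };
}

// Stages whose key changed between two runs are the only ones invalidated.
function staleStages(before: StageInputs, after: StageInputs): Stage[] {
  const previous = stageKeys(before);
  const current = stageKeys(after);
  const stages: Stage[] = ["descriptions", "embeddings", "relationships"];
  return stages.filter((stage) => previous[stage] !== current[stage]);
}
```

With a single coarse key, editing a relationship setting would have invalidated descriptions as well; with per-stage keys only `relationships` goes stale.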
* test(analytics): realign SKILL.md acceptance test with the evolved skill
Three assertions in analytics-skill-content.test.ts drifted from the analytics
SKILL.md as later iterations edited the skill without updating the test:
- the sub-heading was renamed Window functions -> Ordering & aggregation
determinism (iter2), so follow the source name;
- the rule "Expose identity, not just the label" was renamed to "Project BOTH
identity and label" (spec 14), so match the new wording;
- the dialect-FQTN guard false-positived on the Java package example
com.planet_ink.coffee_mud, whose backticks made a 3-segment package path read
as a BigQuery/Snowflake `a.b.c` table reference. Drop the backticks so the
guard stays at full strength without weakening it.
* fix(scan): --stages subset must not delete unselected stages' on-disk artifacts
A --stages subset that omitted descriptions wiped all on-disk ai/db descriptions
from the written _schema. runLocalScan writes the structural manifest shard from
the bare snapshot BEFORE enrichment runs, and the shard merge treats ai/db as
scan-managed and overwrites them with whatever the run emits — none, on a subset
that skips descriptions. Enrichment then read the already-wiped shard via
loadPriorDescriptions and had nothing to restore.
runLocalScanEnrichment now returns the best-available descriptions (fresh-this-run
if descriptions ran, else loaded from the on-disk _schema) instead of [], and
runLocalScan captures the prior descriptions before the structural write and feeds
them to both the structural write and enrichment, so an unselected stage's
artifacts survive. Joins were already preserved for --stages descriptions via the
manual/inferred preservedJoins path.
Tests: a full runLocalScan --stages relationships path test (RED without the fix,
GREEN with it — the earlier unit test missed the structural-pre-write ordering),
plus enrichment-layer contract tests for both directions. Validated live on
northwind: --stages relationships keeps all 110 descriptions + 22 joins (was
wiping to 0); --stages descriptions restores descriptions from the spec-20 resume
record (no LLM calls) while keeping joins.
* feat(dialects): bigquery nested-data (ARRAY/STRUCT/UNNEST), geospatial (GEOGRAPHY), SAFE_DIVIDE
bigquery.md lacked the two sections that define BigQuery analytics (present in snowflake.md):
- Nested & repeated data: UNNEST to flatten arrays of STRUCTs (GA360 hits, GA4 event_params),
dot-notation field access, key-value param scalar-subquery extraction, fan-out/COUNT(DISTINCT) guard.
- Geospatial (GEOGRAPHY): ST_GEOGPOINT (lon-first), containment/proximity/distance/intersection
predicates, areal allocation via ST_AREA(ST_INTERSECTION()).
- SAFE_DIVIDE for zero-denominator-safe rates; sharded-table shard-presence note.
Generic BigQuery craft surfaced by sql_dialect_notes; product-completeness (any BQ analyst benefits).
* feat(dialects): sqlite ROUND half-up FP-underflow note (+1e-9 before ROUND)
SQLite ROUND(x,n) rounds half away from zero, but binary floating point often
stores a decimal half-way value slightly below the true half, so ROUND(6.475,2)
returns 6.47, not 6.48. Add a dialect note: add a tiny epsilon (1e-9, well below
display precision) before rounding for deterministic half-up, leaving
non-boundary values unchanged.
Generic SQLite craft surfaced by sql_dialect_notes (any analyst rounding a
displayed average/rate/price benefits).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
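The underlying floating-point effect is not specific to SQLite and can be observed in any IEEE 754 double. The sketch below uses JavaScript's `toFixed` purely as a way to inspect the stored value; it is not SQLite's `ROUND`, and the epsilon is the 1e-9 value from the note above.

```typescript
const halfway = 6.475;

// The literal 6.475 is stored as the nearest double, which lies just below 6.475,
// so rounding the stored value to two decimals goes down.
const naive = halfway.toFixed(2); // "6.47"

// Nudging by an epsilon far below display precision lifts the value past the boundary.
const nudged = (halfway + 1e-9).toFixed(2); // "6.48"

// A value that is not on a rounding boundary is unaffected by the nudge.
const unaffected = (6.471 + 1e-9).toFixed(2); // "6.47"
```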
* docs(analytics): list-as-delimited-string, answer-literally, drop free-text columns
Add SKILL.md guidance to emit list-valued answer cells as delimited
STRING (not ARRAY/repeated column), answer the literal ask without
unrequested transformations (HAVING for aggregate bounds), and avoid
projecting unrequested free-text columns that corrupt row-delimited output.
* fix(scan,mcp): gitignore runtime logs, budget-guard LLM proposal, validate enrich timeout
- gitignore `.ktx/logs/` in both scaffold + setup-merge lists: the managed MCP
daemon writes raw tool params (SQL, memory_ingest content) to mcp.log under a
version-controlled `.ktx/`, and snowflake.log already sat there unprotected.
- gate the LLM relationship proposal on the detection budget/abort signal so an
exhausted or aborted stage cannot start a fresh LLM call; document the boundary.
- validate KTX_ENRICH_LLM_TIMEOUT_MS (NaN/0 → 120s default) like enrichAttempts,
so a bad value no longer times out every table immediately.
- daemon introspection now warns on malformed column/FK rows instead of dropping
them silently, matching the table-row path and the "surface broken objects" goal.
- docs: document `ktx wiki -c/--connection`; fix the SQLite query-deadline schema
doc (forked-subprocess SIGKILL, not worker-thread termination).
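
The timeout validation described above could look roughly like the following. This is a minimal sketch under stated assumptions: `resolveEnrichLlmTimeoutMs` is a hypothetical name, and treating negative or non-finite values like NaN/0 is an assumption rather than something the commit message states.

```typescript
// 120s default named in the note above.
const DEFAULT_ENRICH_LLM_TIMEOUT_MS = 120_000;

// Hypothetical parser; the real ktx helper may differ in name and edge cases.
function resolveEnrichLlmTimeoutMs(raw: string | undefined): number {
  const parsed = Number(raw); // undefined and "abc" become NaN, "" and "0" become 0
  // Assumption: anything that is not a finite positive number falls back to the default.
  return Number.isFinite(parsed) && parsed > 0 ? parsed : DEFAULT_ENRICH_LLM_TIMEOUT_MS;
}

console.log(resolveEnrichLlmTimeoutMs('abc')); // falls back to the default
console.log(resolveEnrichLlmTimeoutMs('0')); // falls back to the default
console.log(resolveEnrichLlmTimeoutMs('5000')); // keeps the configured value
```

Falling back rather than throwing keeps a misconfigured environment variable from failing the whole scan, which matches the "no longer times out every table immediately" intent.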
* fix(scan,wiki,mcp): address PR #312 review findings
- scan: key the description pipeline (resume map, enriched-schema and
embedding-text lookups, manifest write/read) by full table identity via
tableRefKey/buildTableRef, so two same-named tables in different schemas no
longer cross-assign descriptions or skip a sibling on resume
- scan: re-throw a genuine context cancel during the batched description LLM
call so Ctrl-C resumes the stage instead of nulling tables and recording it
completed; per-table timeouts still degrade (context.signal not aborted)
- scan: report statisticalValidation 'skipped' (not 'completed') when a
budget/abort stop leaves relationship profiling partial
- wiki: sync the full page corpus into the sqlite index and filter only the
candidate/result set, so a connection-scoped search no longer prunes other
connections' pages and cached embeddings from the shared index
- wiki: route verbatim ingest through the canonical writePageAndSync so
contentHash is set and later syncs can short-circuit
- mcp: drop the as-unknown-as cast in serializeMcpError
- dialects/analytics: document the integer-division trap on postgres/sqlite/tsql
Adds regression tests for each behavior change.
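
To illustrate the first scan fix in the list above, here is a small sketch of why keying by bare table name cross-assigns descriptions. `SketchTableRef` and `sketchTableRefKey` are hypothetical stand-ins; the actual `tableRefKey`/`buildTableRef` signatures are not shown in this commit message.

```typescript
// Hypothetical shape; the real table reference type in ktx may carry other fields.
interface SketchTableRef {
  catalog?: string;
  db?: string;
  name: string;
}

// Hypothetical stand-in for tableRefKey: encode every identity segment, not just the name.
function sketchTableRefKey(ref: SketchTableRef): string {
  return JSON.stringify([ref.catalog ?? null, ref.db ?? null, ref.name]);
}

const salesOrders: SketchTableRef = { db: 'sales', name: 'orders' };
const archiveOrders: SketchTableRef = { db: 'archive', name: 'orders' };

// Keyed by bare name, the second description overwrites the first.
const byName = new Map<string, string>();
byName.set(salesOrders.name, 'Live customer orders');
byName.set(archiveOrders.name, 'Orders older than two years');

// Keyed by full identity, both descriptions survive.
const byRef = new Map<string, string>();
byRef.set(sketchTableRefKey(salesOrders), 'Live customer orders');
byRef.set(sketchTableRefKey(archiveOrders), 'Orders older than two years');

console.log(byName.size, byRef.size); // 1 2
```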
* fix(wiki): scope connection filter before SQLite lane limit
Connection-scoped wiki search applied the connectionId allowlist after
the lexical/semantic lanes had already truncated to laneCandidatePoolLimit
over the full (connection-agnostic) corpus. When the requested connection
was a minority of a large corpus, its pages were crowded out of the
candidate pool before filtering, so a semantic-only match could be missed
outright and lexical hits under-ranked.
Push the path allowlist into searchLexicalCandidates/searchSemanticCandidates
so LIMIT applies to in-scope rows, matching what the token lane already did,
and drop the now-redundant post-limit JS filters.
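
The ordering problem can be shown with an in-memory sketch. This is only an illustration under simplifying assumptions: the real fix pushes the allowlist into the SQLite lane queries, unscoped pages are ignored here, and `PageCandidate`, `limitThenFilter` and `filterThenLimit` are hypothetical names.

```typescript
interface PageCandidate {
  path: string;
  connectionId: string;
  score: number;
}

const byScoreDesc = (a: PageCandidate, b: PageCandidate): number => b.score - a.score;

// Old behavior: truncate over the whole corpus, then drop out-of-scope rows.
function limitThenFilter(pages: PageCandidate[], connectionId: string, limit: number): PageCandidate[] {
  return [...pages]
    .sort(byScoreDesc)
    .slice(0, limit)
    .filter((page) => page.connectionId === connectionId);
}

// New behavior: restrict to in-scope rows first, so the limit applies to them only.
function filterThenLimit(pages: PageCandidate[], connectionId: string, limit: number): PageCandidate[] {
  return pages
    .filter((page) => page.connectionId === connectionId)
    .sort(byScoreDesc)
    .slice(0, limit);
}

const corpus: PageCandidate[] = [
  { path: 'a/one.md', connectionId: 'a', score: 0.9 },
  { path: 'a/two.md', connectionId: 'a', score: 0.8 },
  { path: 'a/three.md', connectionId: 'a', score: 0.7 },
  { path: 'b/only.md', connectionId: 'b', score: 0.5 },
];

const crowdedOut = limitThenFilter(corpus, 'b', 2).map((page) => page.path);
const inScope = filterThenLimit(corpus, 'b', 2).map((page) => page.path);
console.log(crowdedOut, inScope); // [] [ 'b/only.md' ]
```

With a limit of 2, connection `b`'s only page never reaches the candidate pool in the first variant, which is the "crowded out" case the message describes.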
---------
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
parent 2afab61417
commit f65a5b0e2e
200 changed files with 17780 additions and 672 deletions

@@ -78,6 +78,8 @@
"openai": "^6.38.0",
"p-limit": "^7.3.0",
"pg": "^8.21.0",
"pino": "^10.3.1",
"pino-pretty": "^13.1.3",
"posthog-node": "^5.34.9",
"react": "^19.2.6",
"semver": "^7.8.1",

@@ -7,10 +7,17 @@ const promptsSource = join(packageRoot, 'src', 'prompts');
 const promptsTarget = join(packageRoot, 'dist', 'prompts');
 const skillsSource = join(packageRoot, 'src', 'skills');
 const skillsTarget = join(packageRoot, 'dist', 'skills');
+// Per-dialect SQL notes are markdown served by the sql_dialect_notes MCP tool;
+// tsc does not emit non-.ts files, so copy them next to their compiled module.
+const dialectNotesSource = join(packageRoot, 'src', 'context', 'sql-analysis', 'dialects');
+const dialectNotesTarget = join(packageRoot, 'dist', 'context', 'sql-analysis', 'dialects');

 await rm(promptsTarget, { recursive: true, force: true });
 await rm(skillsTarget, { recursive: true, force: true });
+await rm(dialectNotesTarget, { recursive: true, force: true });
 await mkdir(dirname(promptsTarget), { recursive: true });
 await mkdir(dirname(skillsTarget), { recursive: true });
+await mkdir(dirname(dialectNotesTarget), { recursive: true });
 await cp(promptsSource, promptsTarget, { recursive: true });
 await cp(skillsSource, skillsTarget, { recursive: true });
+await cp(dialectNotesSource, dialectNotesTarget, { recursive: true });

@@ -133,7 +133,7 @@ export function parseBooleanStringOption(value: string): boolean {
 }

 export function parseSafeConnectionIdOption(value: string): string {
-  if (!/^[a-zA-Z0-9][a-zA-Z0-9_-]*$/.test(value)) {
+  if (!/^[a-zA-Z0-9_][a-zA-Z0-9_-]*$/.test(value)) {
     throw new InvalidArgumentError(`Unsafe connection id: ${value}`);
   }
   return value;

@@ -1,10 +1,12 @@
-import { type Command, Option } from '@commander-js/extra-typings';
+import { type Command, InvalidArgumentError, Option } from '@commander-js/extra-typings';
 import {
   collectOption,
   type KtxCliCommandContext,
   parsePositiveIntegerOption,
   resolveCommandProjectDir,
 } from '../cli-program.js';
+import { KTX_SCAN_ENRICHMENT_STAGES } from '../context/scan/enrichment-state.js';
+import type { KtxScanEnrichmentStage } from '../context/scan/types.js';
 import type { KtxCliDeps, KtxCliIo } from '../index.js';
 import { runtimeInstallPolicyFromFlags } from '../managed-python-command.js';
 import type { KtxPublicIngestArgs } from '../public-ingest.js';
@@ -14,6 +16,36 @@ import { resolveConnectionSelection } from './connection-selection.js';

 profileMark('module:commands/ingest-commands');

+/**
+ * Parses `--stages` into an ordered, de-duplicated subset of the canonical
+ * enrichment-stage registry. An unknown or empty name is a hard parse error so
+ * a typo never silently degrades to "run everything."
+ *
+ * @internal
+ */
+export function parseEnrichmentStagesOption(value: string): KtxScanEnrichmentStage[] {
+  const names = value
+    .split(',')
+    .map((name) => name.trim())
+    .filter((name) => name.length > 0);
+  if (names.length === 0) {
+    throw new InvalidArgumentError(
+      `must be a non-empty comma-separated list of stages (${KTX_SCAN_ENRICHMENT_STAGES.join(', ')})`,
+    );
+  }
+  const valid = new Set<string>(KTX_SCAN_ENRICHMENT_STAGES);
+  const selected = new Set<KtxScanEnrichmentStage>();
+  for (const name of names) {
+    if (!valid.has(name)) {
+      throw new InvalidArgumentError(
+        `unknown stage "${name}"; valid stages are ${KTX_SCAN_ENRICHMENT_STAGES.join(', ')}`,
+      );
+    }
+    selected.add(name as KtxScanEnrichmentStage);
+  }
+  return KTX_SCAN_ENRICHMENT_STAGES.filter((stage) => selected.has(stage));
+}
+
 interface IngestCommandOptions {
   runTextIngest: (args: KtxTextIngestArgs, io: KtxCliIo, deps: KtxCliDeps) => Promise<number>;
 }
@@ -32,8 +64,18 @@ export function registerIngestCommands(
     .addOption(new Option('--query-history', 'Include database query-history usage patterns').conflicts('noQueryHistory'))
     .addOption(new Option('--no-query-history', 'Skip database query-history usage patterns'))
     .option('--query-history-window-days <days>', 'Query-history lookback window for this run', parsePositiveIntegerOption)
+    .option(
+      '--stages <stages>',
+      'Comma-separated enrichment stages to (re)run (descriptions,embeddings,relationships); omit to run all',
+      parseEnrichmentStagesOption,
+    )
     .option('--text <content>', 'Capture inline text into ktx memory; repeatable', collectOption, [])
     .option('--file <path>', 'Capture a text file into ktx memory; use - for stdin; repeatable', collectOption, [])
+    .option(
+      '--verbatim',
+      'Store each --text/--file document body unchanged as a GLOBAL wiki page; the LLM derives only metadata',
+      false,
+    )
     .option('--connection-id <connectionId>', 'ktx connection id to tag captured text/file notes')
     .option('--user-id <id>', 'Memory user id for text/file capture attribution', 'local-cli')
     .option('--fail-fast', 'Stop after the first failed text/file item', false)
@@ -47,6 +89,14 @@ export function registerIngestCommands(
       const projectDir = resolveCommandProjectDir(command);
       const hasTextCapture = options.text.length > 0 || options.file.length > 0;

+      if (options.verbatim === true && !hasTextCapture) {
+        command.error('error: --verbatim requires --text or --file');
+      }
+
+      if (options.stages !== undefined && hasTextCapture) {
+        command.error('error: --stages applies to database ingest only; it cannot be combined with --text or --file');
+      }
+
       if (hasTextCapture) {
         if (connectionId !== undefined) {
           command.error(
@@ -66,6 +116,7 @@ export function registerIngestCommands(
             userId: options.userId,
             json: options.json === true,
             failFast: options.failFast === true,
+            ...(options.verbatim === true ? { verbatim: true } : {}),
           },
           context.io,
           context.deps,
@@ -87,6 +138,7 @@ export function registerIngestCommands(
         inputMode: options.input === false ? 'disabled' : 'auto',
         queryHistory,
         ...(options.queryHistoryWindowDays !== undefined ? { queryHistoryWindowDays: options.queryHistoryWindowDays } : {}),
+        ...(options.stages ? { stages: options.stages } : {}),
         cliVersion: context.packageInfo.version,
         runtimeInstallPolicy: runtimeInstallPolicyFromFlags(options),
       };

@@ -27,6 +27,7 @@ export function registerWikiCommands(program: Command, context: KtxCliCommandCon
     .usage('[options] [query...]')
     .argument('[query...]', 'Search query; omit to list all pages')
     .option('--user-id <id>', 'Local user id', 'local')
+    .option('-c, --connection <id>', 'Scope results to one connection (unscoped pages plus pages tagged with it)')
     .option('--limit <number>', 'Maximum search results (search mode only)', parsePositiveIntegerOption)
     .addOption(
       new Option('--output <mode>', 'Output mode: pretty (default in TTY), plain (TSV), or json').choices([
@@ -46,6 +47,7 @@ export function registerWikiCommands(program: Command, context: KtxCliCommandCon
         query: string[],
         options: {
           userId: string;
+          connection?: string;
           limit?: number;
           output?: 'pretty' | 'plain' | 'json';
           json?: boolean;
@@ -57,6 +59,7 @@ export function registerWikiCommands(program: Command, context: KtxCliCommandCon
             command: 'list',
             projectDir: resolveCommandProjectDir(command),
             userId: options.userId,
+            ...(options.connection !== undefined ? { connectionId: options.connection } : {}),
             output: options.output,
             json: options.json,
             cliVersion: context.packageInfo.version,
@@ -68,6 +71,7 @@ export function registerWikiCommands(program: Command, context: KtxCliCommandCon
           projectDir: resolveCommandProjectDir(command),
           query: query.join(' '),
           userId: options.userId,
+          ...(options.connection !== undefined ? { connectionId: options.connection } : {}),
           output: options.output,
           json: options.json,
           ...(isDebugEnabled(command) ? { debug: true } : {}),

@@ -1,6 +1,7 @@
 import type { KtxProjectConnectionConfig } from './context/project/config.js';

-const KTX_DATABASE_DRIVER_IDS = new Set([
+/** @internal Canonical SQL-warehouse driver ids; the dialect-notes coverage test derives its required coverage from this set. */
+export const KTX_DATABASE_DRIVER_IDS = [
   'sqlite',
   'postgres',
   'mysql',
@@ -8,8 +9,11 @@ const KTX_DATABASE_DRIVER_IDS = new Set([
   'sqlserver',
   'bigquery',
   'snowflake',
-  'mongodb',
-]);
+] as const;
+
+// mongodb is a database driver but has no SQL dialect, so it sits outside the
+// dialect-notes coverage set above.
+const databaseDriverIds = new Set<string>([...KTX_DATABASE_DRIVER_IDS, 'mongodb']);

 export function normalizeConnectionDriver(connection: KtxProjectConnectionConfig): string {
   return String(connection.driver ?? '')
@@ -18,5 +22,5 @@ export function normalizeConnectionDriver(connection: KtxProjectConnectionConfig
 }

 export function isDatabaseDriver(driver: string): boolean {
-  return KTX_DATABASE_DRIVER_IDS.has(driver.trim().toLowerCase());
+  return databaseDriverIds.has(driver.trim().toLowerCase());
 }

@@ -1,8 +1,14 @@
 import { BigQuery, type TableField } from '@google-cloud/bigquery';
-import { normalizeBigQueryProjectId, normalizeBigQueryRegion } from '../../context/connections/bigquery-identifiers.js';
+import {
+  normalizeBigQueryDatasetId,
+  normalizeBigQueryProjectId,
+  normalizeBigQueryRegion,
+} from '../../context/connections/bigquery-identifiers.js';
 import { getSqlDialectForDriver } from '../../context/connections/dialects.js';
+import { resolveQueryDeadlineMs, queryDeadlineExceededError } from '../../context/connections/query-deadline.js';
 import { assertReadOnlySql, limitSqlForExecution } from '../../context/connections/read-only-sql.js';
 import { tryConstraintQuery } from '../../context/scan/constraint-discovery.js';
+import { tryIntrospectObject } from '../../context/scan/object-introspection.js';
 import { scopedTableNames } from '../../context/scan/table-ref.js';
 import {
   connectorTestFailure,
@@ -35,14 +41,25 @@ export interface KtxBigQueryConnectionConfig {
   credentials_json?: string;
   location?: string;
   max_bytes_billed?: number | string;
-  job_timeout_ms?: number;
+  query_timeout_ms?: number;
   [key: string]: unknown;
 }

+/**
+ * A dataset to introspect, paired with the project that hosts it. `project`
+ * defaults to the billing project (`credentials.project_id`) when an entry has
+ * no `project.` prefix; a fully-qualified `project.dataset` entry resolves to
+ * its own host project. Jobs always bill in `credentials.project_id`.
+ */
+export interface BigQueryDatasetRef {
+  project: string;
+  dataset: string;
+}
+
 export interface KtxBigQueryResolvedConnectionConfig {
   projectId: string;
   credentials: Record<string, unknown>;
-  datasetIds: string[];
+  datasetIds: BigQueryDatasetRef[];
   location?: string;
 }

@@ -95,7 +112,7 @@ export interface KtxBigQueryDataset {

 export interface KtxBigQueryClient {
   getDatasets(input?: { maxResults?: number }): Promise<[Array<{ id?: string }>, ...unknown[]]>;
-  dataset(datasetId: string): KtxBigQueryDataset;
+  dataset(datasetId: string, projectId: string): KtxBigQueryDataset;
   createQueryJob(input: {
     query: string;
     location?: string;
@@ -116,7 +133,6 @@ export interface KtxBigQueryScanConnectorOptions {
   env?: NodeJS.ProcessEnv;
   now?: () => Date;
   maxBytesBilled?: number | string;
-  queryTimeoutMs?: number;
 }

 class DefaultBigQueryClientFactory implements KtxBigQueryClientFactory {
@@ -124,8 +140,8 @@ class DefaultBigQueryClientFactory implements KtxBigQueryClientFactory {
     const client = new BigQuery(input);
     return {
       getDatasets: (options) => client.getDatasets(options) as Promise<[Array<{ id?: string }>, ...unknown[]]>,
-      dataset: (datasetId) => {
-        const dataset = client.dataset(datasetId);
+      dataset: (datasetId, projectId) => {
+        const dataset = client.dataset(datasetId, { projectId });
         return {
           get: () => dataset.get() as Promise<unknown>,
           getTables: () => dataset.getTables() as Promise<[KtxBigQueryTableRef[], ...unknown[]]>,
@@ -145,14 +161,48 @@ function stringConfigValue(
   return typeof value === 'string' && value.trim().length > 0 ? resolveStringReference(value.trim(), env) : undefined;
 }

-function datasetIds(connection: KtxBigQueryConnectionConfig, env: NodeJS.ProcessEnv): string[] {
-  if (Array.isArray(connection.dataset_ids) && connection.dataset_ids.length > 0) {
-    return connection.dataset_ids
-      .filter((dataset) => dataset.trim().length > 0)
-      .map((dataset) => resolveStringReference(dataset, env));
+/**
+ * Parse one `dataset_ids` / `dataset_id` entry into a canonical
+ * {@link BigQueryDatasetRef}. A `project.dataset` prefix selects the host
+ * project; a bare entry defaults to `defaultProject` (the billing project).
+ * More than one dot, or an empty segment, is a config error naming the
+ * connection — never a silent mis-introspection at scan time.
+ */
+function parseBigQueryDatasetEntry(entry: string, defaultProject: string, connectionId: string): BigQueryDatasetRef {
+  const context = `connections.${connectionId}.dataset_ids entry "${entry}"`;
+  const parts = entry.split('.');
+  if (parts.length === 1) {
+    return { project: defaultProject, dataset: normalizeBigQueryDatasetId(parts[0]!, context) };
   }
-  const datasetId = stringConfigValue(connection, 'dataset_id', env);
-  return datasetId ? [datasetId] : [];
+  if (parts.length === 2) {
+    const [project, dataset] = parts;
+    if (!project || !dataset) {
+      throw new Error(`Invalid BigQuery dataset entry for ${context}: empty project or dataset segment`);
+    }
+    return {
+      project: normalizeBigQueryProjectId(project, context),
+      dataset: normalizeBigQueryDatasetId(dataset, context),
+    };
+  }
+  throw new Error(
+    `Invalid BigQuery dataset entry for ${context}: expected "dataset" or "project.dataset", got more than one "."`,
+  );
+}
+
+function resolveDatasetRefs(
+  connection: KtxBigQueryConnectionConfig,
+  env: NodeJS.ProcessEnv,
+  defaultProject: string,
+  connectionId: string,
+): BigQueryDatasetRef[] {
+  const rawEntries =
+    Array.isArray(connection.dataset_ids) && connection.dataset_ids.length > 0
+      ? connection.dataset_ids.map((dataset) => resolveStringReference(dataset, env))
+      : [stringConfigValue(connection, 'dataset_id', env)].filter((value): value is string => Boolean(value));
+  return rawEntries
+    .map((entry) => entry.trim())
+    .filter((entry) => entry.length > 0)
+    .map((entry) => parseBigQueryDatasetEntry(entry, defaultProject, connectionId));
 }

 function bigQueryMaxBytesBilledFromConnection(
@@ -169,12 +219,25 @@ function bigQueryMaxBytesBilledFromConnection(
   return undefined;
 }

-function bigQueryJobTimeoutMsFromConnection(connection: KtxBigQueryConnectionConfig | undefined): number | undefined {
-  const value = connection?.job_timeout_ms;
-  if (typeof value !== 'number') {
-    return undefined;
+// jobTimeoutMs cancels the job with a "Job timed out" message (or a timeout
+// reason in the errors array) once the deadline elapses.
+function isBigQueryTimeoutError(error: unknown): boolean {
+  if (!error || typeof error !== 'object') {
+    return false;
   }
-  return Number.isInteger(value) && value > 0 ? value : undefined;
+  const topMessage = (error as { message?: unknown }).message;
+  if (typeof topMessage === 'string' && /timed out|timeout/i.test(topMessage)) {
+    return true;
+  }
+  const errors = (error as { errors?: unknown }).errors;
+  return (
+    Array.isArray(errors) &&
+    errors.some((entry) => {
+      const reason = (entry as { reason?: unknown })?.reason;
+      const message = (entry as { message?: unknown })?.message;
+      return reason === 'timeout' || (typeof message === 'string' && /timed out|timeout/i.test(message));
+    })
+  );
 }

 function tableKind(metadataType: string | undefined): KtxSchemaTable['kind'] {
@@ -267,7 +330,7 @@ export function bigQueryConnectionConfigFromConfig(input: {
   if (!projectId) {
     throw new Error(`Native BigQuery connector requires credentials_json.project_id for connections.${input.connectionId}`);
   }
-  const resolvedDatasetIds = datasetIds(input.connection, env);
+  const resolvedDatasetIds = resolveDatasetRefs(input.connection, env, projectId, input.connectionId);
   const location = stringConfigValue(input.connection, 'location', env);
   return { projectId, credentials, datasetIds: resolvedDatasetIds, ...(location ? { location } : {}) };
 }
@@ -290,7 +353,7 @@ export class KtxBigQueryScanConnector implements KtxScanConnector {
   private readonly clientFactory: KtxBigQueryClientFactory;
   private readonly now: () => Date;
   private readonly maxBytesBilled?: number | string;
-  private readonly queryTimeoutMs?: number;
+  private readonly deadlineMs: number;
   private readonly dialect = getSqlDialectForDriver('bigquery');
   private client: KtxBigQueryClient | null = null;

@@ -304,7 +367,7 @@ export class KtxBigQueryScanConnector implements KtxScanConnector {
     this.clientFactory = options.clientFactory ?? new DefaultBigQueryClientFactory();
     this.now = options.now ?? (() => new Date());
     this.maxBytesBilled = options.maxBytesBilled ?? bigQueryMaxBytesBilledFromConnection(options.connection);
-    this.queryTimeoutMs = options.queryTimeoutMs ?? bigQueryJobTimeoutMsFromConnection(options.connection);
+    this.deadlineMs = resolveQueryDeadlineMs(options.connection);
     this.id = `bigquery:${options.connectionId}`;
   }

@@ -312,8 +375,8 @@ export class KtxBigQueryScanConnector implements KtxScanConnector {
     try {
       const client = this.getClient();
       await client.getDatasets({ maxResults: 1 });
-      for (const datasetId of this.resolved.datasetIds) {
-        await client.dataset(datasetId).get();
+      for (const ref of this.resolved.datasetIds) {
+        await client.dataset(ref.dataset, ref.project).get();
       }
       return { success: true };
     } catch (error) {
@@ -324,22 +387,23 @@ export class KtxBigQueryScanConnector implements KtxScanConnector {
   async introspect(input: KtxScanInput, _ctx: KtxScanContext): Promise<KtxSchemaSnapshot> {
     this.assertConnection(input.connectionId);
     const tables: KtxSchemaTable[] = [];
-    const datasetIds = this.requireDatasetIdsForScan();
+    const datasetRefs = this.requireDatasetIdsForScan();
     const snapshotWarnings: KtxScanWarning[] = [];
-    for (const datasetId of datasetIds) {
+    for (const ref of datasetRefs) {
       const scopedNames = input.tableScope
-        ? scopedTableNames(input.tableScope, { catalog: this.resolved.projectId, db: datasetId })
+        ? scopedTableNames(input.tableScope, { catalog: ref.project, db: ref.dataset })
         : null;
-      tables.push(...(await this.introspectDataset(datasetId, scopedNames, snapshotWarnings)));
+      tables.push(...(await this.introspectDataset(ref, scopedNames, snapshotWarnings)));
     }
+    const datasetLabels = datasetRefs.map((ref) => this.qualifiedDatasetLabel(ref));
     return {
       connectionId: this.connectionId,
       driver: 'bigquery',
       extractedAt: this.now().toISOString(),
-      scope: { catalogs: [this.resolved.projectId], datasets: datasetIds },
+      scope: { catalogs: [...new Set(datasetRefs.map((ref) => ref.project))], datasets: datasetLabels },
       metadata: {
         project_id: this.resolved.projectId,
-        datasets: datasetIds,
+        datasets: datasetLabels,
         table_count: tables.length,
         total_columns: tables.reduce((sum, table) => sum + table.columns.length, 0),
       },
@@ -400,11 +464,14 @@ export class KtxBigQueryScanConnector implements KtxScanConnector {
     return { values: valueRows.filter((row) => row.val !== null).map((row) => String(row.val)), cardinality };
   }

-  async getTableRowCount(tableName: string, datasetId = this.resolved.datasetIds[0]): Promise<number> {
-    if (!datasetId) {
+  async getTableRowCount(
+    tableName: string,
+    ref: BigQueryDatasetRef | undefined = this.resolved.datasetIds[0],
+  ): Promise<number> {
+    if (!ref) {
       return 0;
     }
-    const tables = await this.introspectDataset(datasetId, null, []);
+    const tables = await this.introspectDataset(ref, null, []);
     return tables.find((table) => table.name === tableName)?.estimatedRows ?? 0;
   }

@@ -422,12 +489,28 @@ export class KtxBigQueryScanConnector implements KtxScanConnector {
   }

   async listTables(datasetIds?: string[]): Promise<KtxTableListEntry[]> {
-    const projectId = normalizeBigQueryProjectId(this.resolved.projectId, 'table discovery');
     const region = normalizeBigQueryRegion(this.resolved.location ?? 'US', 'table discovery');
+    if (!datasetIds || datasetIds.length === 0) {
+      return this.listTablesInProject(this.resolved.projectId, region);
+    }
+    const datasetsByProject = new Map<string, string[]>();
+    for (const entry of datasetIds) {
+      const ref = parseBigQueryDatasetEntry(entry.trim(), this.resolved.projectId, this.connectionId);
+      datasetsByProject.set(ref.project, [...(datasetsByProject.get(ref.project) ?? []), ref.dataset]);
+    }
+    const entries: KtxTableListEntry[] = [];
+    for (const [project, datasets] of datasetsByProject) {
+      entries.push(...(await this.listTablesInProject(project, region, datasets)));
+    }
+    return entries;
+  }
+
+  private async listTablesInProject(project: string, region: string, datasets?: string[]): Promise<KtxTableListEntry[]> {
+    const projectId = normalizeBigQueryProjectId(project, 'table discovery');
     const params: Record<string, unknown> = {};
-    const filter = datasetIds && datasetIds.length > 0 ? 'AND table_schema IN UNNEST(@dataset_ids)' : '';
-    if (datasetIds && datasetIds.length > 0) {
-      params.dataset_ids = datasetIds;
+    const filter = datasets && datasets.length > 0 ? 'AND table_schema IN UNNEST(@dataset_ids)' : '';
+    if (datasets && datasets.length > 0) {
+      params.dataset_ids = datasets;
     }
     const rows = await this.queryRaw<{ table_schema: string; table_name: string; table_type: string }>(
       `
@@ -442,7 +525,7 @@ export class KtxBigQueryScanConnector implements KtxScanConnector {
       params,
     );
     return rows.map((row) => ({
-      catalog: this.resolved.projectId,
+      catalog: project,
       schema: row.table_schema,
       name: row.table_name,
       kind:
@@ -466,34 +549,48 @@ export class KtxBigQueryScanConnector implements KtxScanConnector {
     return this.client;
   }

-  private requireDatasetIdsForScan(): string[] {
+  private requireDatasetIdsForScan(): BigQueryDatasetRef[] {
     if (this.resolved.datasetIds.length === 0) {
       throw new Error(`Native BigQuery scan requires connections.${this.connectionId}.dataset_ids or dataset_id`);
     }
     return this.resolved.datasetIds;
   }

+  // Bare in the billing project, qualified `project.dataset` otherwise, so the
+  // snapshot's scope/metadata stay unambiguous when two projects host the same
+  // dataset name. The dotless form is the unchanged single-project label.
+  private qualifiedDatasetLabel(ref: BigQueryDatasetRef): string {
+    return ref.project === this.resolved.projectId ? ref.dataset : `${ref.project}.${ref.dataset}`;
+  }
+
   private async query(sql: string, params?: Record<string, unknown>): Promise<KtxQueryResult> {
-    const [job] = await this.getClient().createQueryJob({
-      query: sql,
-      ...(this.resolved.location ? { location: this.resolved.location } : {}),
-      ...(params && Object.keys(params).length > 0 ? { params } : {}),
-      ...(this.maxBytesBilled ? { maximumBytesBilled: String(this.maxBytesBilled) } : {}),
-      ...(this.queryTimeoutMs ? { jobTimeoutMs: this.queryTimeoutMs } : {}),
-    });
-    const [rows, , response] = await job.getQueryResults();
-    let headers = response?.schema?.fields?.map((field) => field.name || '') ?? [];
-    const headerTypes = response?.schema?.fields?.map((field) => String(field.type || 'STRING')) ?? [];
-    if (headers.length === 0 && rows.length > 0) {
-      headers = Object.keys(rows[0]!);
+    try {
+      const [job] = await this.getClient().createQueryJob({
+        query: sql,
+        ...(this.resolved.location ? { location: this.resolved.location } : {}),
+        ...(params && Object.keys(params).length > 0 ? { params } : {}),
+        ...(this.maxBytesBilled ? { maximumBytesBilled: String(this.maxBytesBilled) } : {}),
+        jobTimeoutMs: this.deadlineMs,
+      });
+      const [rows, , response] = await job.getQueryResults();
+      let headers = response?.schema?.fields?.map((field) => field.name || '') ?? [];
+      const headerTypes = response?.schema?.fields?.map((field) => String(field.type || 'STRING')) ?? [];
+      if (headers.length === 0 && rows.length > 0) {
+        headers = Object.keys(rows[0]!);
+      }
+      return {
+        headers,
+        headerTypes: headerTypes.length > 0 ? headerTypes : undefined,
+        rows: rows.map((row) => headers.map((header) => normalizeValue(row[header]))),
+        totalRows: rows.length,
+        rowCount: rows.length,
+      };
+    } catch (error) {
+      if (isBigQueryTimeoutError(error)) {
+        throw queryDeadlineExceededError(this.deadlineMs, { cause: error });
+      }
+      throw error;
     }
-    return {
-      headers,
-      headerTypes: headerTypes.length > 0 ? headerTypes : undefined,
-      rows: rows.map((row) => headers.map((header) => normalizeValue(row[header]))),
-      totalRows: rows.length,
-      rowCount: rows.length,
-    };
   }

   private async queryRaw<T extends Record<string, unknown>>(sql: string, params?: Record<string, unknown>): Promise<T[]> {
@@ -507,18 +604,18 @@ export class KtxBigQueryScanConnector implements KtxScanConnector {
   }

   private async introspectDataset(
-    datasetId: string,
+    ref: BigQueryDatasetRef,
     scopedNames: readonly string[] | null,
     snapshotWarnings: KtxScanWarning[],
   ): Promise<KtxSchemaTable[]> {
     if (scopedNames && scopedNames.length === 0) return [];
-    const dataset = this.getClient().dataset(datasetId);
+    const dataset = this.getClient().dataset(ref.dataset, ref.project);
     const [tableRefs] = await dataset.getTables();
     const scopeSet = scopedNames ? new Set(scopedNames) : null;
     const filteredTableRefs = scopeSet ? tableRefs.filter((tableRef) => scopeSet.has(tableRef.id ?? '')) : tableRefs;
     const primaryKeysResult = await tryConstraintQuery(
-      { schema: datasetId, kind: 'primary_key', isDeniedError },
-      () => this.primaryKeys(datasetId),
+      { schema: ref.dataset, kind: 'primary_key', isDeniedError },
+      () => this.primaryKeys(ref),
     );
     const primaryKeys = primaryKeysResult.ok ? primaryKeysResult.value : new Map<string, Set<string>>();
     if (!primaryKeysResult.ok) {
@@ -527,41 +624,51 @@ export class KtxBigQueryScanConnector implements KtxScanConnector {
     const tables: KtxSchemaTable[] = [];
     for (const tableRef of filteredTableRefs) {
       const tableName = tableRef.id || '';
-      const [table] = await tableRef.get();
-      const fields = table.metadata.schema?.fields ?? [];
-      tables.push({
-        catalog: this.resolved.projectId,
-        db: datasetId,
-        name: tableName,
-        kind: tableKind(table.metadata.type),
-        comment: table.metadata.description || null,
-        estimatedRows: firstNumber(table.metadata.numRows) ?? 0,
-        columns: fields.map((field) => this.toSchemaColumn(tableName, field, primaryKeys)),
-        foreignKeys: [],
-      });
+      const outcome = await tryIntrospectObject<KtxSchemaTable>(
+        { object: tableName, catalog: ref.project, db: ref.dataset },
+        async () => {
+          const [table] = await tableRef.get();
+          const fields = table.metadata.schema?.fields ?? [];
+          return {
+            catalog: ref.project,
+            db: ref.dataset,
+            name: tableName,
+            kind: tableKind(table.metadata.type),
+            comment: table.metadata.description || null,
+            estimatedRows: firstNumber(table.metadata.numRows) ?? 0,
+            columns: fields.map((field) => this.toSchemaColumn(tableName, field, primaryKeys)),
+            foreignKeys: [],
+          };
+        },
+      );
+      if (outcome.ok) {
+        tables.push(outcome.table);
+      } else {
+        snapshotWarnings.push(outcome.warning);
+      }
     }
     return tables;
   }

-  private async primaryKeys(datasetId: string): Promise<Map<string, Set<string>>> {
+  private async primaryKeys(ref: BigQueryDatasetRef): Promise<Map<string, Set<string>>> {
     const rows = await this.queryRaw<{ table_name: string; column_name: string }>(
       'SELECT tc.table_name, kcu.column_name ' +
         'FROM `' +
-        this.resolved.projectId +
+        ref.project +
         '.' +
-        datasetId +
+        ref.dataset +
         '.INFORMATION_SCHEMA.TABLE_CONSTRAINTS` tc ' +
         'JOIN `' +
-        this.resolved.projectId +
+        ref.project +
         '.' +
-        datasetId +
+        ref.dataset +
         '.INFORMATION_SCHEMA.KEY_COLUMN_USAGE` kcu ' +
         'ON tc.constraint_name = kcu.constraint_name ' +
         'AND tc.table_schema = kcu.table_schema ' +
         'AND tc.table_name = kcu.table_name ' +
|
||||
"WHERE tc.constraint_type = 'PRIMARY KEY' " +
|
||||
"AND tc.table_schema = '" +
|
||||
datasetId +
|
||||
ref.dataset +
|
||||
"' " +
|
||||
"AND NOT REGEXP_CONTAINS(kcu.column_name, r'^(stacksync_record_id|sync_primary_key)_') " +
|
||||
'ORDER BY tc.table_name, kcu.ordinal_position',
|
||||
|
|
|
|||
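The BigQuery hunk above wraps each table read in `tryIntrospectObject`, so one unreadable object becomes a warning instead of aborting the whole dataset scan. The helper's implementation is not part of this diff; the following is only a minimal synchronous sketch of that pattern. `SketchTable`, `SketchWarning`, `tryIntrospectSketch` and `introspectAll` are hypothetical names, and the real helper is awaited and carries richer types.

```typescript
// Hypothetical, simplified shapes; the real KtxSchemaTable / KtxScanWarning carry more fields.
interface SketchTable { name: string; columns: string[] }
interface SketchWarning { object: string; message: string }
type IntrospectOutcome =
  | { ok: true; table: SketchTable }
  | { ok: false; warning: SketchWarning };

// Runs one object's read and converts a thrown error into a warning outcome.
function tryIntrospectSketch(object: string, read: () => SketchTable): IntrospectOutcome {
  try {
    return { ok: true, table: read() };
  } catch (error) {
    const message = error instanceof Error ? error.message : String(error);
    return { ok: false, warning: { object, message } };
  }
}

// Collects readable objects into `tables` and failures into `warnings`.
function introspectAll(objects: string[], read: (name: string) => SketchTable) {
  const tables: SketchTable[] = [];
  const warnings: SketchWarning[] = [];
  for (const object of objects) {
    const outcome = tryIntrospectSketch(object, () => read(object));
    if (outcome.ok === true) {
      tables.push(outcome.table);
    } else {
      warnings.push(outcome.warning);
    }
  }
  return { tables, warnings };
}

// One failing object does not prevent the other two from being returned.
const scanResult = introspectAll(['orders', 'broken_view', 'users'], (name) => {
  if (name === 'broken_view') throw new Error('view references a missing table');
  return { name, columns: ['id'] };
});
```

The same loop shape appears in the SQLite connector hunk further down, where the warnings are attached to the snapshot only when at least one object failed.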
|
|
@ -1,5 +1,6 @@
|
|||
import { createClient } from '@clickhouse/client';
|
||||
import { getSqlDialectForDriver } from '../../context/connections/dialects.js';
|
||||
import { resolveQueryDeadlineMs, queryDeadlineExceededError } from '../../context/connections/query-deadline.js';
|
||||
import { assertReadOnlySql, limitSqlForExecution } from '../../context/connections/read-only-sql.js';
|
||||
import { connectorTestFailure, createKtxConnectorCapabilities, type KtxConnectorTestResult, type KtxColumnSampleInput, type KtxColumnSampleResult, type KtxColumnStatsInput, type KtxColumnStatsResult, type KtxQueryResult, type KtxReadOnlyQueryInput, type KtxScanConnector, type KtxScanContext, type KtxScanInput, type KtxSchemaColumn, type KtxSchemaSnapshot, type KtxSchemaTable, type KtxTableRef, type KtxTableSampleInput, type KtxTableListEntry, type KtxTableSampleResult } from '../../context/scan/types.js';
|
||||
import { scopedTableNames } from '../../context/scan/table-ref.js';
|
||||
|
|
@ -144,6 +145,21 @@ function maybeNumber(value: unknown): number | undefined {
|
|||
return typeof value === 'number' && Number.isFinite(value) ? value : undefined;
|
||||
}
|
||||
|
||||
// ClickHouse error code 159 = TIMEOUT_EXCEEDED, raised when max_execution_time
|
||||
// is hit. The client surfaces it via a numeric/string `code` or a "Code: 159"
|
||||
// message prefix depending on transport.
|
||||
function isClickHouseTimeoutError(error: unknown): boolean {
|
||||
if (!error || typeof error !== 'object') {
|
||||
return false;
|
||||
}
|
||||
const code = (error as { code?: unknown }).code;
|
||||
if (code === 159 || code === '159') {
|
||||
return true;
|
||||
}
|
||||
const message = (error as { message?: unknown }).message;
|
||||
return typeof message === 'string' && (/\bCode:\s*159\b/.test(message) || message.includes('TIMEOUT_EXCEEDED'));
|
||||
}
|
||||
|
||||
function parseClickHouseUrl(url: string): Partial<KtxClickHouseConnectionConfig> {
|
||||
const parsed = new URL(url);
|
||||
return {
|
||||
|
|
@ -284,6 +300,7 @@ export class KtxClickHouseScanConnector implements KtxScanConnector {
|
|||
private readonly clientFactory: KtxClickHouseClientFactory;
|
||||
private readonly endpointResolver?: KtxClickHouseEndpointResolver;
|
||||
private readonly now: () => Date;
|
||||
private readonly deadlineMs: number;
|
||||
private readonly dialect = getSqlDialectForDriver('clickhouse');
|
||||
private client: KtxClickHouseClient | null = null;
|
||||
private resolvedEndpoint: KtxClickHouseResolvedEndpoint | null = null;
|
||||
|
|
@ -299,6 +316,7 @@ export class KtxClickHouseScanConnector implements KtxScanConnector {
|
|||
this.clientFactory = options.clientFactory ?? new DefaultClickHouseClientFactory();
|
||||
this.endpointResolver = options.endpointResolver;
|
||||
this.now = options.now ?? (() => new Date());
|
||||
this.deadlineMs = resolveQueryDeadlineMs(this.connection);
|
||||
this.id = `clickhouse:${options.connectionId}`;
|
||||
}
|
||||
|
||||
|
|
@ -584,9 +602,13 @@ export class KtxClickHouseScanConnector implements KtxScanConnector {
|
|||
username: config.username,
|
||||
password: config.password ?? '',
|
||||
database: config.database,
|
||||
request_timeout: 30_000,
|
||||
// The server aborts at max_execution_time (seconds); request_timeout must
|
||||
// outlast it so the HTTP client receives the code-159 error instead of
|
||||
// giving up first and leaving the query running.
|
||||
request_timeout: this.deadlineMs + 5_000,
|
||||
clickhouse_settings: {
|
||||
output_format_json_quote_64bit_integers: 1,
|
||||
max_execution_time: Math.ceil(this.deadlineMs / 1000),
|
||||
},
|
||||
...(isProxied && config.ssl
|
||||
? {
|
||||
|
|
@ -613,19 +635,26 @@ export class KtxClickHouseScanConnector implements KtxScanConnector {
|
|||
|
||||
private async query(sql: string, params?: Record<string, unknown>): Promise<Omit<KtxQueryResult, 'rowCount'>> {
|
||||
const client = await this.clientForQuery();
|
||||
const resultSet = await client.query({
|
||||
query: assertReadOnlySql(sql),
|
||||
format: 'JSONCompact',
|
||||
...(params ? { query_params: params } : {}),
|
||||
});
|
||||
const response = (await resultSet.json()) as ClickHouseCompactResponse;
|
||||
const meta = response.meta ?? [];
|
||||
return {
|
||||
headers: meta.map((field) => field.name),
|
||||
headerTypes: meta.map((field) => field.type),
|
||||
rows: response.data ?? [],
|
||||
totalRows: response.rows ?? response.data?.length ?? 0,
|
||||
};
|
||||
try {
|
||||
const resultSet = await client.query({
|
||||
query: assertReadOnlySql(sql),
|
||||
format: 'JSONCompact',
|
||||
...(params ? { query_params: params } : {}),
|
||||
});
|
||||
const response = (await resultSet.json()) as ClickHouseCompactResponse;
|
||||
const meta = response.meta ?? [];
|
||||
return {
|
||||
headers: meta.map((field) => field.name),
|
||||
headerTypes: meta.map((field) => field.type),
|
||||
rows: response.data ?? [],
|
||||
totalRows: response.rows ?? response.data?.length ?? 0,
|
||||
};
|
||||
} catch (error) {
|
||||
if (isClickHouseTimeoutError(error)) {
|
||||
throw queryDeadlineExceededError(this.deadlineMs, { cause: error });
|
||||
}
|
||||
throw error;
|
||||
}
|
||||
}
|
||||
|
||||
private assertConnection(connectionId: string): void {
|
||||
|
|
|
|||
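The ClickHouse hunks above derive two timeouts from a single deadline and recognise the server's code-159 error. As a rough illustration of that arithmetic and classification, here is a self-contained sketch. `clickHouseTimeouts` and `looksLikeClickHouseTimeout` are hypothetical names, and the classifier mirrors the checks shown in the diff rather than any documented client contract.

```typescript
// Hypothetical helper: derives both timeouts from one deadline given in milliseconds.
function clickHouseTimeouts(deadlineMs: number): {
  maxExecutionTimeSeconds: number;
  requestTimeoutMs: number;
} {
  return {
    // The server-side limit is in whole seconds, rounded up so it never undershoots the deadline.
    maxExecutionTimeSeconds: Math.ceil(deadlineMs / 1000),
    // The HTTP client waits 5 s past the deadline, as in the diff's request_timeout setting.
    requestTimeoutMs: deadlineMs + 5_000,
  };
}

// Mirrors the diff's checks: numeric or string code 159, a "Code: 159" prefix, or TIMEOUT_EXCEEDED.
function looksLikeClickHouseTimeout(error: unknown): boolean {
  if (!error || typeof error !== 'object') {
    return false;
  }
  const { code, message } = error as { code?: unknown; message?: unknown };
  if (code === 159 || code === '159') {
    return true;
  }
  return typeof message === 'string' && /\bCode:\s*159\b|TIMEOUT_EXCEEDED/.test(message);
}

const sketchTimeouts = clickHouseTimeouts(2_500);
```

With a 2,500 ms deadline the server limit rounds up to 3 seconds and the client waits 7,500 ms, so the server's abort should reach the client before the HTTP request itself times out.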
|
|
@ -1,5 +1,6 @@
|
|||
import mysql, { type FieldPacket, type Pool, type RowDataPacket } from 'mysql2/promise';
|
||||
import { getSqlDialectForDriver } from '../../context/connections/dialects.js';
|
||||
import { resolveQueryDeadlineMs, queryDeadlineExceededError } from '../../context/connections/query-deadline.js';
|
||||
import { resolveStringReference } from '../shared/string-reference.js';
|
||||
import { assertReadOnlySql, limitSqlForExecution } from '../../context/connections/read-only-sql.js';
|
||||
import {
|
||||
|
|
@ -282,6 +283,11 @@ function isDeniedError(error: unknown): boolean {
|
|||
);
|
||||
}
|
||||
|
||||
// errno 3024 = ER_QUERY_TIMEOUT, raised when max_execution_time is exceeded.
|
||||
function isMysqlTimeoutError(error: unknown): boolean {
|
||||
return Boolean(error) && typeof error === 'object' && (error as { errno?: unknown }).errno === 3024;
|
||||
}
|
||||
|
||||
function pushConstraintWarnings(
|
||||
warnings: KtxScanWarning[],
|
||||
schemas: readonly string[],
|
||||
|
|
@ -391,6 +397,7 @@ export class KtxMysqlScanConnector implements KtxScanConnector {
|
|||
private readonly poolFactory: KtxMysqlPoolFactory;
|
||||
private readonly endpointResolver?: KtxMysqlEndpointResolver;
|
||||
private readonly now: () => Date;
|
||||
private readonly deadlineMs: number;
|
||||
private readonly dialect = getSqlDialectForDriver('mysql');
|
||||
private pool: KtxMysqlPool | null = null;
|
||||
private resolvedEndpoint: KtxMysqlResolvedEndpoint | null = null;
|
||||
|
|
@ -406,6 +413,7 @@ export class KtxMysqlScanConnector implements KtxScanConnector {
|
|||
this.poolFactory = options.poolFactory ?? new DefaultMysqlPoolFactory();
|
||||
this.endpointResolver = options.endpointResolver;
|
||||
this.now = options.now ?? (() => new Date());
|
||||
this.deadlineMs = resolveQueryDeadlineMs(this.connection);
|
||||
this.id = `mysql:${options.connectionId}`;
|
||||
}
|
||||
|
||||
|
|
@ -763,6 +771,9 @@ export class KtxMysqlScanConnector implements KtxScanConnector {
|
|||
const pool = await this.poolForQuery();
|
||||
const connection = await pool.getConnection();
|
||||
try {
|
||||
// max_execution_time (ms) bounds read-only SELECTs server-side; our path
|
||||
// only runs SELECT/WITH, so the session setting always applies.
|
||||
await connection.query('SET SESSION max_execution_time = ?', [this.deadlineMs]);
|
||||
const [rows, fields] = await connection.query(assertReadOnlySql(sql), queryParams(params));
|
||||
const headers = fields.map((field) => field.name);
|
||||
const headerTypes = fields.map((field) => String(field.type ?? 'unknown'));
|
||||
|
|
@ -772,6 +783,11 @@ export class KtxMysqlScanConnector implements KtxScanConnector {
|
|||
rows: rows.map((row) => headers.map((header) => row[header])),
|
||||
totalRows: rows.length,
|
||||
};
|
||||
} catch (error) {
|
||||
if (isMysqlTimeoutError(error)) {
|
||||
throw queryDeadlineExceededError(this.deadlineMs, { cause: error });
|
||||
}
|
||||
throw error;
|
||||
} finally {
|
||||
connection.release();
|
||||
}
|
||||
|
|
|
|||
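Each connector in this diff follows the same catch-and-translate shape: classify the driver error, rethrow it as a deadline error that keeps the original as `cause`, and let everything else pass through untouched. A minimal synchronous sketch for the MySQL case might look like the following. `sketchDeadlineExceeded` is a hypothetical stand-in for `queryDeadlineExceededError`, whose real implementation is not shown here, and the connector's actual query path is asynchronous.

```typescript
// Hypothetical stand-in for queryDeadlineExceededError; the real error shape may differ.
function sketchDeadlineExceeded(deadlineMs: number, cause: unknown): Error {
  const error = new Error(`Query exceeded the ${deadlineMs} ms deadline.`);
  error.name = 'QueryDeadlineExceeded';
  (error as Error & { cause?: unknown }).cause = cause;
  return error;
}

// errno 3024 is ER_QUERY_TIMEOUT, as noted in the diff's comment.
function isMysqlTimeoutSketch(error: unknown): boolean {
  return typeof error === 'object' && error !== null && (error as { errno?: unknown }).errno === 3024;
}

// Runs `run`, translating only timeout errors; all other errors are rethrown unchanged.
function runWithDeadlineTranslation<T>(deadlineMs: number, run: () => T): T {
  try {
    return run();
  } catch (error) {
    if (isMysqlTimeoutSketch(error)) {
      throw sketchDeadlineExceeded(deadlineMs, error);
    }
    throw error;
  }
}

// Simulate the driver raising ER_QUERY_TIMEOUT and capture what the caller would see.
const driverTimeout = { errno: 3024, message: 'Query execution was interrupted' };
let translated: unknown;
try {
  runWithDeadlineTranslation(30_000, () => {
    throw driverTimeout;
  });
} catch (error) {
  translated = error;
}
```

Keeping the driver error as `cause` means callers can match on one deadline error type while the driver-specific detail stays available for debugging.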
|
|
@ -1,5 +1,6 @@
|
|||
import { resolveStringReference } from '../shared/string-reference.js';
|
||||
import { getSqlDialectForDriver } from '../../context/connections/dialects.js';
|
||||
import { resolveQueryDeadlineMs, queryDeadlineExceededError } from '../../context/connections/query-deadline.js';
|
||||
import { assertReadOnlySql, limitSqlForExecution } from '../../context/connections/read-only-sql.js';
|
||||
import { tryConstraintQuery } from '../../context/scan/constraint-discovery.js';
|
||||
import { scopedTableNames } from '../../context/scan/table-ref.js';
|
||||
|
|
@ -260,6 +261,11 @@ function isDeniedError(error: unknown): boolean {
|
|||
return code === '42501' || code === '42P01';
|
||||
}
|
||||
|
||||
// 57014 = query_canceled, which is how statement_timeout surfaces.
|
||||
function isPostgresTimeoutError(error: unknown): boolean {
|
||||
return Boolean(error) && typeof error === 'object' && (error as { code?: unknown }).code === '57014';
|
||||
}
|
||||
|
||||
function queryRows(result: KtxPostgresQueryResult): unknown[][] {
|
||||
const headers = (result.fields ?? []).map((field) => field.name);
|
||||
return result.rows.map((row) => headers.map((header) => row[header]));
|
||||
|
|
@ -384,9 +390,13 @@ export function postgresPoolConfigFromConfig(input: {
|
|||
: { host, port: numberValue(merged.port) ?? 5432, database, user, password }),
|
||||
};
|
||||
const searchPathSchemas = searchPathSchemasFromConnection(merged);
|
||||
// statement_timeout (ms) bounds every query on connections from this pool, so
|
||||
// the server itself aborts a runaway query and frees the connection cleanly.
|
||||
const serverOptions = [`-c statement_timeout=${resolveQueryDeadlineMs(merged)}`];
|
||||
if (searchPathSchemas.length > 0) {
|
||||
config.options = `-c search_path=${searchPathSchemas.join(',')}`;
|
||||
serverOptions.unshift(`-c search_path=${searchPathSchemas.join(',')}`);
|
||||
}
|
||||
config.options = serverOptions.join(' ');
|
||||
if (merged.ssl && sslmode !== 'prefer' && sslmode !== 'disable') {
|
||||
config.ssl = { rejectUnauthorized: merged.rejectUnauthorized ?? true };
|
||||
}
|
||||
|
|
@ -412,6 +422,7 @@ export class KtxPostgresScanConnector implements KtxScanConnector {
|
|||
private readonly poolFactory: KtxPostgresPoolFactory;
|
||||
private readonly endpointResolver?: KtxPostgresEndpointResolver;
|
||||
private readonly now: () => Date;
|
||||
private readonly deadlineMs: number;
|
||||
private readonly dialect = getSqlDialectForDriver('postgres');
|
||||
private pool: KtxPostgresPool | null = null;
|
||||
private lastIdlePoolError: Error | null = null;
|
||||
|
|
@ -428,6 +439,7 @@ export class KtxPostgresScanConnector implements KtxScanConnector {
|
|||
this.poolFactory = options.poolFactory ?? new DefaultPostgresPoolFactory();
|
||||
this.endpointResolver = options.endpointResolver;
|
||||
this.now = options.now ?? (() => new Date());
|
||||
this.deadlineMs = resolveQueryDeadlineMs(this.connection);
|
||||
this.id = `postgres:${options.connectionId}`;
|
||||
}
|
||||
|
||||
|
|
@ -819,6 +831,11 @@ export class KtxPostgresScanConnector implements KtxScanConnector {
|
|||
totalRows: result.rows.length,
|
||||
rowCount: result.rows.length,
|
||||
};
|
||||
} catch (error) {
|
||||
if (isPostgresTimeoutError(error)) {
|
||||
throw queryDeadlineExceededError(this.deadlineMs, { cause: error });
|
||||
}
|
||||
throw error;
|
||||
} finally {
|
||||
client.release();
|
||||
}
|
||||
|
|
|
|||
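The Postgres pool-config hunk above changes `config.options` from an optional `search_path` flag to a list that always carries `statement_timeout` and puts `search_path` first when schemas are configured. The string assembly can be isolated as a small function; `buildPostgresServerOptions` is a hypothetical name used only for this sketch.

```typescript
// Hypothetical extraction of the option-string logic shown in the diff.
function buildPostgresServerOptions(
  searchPathSchemas: readonly string[],
  statementTimeoutMs: number,
): string {
  // statement_timeout is always present, so every pooled connection is bounded server-side.
  const serverOptions = [`-c statement_timeout=${statementTimeoutMs}`];
  if (searchPathSchemas.length > 0) {
    // search_path goes first, matching the unshift in the diff.
    serverOptions.unshift(`-c search_path=${searchPathSchemas.join(',')}`);
  }
  return serverOptions.join(' ');
}

const withSchemas = buildPostgresServerOptions(['analytics', 'public'], 30000);
const withoutSchemas = buildPostgresServerOptions([], 30000);
```

Here `withSchemas` is `-c search_path=analytics,public -c statement_timeout=30000` and `withoutSchemas` is `-c statement_timeout=30000`.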
|
|
@ -1,5 +1,6 @@
|
|||
import { createPrivateKey } from 'node:crypto';
|
||||
import { getSqlDialectForDriver } from '../../context/connections/dialects.js';
|
||||
import { resolveQueryDeadlineMs, queryDeadlineExceededError } from '../../context/connections/query-deadline.js';
|
||||
import { resolveStringReference } from '../shared/string-reference.js';
|
||||
import { assertReadOnlySql, limitSqlForExecution } from '../../context/connections/read-only-sql.js';
|
||||
import { tryConstraintQuery } from '../../context/scan/constraint-discovery.js';
|
||||
|
|
@ -60,6 +61,7 @@ export interface KtxSnowflakeResolvedConnectionConfig {
|
|||
passphrase?: string;
|
||||
role?: string;
|
||||
maxConnections: number;
|
||||
deadlineMs: number;
|
||||
}
|
||||
|
||||
export interface KtxSnowflakeRawColumnMetadata {
|
||||
|
|
@ -181,6 +183,22 @@ function isDeniedError(error: unknown): boolean {
|
|||
return false;
|
||||
}
|
||||
|
||||
// Snowflake cancels with code 604 and a "reached its statement ... timeout"
|
||||
// message once STATEMENT_TIMEOUT_IN_SECONDS elapses.
|
||||
function isSnowflakeTimeoutError(error: unknown): boolean {
|
||||
if (!error || typeof error !== 'object') {
|
||||
return false;
|
||||
}
|
||||
const code = (error as { code?: unknown }).code;
|
||||
const message = (error as { message?: unknown }).message;
|
||||
return (
|
||||
code === 604 ||
|
||||
code === '604' ||
|
||||
code === '000604' ||
|
||||
(typeof message === 'string' && /reached its (statement|warehouse) .*timeout/i.test(message))
|
||||
);
|
||||
}
|
||||
|
||||
function normalizeSnowflakeValue(value: unknown, columnType?: string): unknown {
|
||||
if (columnType && DATE_TYPES.some((type) => columnType.toUpperCase().includes(type))) {
|
||||
if (typeof value === 'number') {
|
||||
|
|
@ -282,6 +300,7 @@ export function snowflakeConnectionConfigFromConfig(input: {
|
|||
connectionId: input.connectionId,
|
||||
defaultValue: 4,
|
||||
}),
|
||||
deadlineMs: resolveQueryDeadlineMs(input.connection),
|
||||
};
|
||||
const role = stringConfigValue(input.connection, 'role', env);
|
||||
if (role) {
|
||||
|
|
@ -339,13 +358,23 @@ class SnowflakeSdkDriver implements KtxSnowflakeDriver {
|
|||
|
||||
async query(sql: string, params?: unknown): Promise<KtxQueryResult> {
|
||||
const binds = Array.isArray(params) ? toSnowflakeBinds(params) : undefined;
|
||||
const statementTimeoutSeconds = Math.ceil(this.resolved.deadlineMs / 1000);
|
||||
try {
|
||||
const pool = await this.getPool();
|
||||
const result = await pool.use(async (connection: snowflake.Connection) =>
|
||||
this.executeSnowflakeQuery(connection, sql, binds),
|
||||
);
|
||||
const result = await pool.use(async (connection: snowflake.Connection) => {
|
||||
// Bound the statement server-side; Snowflake cancels and frees the
|
||||
// warehouse slot when STATEMENT_TIMEOUT_IN_SECONDS is reached.
|
||||
await this.executeSnowflakeQuery(
|
||||
connection,
|
||||
`ALTER SESSION SET STATEMENT_TIMEOUT_IN_SECONDS = ${statementTimeoutSeconds}`,
|
||||
);
|
||||
return this.executeSnowflakeQuery(connection, sql, binds);
|
||||
});
|
||||
return { ...result, totalRows: result.rows.length, rowCount: result.rows.length };
|
||||
} catch (error) {
|
||||
if (isSnowflakeTimeoutError(error)) {
|
||||
throw queryDeadlineExceededError(this.resolved.deadlineMs, { cause: error });
|
||||
}
|
||||
const message = error instanceof Error ? error.message : String(error);
|
||||
if (/timeout/i.test(message) && /pool|acquire/i.test(message)) {
|
||||
throw new Error(
|
||||
|
|
|
|||
|
|
@ -3,19 +3,44 @@ import { existsSync, readFileSync, statSync } from 'node:fs';
|
|||
import { homedir } from 'node:os';
|
||||
import { isAbsolute, resolve } from 'node:path';
|
||||
import { fileURLToPath } from 'node:url';
|
||||
import { fork, type ChildProcess } from 'node:child_process';
|
||||
import { getSqlDialectForDriver } from '../../context/connections/dialects.js';
|
||||
import { resolveQueryDeadlineMs, queryDeadlineExceededError } from '../../context/connections/query-deadline.js';
|
||||
import { assertReadOnlySql, limitSqlForExecution } from '../../context/connections/read-only-sql.js';
|
||||
import { normalizeQueryRows } from '../../context/connections/query-executor.js';
|
||||
import { connectorTestFailure, createKtxConnectorCapabilities, type KtxConnectorTestResult, type KtxColumnSampleInput, type KtxColumnSampleResult, type KtxColumnStatsInput, type KtxColumnStatsResult, type KtxQueryResult, type KtxReadOnlyQueryInput, type KtxScanConnector, type KtxScanContext, type KtxScanInput, type KtxSchemaForeignKey, type KtxSchemaSnapshot, type KtxSchemaTable, type KtxTableListEntry, type KtxTableRef, type KtxTableSampleInput, type KtxTableSampleResult } from '../../context/scan/types.js';
|
||||
import { connectorTestFailure, createKtxConnectorCapabilities, type KtxConnectorTestResult, type KtxColumnSampleInput, type KtxColumnSampleResult, type KtxColumnStatsInput, type KtxColumnStatsResult, type KtxQueryResult, type KtxReadOnlyQueryInput, type KtxScanConnector, type KtxScanContext, type KtxScanInput, type KtxScanWarning, type KtxSchemaForeignKey, type KtxSchemaSnapshot, type KtxSchemaTable, type KtxTableListEntry, type KtxTableRef, type KtxTableSampleInput, type KtxTableSampleResult } from '../../context/scan/types.js';
|
||||
import { scopedTableNames } from '../../context/scan/table-ref.js';
|
||||
import { tryIntrospectObject } from '../../context/scan/object-introspection.js';
|
||||
|
||||
export interface KtxSqliteConnectionConfig {
|
||||
driver?: string;
|
||||
path?: string;
|
||||
url?: string;
|
||||
query_timeout_ms?: number;
|
||||
[key: string]: unknown;
|
||||
}
|
||||
|
||||
// In dist, connector.js and read-query-child.js are siblings; under vitest the
|
||||
// compiled .js is absent and Node strips types from the .ts when forking it.
|
||||
const readQueryChildUrl = existsSync(fileURLToPath(new URL('./read-query-child.js', import.meta.url)))
|
||||
? new URL('./read-query-child.js', import.meta.url)
|
||||
: new URL('./read-query-child.ts', import.meta.url);
|
||||
|
||||
/** @internal */
|
||||
export function forkReadQueryChild(): ChildProcess {
|
||||
// Empty execArgv so the child is a clean Node process (no inherited vitest /
|
||||
// inspector flags); advanced serialization preserves BigInt/Buffer in rows.
|
||||
return fork(readQueryChildUrl, {
|
||||
execArgv: [],
|
||||
serialization: 'advanced',
|
||||
stdio: ['ignore', 'ignore', 'inherit', 'ipc'],
|
||||
});
|
||||
}
|
||||
|
||||
type ReadQueryChildMessage =
|
||||
| { ok: true; headers: string[]; rows: unknown[]; totalRows: number }
|
||||
| { ok: false; message: string };
|
||||
|
||||
/** @internal */
|
||||
export interface SqliteDatabasePathInput {
|
||||
connectionId: string;
|
||||
|
|
@ -25,6 +50,8 @@ export interface SqliteDatabasePathInput {
|
|||
|
||||
export interface KtxSqliteScanConnectorOptions extends SqliteDatabasePathInput {
|
||||
now?: () => Date;
|
||||
/** @internal Test seam: spawn the read-query child so tests can observe its lifecycle. */
|
||||
spawnReadQueryChild?: () => ChildProcess;
|
||||
}
|
||||
|
||||
export interface KtxSqliteReadOnlyQueryInput extends KtxReadOnlyQueryInput {
|
||||
|
|
@ -133,6 +160,8 @@ export class KtxSqliteScanConnector implements KtxScanConnector {
|
|||
private readonly connectionId: string;
|
||||
private readonly dbPath: string;
|
||||
private readonly now: () => Date;
|
||||
private readonly deadlineMs: number;
|
||||
private readonly spawnReadQueryChild: () => ChildProcess;
|
||||
private readonly dialect = getSqlDialectForDriver('sqlite');
|
||||
private db: Database.Database | null = null;
|
||||
|
||||
|
|
@ -140,6 +169,8 @@ export class KtxSqliteScanConnector implements KtxScanConnector {
|
|||
this.connectionId = options.connectionId;
|
||||
this.dbPath = sqliteDatabasePathFromConfig(options);
|
||||
this.now = options.now ?? (() => new Date());
|
||||
this.deadlineMs = resolveQueryDeadlineMs(options.connection);
|
||||
this.spawnReadQueryChild = options.spawnReadQueryChild ?? forkReadQueryChild;
|
||||
this.id = `sqlite:${options.connectionId}`;
|
||||
}
|
||||
|
||||
|
|
@ -158,17 +189,27 @@ export class KtxSqliteScanConnector implements KtxScanConnector {
|
|||
async introspect(input: KtxScanInput, _ctx: KtxScanContext): Promise<KtxSchemaSnapshot> {
|
||||
this.assertConnection(input.connectionId);
|
||||
const database = this.database();
|
||||
const scopedNames = input.tableScope ? scopedTableNames(input.tableScope, { catalog: null, db: null }) : null;
|
||||
const scopeClause = scopedNames ? `AND name IN (${scopedNames.map(() => '?').join(', ')})` : '';
|
||||
const rawTables =
|
||||
scopedNames && scopedNames.length === 0
|
||||
? []
|
||||
: (database
|
||||
.prepare(
|
||||
`SELECT name, type FROM sqlite_master WHERE type IN ('table', 'view') AND name NOT LIKE 'sqlite_%' ${scopeClause} ORDER BY name`,
|
||||
)
|
||||
.all(...(scopedNames ?? [])) as SqliteMasterRow[]);
|
||||
const tables = rawTables.map((table) => this.readTable(database, table));
|
||||
const allObjects = database
|
||||
.prepare(
|
||||
`SELECT name, type FROM sqlite_master WHERE type IN ('table', 'view') AND name NOT LIKE 'sqlite_%' ORDER BY name`,
|
||||
)
|
||||
.all() as SqliteMasterRow[];
|
||||
const scopedNames = input.tableScope
|
||||
? new Set(scopedTableNames(input.tableScope, { catalog: null, db: null }))
|
||||
: null;
|
||||
const selectedObjects = scopedNames ? allObjects.filter((object) => scopedNames.has(object.name)) : allObjects;
|
||||
|
||||
const tables: KtxSchemaTable[] = [];
|
||||
const warnings: KtxScanWarning[] = [];
|
||||
for (const object of selectedObjects) {
|
||||
const outcome = await tryIntrospectObject({ object: object.name }, () => this.readTable(database, object));
|
||||
if (outcome.ok) {
|
||||
tables.push(outcome.table);
|
||||
} else {
|
||||
warnings.push(outcome.warning);
|
||||
}
|
||||
}
|
||||
|
||||
const fileStats = existsSync(this.dbPath) ? statSync(this.dbPath) : null;
|
||||
return {
|
||||
connectionId: this.connectionId,
|
||||
|
|
@ -180,8 +221,12 @@ export class KtxSqliteScanConnector implements KtxScanConnector {
|
|||
file_size: fileStats ? fileStats.size : 0,
|
||||
table_count: tables.length,
|
||||
total_columns: tables.reduce((sum, table) => sum + table.columns.length, 0),
|
||||
// Carries the full object inventory so a zero-match enabled_tables scope
|
||||
// can report which objects were actually available.
|
||||
...(scopedNames ? { discovered_object_names: allObjects.map((object) => object.name) } : {}),
|
||||
},
|
||||
tables,
|
||||
...(warnings.length > 0 ? { warnings } : {}),
|
||||
};
|
||||
}
|
||||
|
||||
|
|
@ -229,12 +274,81 @@ export class KtxSqliteScanConnector implements KtxScanConnector {
|
|||
return null;
|
||||
}
|
||||
|
||||
async executeReadOnly(input: KtxSqliteReadOnlyQueryInput, _ctx: KtxScanContext): Promise<KtxQueryResult> {
|
||||
async executeReadOnly(input: KtxSqliteReadOnlyQueryInput, ctx: KtxScanContext): Promise<KtxQueryResult> {
|
||||
this.assertConnection(input.connectionId);
|
||||
const result = this.query(limitSqlForExecution(input.sql, input.maxRows), input.params);
|
||||
// Validate and row-limit on the main thread so invalid SQL fails instantly
|
||||
// without spawning a process and read-only enforcement stays at the boundary.
|
||||
const sql = limitSqlForExecution(input.sql, input.maxRows);
|
||||
const result = await this.runReadQueryOffProcess(sql, input.params, ctx.signal);
|
||||
return { ...result, rowCount: result.rows.length };
|
||||
}
|
||||
|
||||
// The LLM-SQL path runs off the event loop in a short-lived child process so a
|
||||
// pathological scan cannot freeze the MCP server, and the deadline is enforced
|
||||
// by SIGKILL-ing that process. A synchronous better-sqlite3 scan never yields,
|
||||
// so a worker-thread terminate cannot interrupt it — only the OS reclaiming the
|
||||
// whole process frees the CPU. One short-lived process per call; killed on
|
||||
// completion, deadline, or external abort.
|
||||
private runReadQueryOffProcess(
|
||||
sql: string,
|
||||
params: Record<string, unknown> | unknown[] | undefined,
|
||||
signal: AbortSignal | undefined,
|
||||
): Promise<Omit<KtxQueryResult, 'rowCount'>> {
|
||||
const deadlineMs = this.deadlineMs;
|
||||
const dbPath = this.dbPath;
|
||||
return new Promise((resolvePromise, rejectPromise) => {
|
||||
const child = this.spawnReadQueryChild();
|
||||
let settled = false;
|
||||
const onDeadline = () => settle(() => rejectPromise(queryDeadlineExceededError(deadlineMs)));
|
||||
const timer = setTimeout(onDeadline, deadlineMs);
|
||||
function settle(finish: () => void): void {
|
||||
if (settled) {
|
||||
return;
|
||||
}
|
||||
settled = true;
|
||||
clearTimeout(timer);
|
||||
signal?.removeEventListener('abort', onDeadline);
|
||||
if (child.exitCode === null && child.signalCode === null) {
|
||||
child.kill('SIGKILL');
|
||||
}
|
||||
finish();
|
||||
}
|
||||
child.on('message', (message: ReadQueryChildMessage) => {
|
||||
if (message.ok) {
|
||||
settle(() =>
|
||||
resolvePromise({
|
||||
headers: message.headers,
|
||||
rows: normalizeQueryRows(message.rows),
|
||||
totalRows: message.totalRows,
|
||||
}),
|
||||
);
|
||||
} else {
|
||||
settle(() => rejectPromise(new Error(message.message)));
|
||||
}
|
||||
});
|
||||
child.on('error', (error) => settle(() => rejectPromise(error)));
|
||||
child.on('exit', (code, processSignal) => {
|
||||
if (!settled) {
|
||||
settle(() =>
|
||||
rejectPromise(
|
||||
new Error(`SQLite read process exited before returning a result (code ${code}, signal ${processSignal}).`),
|
||||
),
|
||||
);
|
||||
}
|
||||
});
|
||||
if (signal?.aborted) {
|
||||
onDeadline();
|
||||
return;
|
||||
}
|
||||
signal?.addEventListener('abort', onDeadline, { once: true });
|
||||
try {
|
||||
child.send({ dbPath, sql, params });
|
||||
} catch (error) {
|
||||
settle(() => rejectPromise(error instanceof Error ? error : new Error(String(error))));
|
||||
}
|
||||
});
|
||||
}
|
||||
|
||||
async getColumnDistinctValues(
|
||||
table: KtxTableRef,
|
||||
columnName: string,
|
||||
|
|
@ -310,16 +424,7 @@ export class KtxSqliteScanConnector implements KtxScanConnector {
|
|||
const foreignKeys = database
|
||||
.prepare(`PRAGMA foreign_key_list(${this.dialect.quoteIdentifier(table.name)})`)
|
||||
.all() as SqliteForeignKeyRow[];
|
||||
const estimatedRows =
|
||||
table.type === 'table'
|
||||
? Number(
|
||||
(
|
||||
database
|
||||
.prepare(`SELECT COUNT(*) AS count FROM ${this.dialect.quoteIdentifier(table.name)}`)
|
||||
.get() as { count: unknown }
|
||||
).count,
|
||||
)
|
||||
: null;
|
||||
const estimatedRows = table.type === 'table' ? this.readRowCount(database, table.name) : null;
|
||||
return {
|
||||
catalog: null,
|
||||
db: null,
|
||||
|
|
@ -340,6 +445,19 @@ export class KtxSqliteScanConnector implements KtxScanConnector {
|
|||
};
|
||||
}
|
||||
|
||||
// A row-count read is profiling, not structure: a failure here leaves the
|
||||
// object's structure intact rather than skipping the whole object.
|
||||
private readRowCount(database: Database.Database, name: string): number | null {
|
||||
try {
|
||||
const row = database.prepare(`SELECT COUNT(*) AS count FROM ${this.dialect.quoteIdentifier(name)}`).get() as {
|
||||
count: unknown;
|
||||
};
|
||||
return Number(row.count);
|
||||
} catch {
|
||||
return null;
|
||||
}
|
||||
}
|
||||
|
||||
private mapForeignKeys(rows: SqliteForeignKeyRow[]): KtxSchemaForeignKey[] {
|
||||
return rows
|
||||
.sort((a, b) => a.id - b.id || a.seq - b.seq)
|
||||
|
|
|
|||
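`runReadQueryOffProcess` above has several competing completion paths (child message, child error, child exit, deadline timer, abort signal), and its `settle` function ensures that only the first one wins and that cleanup runs exactly once. Stripped of the child-process details, the guard can be sketched as follows. `createSettleOnce` is a hypothetical name, and the cleanup callback stands in for the diff's `clearTimeout`, listener removal and `SIGKILL`.

```typescript
// Returns a guard that runs `cleanup` and then `finish` for the first caller only.
function createSettleOnce(cleanup: () => void): (finish: () => void) => boolean {
  let settled = false;
  return (finish) => {
    if (settled) {
      return false; // a later completion path lost the race; ignore it
    }
    settled = true;
    cleanup(); // in the connector: clear the timer, detach the abort listener, kill the child
    finish(); // resolve or reject the promise
    return true;
  };
}

// Simulate a result arriving first and the deadline firing afterwards.
const settleEvents: string[] = [];
const settleSketch = createSettleOnce(() => settleEvents.push('cleanup'));
const firstWon = settleSketch(() => settleEvents.push('result'));
const secondWon = settleSketch(() => settleEvents.push('deadline'));
```

After both calls `settleEvents` holds only `cleanup` and `result`; the late deadline callback is dropped, which is what keeps the promise from being settled twice.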
40
packages/cli/src/connectors/sqlite/read-query-child.ts
Normal file
|
|
@ -0,0 +1,40 @@
import Database from 'better-sqlite3';

// Runs on a forked child process (no bundler, no test transform), so it imports
// only better-sqlite3 and node builtins. The SQL is already read-only-validated
// and row-limited by the parent; this process just executes it and posts the
// structured-cloneable raw rows back over IPC. Its only cancellation mechanism
// is the parent sending SIGKILL: a synchronous better-sqlite3 scan never yields,
// so neither a worker-thread terminate nor any in-process timer can interrupt
// it — only the OS reclaiming the whole process can.

interface ReadQueryRequest {
  dbPath: string;
  sql: string;
  params?: Record<string, unknown> | unknown[];
}

type ReadQueryResponse =
  | { ok: true; headers: string[]; rows: unknown[]; totalRows: number }
  | { ok: false; message: string };

process.once('message', (request: ReadQueryRequest) => {
  let db: Database.Database | undefined;
  let response: ReadQueryResponse;
  try {
    db = new Database(request.dbPath, { readonly: true, fileMustExist: true });
    const statement = db.prepare(request.sql);
    const rows = (request.params ? statement.all(request.params) : statement.all()) as unknown[];
    response = {
      ok: true,
      headers: statement.columns().map((column) => column.name),
      rows,
      totalRows: rows.length,
    };
  } catch (error) {
    response = { ok: false, message: error instanceof Error ? error.message : String(error) };
  } finally {
    db?.close();
  }
  process.send?.(response, () => process.exit(0));
});

@ -1,5 +1,6 @@
|
|||
import { assertReadOnlySql, hoistLeadingCte, stripTrailingSqlNoise } from '../../context/connections/read-only-sql.js';
|
||||
import { getSqlDialectForDriver } from '../../context/connections/dialects.js';
|
||||
import { resolveQueryDeadlineMs, queryDeadlineExceededError } from '../../context/connections/query-deadline.js';
|
||||
import { tryConstraintQuery } from '../../context/scan/constraint-discovery.js';
|
||||
import { scopedTableNames } from '../../context/scan/table-ref.js';
|
||||
import {
|
||||
|
|
@ -50,6 +51,8 @@ export interface KtxSqlServerPoolConfig {
|
|||
database: string;
|
||||
user: string;
|
||||
password?: string;
|
||||
// ms; on expiry mssql sends a TDS attention that cancels the query server-side.
|
||||
requestTimeout: number;
|
||||
options: { encrypt: true; trustServerCertificate: boolean };
|
||||
pool: { max: number; min: number; idleTimeoutMillis: number };
|
||||
}
|
||||
|
|
@ -269,6 +272,11 @@ function isDeniedError(error: unknown): boolean {
|
|||
return number === 229 || number === 230 || number === 297;
|
||||
}
|
||||
|
||||
// mssql raises a RequestError with code 'ETIMEOUT' once requestTimeout elapses.
|
||||
function isSqlServerTimeoutError(error: unknown): boolean {
|
||||
return Boolean(error) && typeof error === 'object' && (error as { code?: unknown }).code === 'ETIMEOUT';
|
||||
}
|
||||
|
||||
function limitSqlForSqlServerExecution(sqlText: string, maxRows: number | undefined): string {
|
||||
const trimmed = stripTrailingSqlNoise(assertReadOnlySql(sqlText));
|
||||
if (!maxRows) {
|
||||
|
|
@ -328,6 +336,7 @@ export function sqlServerConnectionPoolConfigFromConfig(input: {
|
|||
database,
|
||||
user,
|
||||
password: stringConfigValue(merged, 'password', env),
|
||||
requestTimeout: resolveQueryDeadlineMs(merged),
|
||||
options: { encrypt: true, trustServerCertificate: merged.trustServerCertificate ?? true },
|
||||
pool: { max: maxConnections, min: 0, idleTimeoutMillis: 30000 },
|
||||
};
|
||||
|
|
@ -353,6 +362,7 @@ export class KtxSqlServerScanConnector implements KtxScanConnector {
|
|||
private readonly poolFactory: KtxSqlServerPoolFactory;
|
||||
private readonly endpointResolver?: KtxSqlServerEndpointResolver;
|
||||
private readonly now: () => Date;
|
||||
private readonly deadlineMs: number;
|
||||
private readonly dialect = getSqlDialectForDriver('sqlserver');
|
||||
private pool: KtxSqlServerPool | null = null;
|
||||
private resolvedEndpoint: KtxSqlServerResolvedEndpoint | null = null;
|
||||
|
|
@ -370,6 +380,7 @@ export class KtxSqlServerScanConnector implements KtxScanConnector {
|
|||
this.poolFactory = options.poolFactory ?? new DefaultSqlServerPoolFactory();
|
||||
this.endpointResolver = options.endpointResolver;
|
||||
this.now = options.now ?? (() => new Date());
|
||||
this.deadlineMs = resolveQueryDeadlineMs(this.connection);
|
||||
this.id = `sqlserver:${options.connectionId}`;
|
||||
}
|
||||
|
||||
|
|
@ -804,7 +815,15 @@ export class KtxSqlServerScanConnector implements KtxScanConnector {
|
|||
request.input(key, value);
|
||||
}
|
||||
}
|
||||
const result = await request.query(assertReadOnlySql(query));
|
||||
let result: KtxSqlServerQueryResult;
|
||||
try {
|
||||
result = await request.query(assertReadOnlySql(query));
|
||||
} catch (error) {
|
||||
if (isSqlServerTimeoutError(error)) {
|
||||
throw queryDeadlineExceededError(this.deadlineMs, { cause: error });
|
||||
}
|
||||
throw error;
|
||||
}
|
||||
const recordset = result.recordset ?? [];
|
||||
const columnMetadata = recordset.columns ?? {};
|
||||
const metadataHeaders = Object.keys(columnMetadata);
|
||||
|
|
|
|||
|
|
@ -98,6 +98,7 @@ export interface ContextBuildArgs {
|
|||
queryHistory?: Extract<KtxPublicIngestArgs, { command: 'run' }>['queryHistory'];
|
||||
queryHistoryWindowDays?: number;
|
||||
scanMode?: Extract<KtxPublicIngestArgs, { command: 'run' }>['scanMode'];
|
||||
stages?: Extract<KtxPublicIngestArgs, { command: 'run' }>['stages'];
|
||||
detectRelationships?: boolean;
|
||||
cliVersion?: string;
|
||||
runtimeInstallPolicy?: KtxManagedPythonInstallPolicy;
|
||||
|
|
@ -990,6 +991,7 @@ export async function runContextBuild(
|
|||
...(args.queryHistory ? { queryHistory: args.queryHistory } : {}),
|
||||
...(args.queryHistoryWindowDays !== undefined ? { queryHistoryWindowDays: args.queryHistoryWindowDays } : {}),
|
||||
...(args.scanMode ? { scanMode: args.scanMode } : {}),
|
||||
...(args.stages ? { stages: args.stages } : {}),
|
||||
...(args.detectRelationships !== undefined ? { detectRelationships: args.detectRelationships } : {}),
|
||||
...(args.cliVersion ? { cliVersion: args.cliVersion } : {}),
|
||||
...(args.runtimeInstallPolicy ? { runtimeInstallPolicy: args.runtimeInstallPolicy } : {}),
|
||||
|
|
|
|||
|
|
@ -1,4 +1,5 @@
|
|||
const BIGQUERY_PROJECT_ID_PATTERN = /^[A-Za-z0-9_-]+$/;
|
||||
const BIGQUERY_DATASET_ID_PATTERN = /^[A-Za-z0-9_]+$/;
|
||||
const BIGQUERY_REGION_PATTERN = /^[a-z0-9-]+$/;
|
||||
|
||||
export function normalizeBigQueryProjectId(value: string, context = 'historic-SQL ingest'): string {
|
||||
|
|
@ -8,6 +9,13 @@ export function normalizeBigQueryProjectId(value: string, context = 'historic-SQ
|
|||
return value;
|
||||
}
|
||||
|
||||
export function normalizeBigQueryDatasetId(value: string, context = 'historic-SQL ingest'): string {
|
||||
if (!BIGQUERY_DATASET_ID_PATTERN.test(value)) {
|
||||
throw new Error(`Invalid BigQuery dataset id for ${context}: ${value}`);
|
||||
}
|
||||
return value;
|
||||
}
|
||||
|
||||
export function normalizeBigQueryRegion(value: string, context = 'historic-SQL ingest'): string {
|
||||
const normalized = value.trim().toLowerCase().replace(/^region-/, '');
|
||||
if (!BIGQUERY_REGION_PATTERN.test(normalized)) {
|
||||
|
|
|
|||
|
|
@ -0,0 +1,24 @@
import type { KtxProjectConnectionConfig } from '../project/config.js';

function listConfiguredConnectionIds(connections: Record<string, KtxProjectConnectionConfig>): string[] {
  return Object.keys(connections).sort();
}

/**
 * Validate a connection id supplied as an explicit command/tool argument against
 * the canonical `ktx.yaml` connections map. Returns the id when configured;
 * otherwise throws an error that lists the configured ids so the caller can fix
 * the typo. Use for explicit arguments only: persisted page frontmatter that
 * references a since-removed connection must warn, not fail.
 */
export function assertConfiguredConnectionId(
  connections: Record<string, KtxProjectConnectionConfig>,
  connectionId: string,
): string {
  if (Object.hasOwn(connections, connectionId)) {
    return connectionId;
  }
  const ids = listConfiguredConnectionIds(connections);
  const configured = ids.length > 0 ? ids.join(', ') : '(none configured)';
  throw new Error(`Unknown connection "${connectionId}". Configured connections: ${configured}.`);
}
||||
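A short usage sketch of the validator above, to show the error a caller would see for a typo and why an own-property check matters. The function body is a local mirror so the example runs standalone; it swaps `Object.hasOwn` for `Object.prototype.hasOwnProperty.call` purely so it compiles against older lib targets, and `ConnectionConfig` plus the sample connection ids are hypothetical:

```typescript
// Hypothetical stand-in for KtxProjectConnectionConfig.
type ConnectionConfig = { driver: string };

// Local mirror of assertConfiguredConnectionId, repeated so the sketch is
// self-contained. hasOwnProperty.call replaces Object.hasOwn here.
function assertConfiguredConnectionId(connections: Record<string, ConnectionConfig>, connectionId: string): string {
  if (Object.prototype.hasOwnProperty.call(connections, connectionId)) {
    return connectionId;
  }
  const ids = Object.keys(connections).sort();
  const configured = ids.length > 0 ? ids.join(', ') : '(none configured)';
  throw new Error(`Unknown connection "${connectionId}". Configured connections: ${configured}.`);
}

// Runs the validator and returns either the accepted id or the error text.
function tryConnectionId(connections: Record<string, ConnectionConfig>, connectionId: string): string {
  try {
    return assertConfiguredConnectionId(connections, connectionId);
  } catch (error) {
    return (error as Error).message;
  }
}

const sampleConnections: Record<string, ConnectionConfig> = {
  warehouse: { driver: 'postgres' },
  analytics: { driver: 'sqlite' },
};

const acceptedId = tryConnectionId(sampleConnections, 'warehouse');
const typoResult = tryConnectionId(sampleConnections, 'warehous');
// 'toString' exists on Object.prototype, so an `in` check would accept it.
const inheritedResult = tryConnectionId(sampleConnections, 'toString');
const emptyResult = tryConnectionId({}, 'warehouse');

console.log(acceptedId);
console.log(typoResult);
console.log(inheritedResult);
console.log(emptyResult);
```

The own-property check is what keeps inherited names such as `toString` from being treated as configured connections.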
45
packages/cli/src/context/connections/query-deadline.ts
Normal file
|
|
@ -0,0 +1,45 @@
import { KtxQueryError } from '../../errors.js';

/**
 * Canonical default bound on read-query execution time. Generous headroom over
 * any indexed aggregate or normal profiling probe; a pathological nested-loop
 * scan blows past it immediately. Overridable per-connection via
 * `query_timeout_ms`. Production reads it through {@link resolveQueryDeadlineMs};
 * exported for the resolver's own unit tests.
 * @internal
 */
export const DEFAULT_QUERY_TIMEOUT_MS = 30_000;

interface QueryTimeoutConnectionConfig {
  query_timeout_ms?: unknown;
  [key: string]: unknown;
}

/**
 * Single source of truth for the read-query deadline: the per-connection
 * `query_timeout_ms` override (milliseconds) when present, else the default.
 * Every connector resolves through here so the default and override precedence
 * live in exactly one place. A malformed override (zero, negative, non-integer,
 * non-number) is a config error, surfaced here even though `ktx.yaml`
 * validation also rejects it, so programmatically-built connectors cannot
 * silently run unbounded.
 */
export function resolveQueryDeadlineMs(connection: QueryTimeoutConnectionConfig | undefined): number {
  const raw = connection?.query_timeout_ms;
  if (raw === undefined || raw === null) {
    return DEFAULT_QUERY_TIMEOUT_MS;
  }
  if (typeof raw !== 'number' || !Number.isInteger(raw) || raw <= 0) {
    throw new Error(`query_timeout_ms must be a positive integer in milliseconds, received ${JSON.stringify(raw)}.`);
  }
  return raw;
}

/**
 * The canonical, driver-independent timeout error an agent sees regardless of
 * which connector enforced the deadline. Reads in whole seconds. Remote
 * connectors pass the driver's own timeout error as `cause`.
 */
export function queryDeadlineExceededError(deadlineMs: number, options?: ErrorOptions): KtxQueryError {
  return new KtxQueryError(`query exceeded ${Math.round(deadlineMs / 1000)}s`, options);
}
|
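To make the override precedence concrete, here is a small sketch that feeds a few sample configs through a local mirror of the resolver. The mirror is repeated only so the example runs on its own; `describeDeadline` and the sample values are hypothetical and not part of the diff:

```typescript
const DEFAULT_QUERY_TIMEOUT_MS = 30_000;

// Local mirror of resolveQueryDeadlineMs from query-deadline.ts.
function resolveQueryDeadlineMs(connection: { query_timeout_ms?: unknown } | undefined): number {
  const raw = connection?.query_timeout_ms;
  if (raw === undefined || raw === null) {
    return DEFAULT_QUERY_TIMEOUT_MS;
  }
  if (typeof raw !== 'number' || !Number.isInteger(raw) || raw <= 0) {
    throw new Error(`query_timeout_ms must be a positive integer in milliseconds, received ${JSON.stringify(raw)}.`);
  }
  return raw;
}

// Hypothetical helper: renders either the resolved deadline or the rejection.
function describeDeadline(connection: { query_timeout_ms?: unknown } | undefined): string {
  try {
    return `${resolveQueryDeadlineMs(connection)} ms`;
  } catch (error) {
    return `rejected: ${(error as Error).message}`;
  }
}

const outcomes: string[] = [
  describeDeadline(undefined), // no connection config at all: default
  describeDeadline({}), // config without an override: default
  describeDeadline({ query_timeout_ms: 5_000 }), // valid override wins
  describeDeadline({ query_timeout_ms: 0 }), // zero is rejected
  describeDeadline({ query_timeout_ms: '5000' }), // strings are rejected
  describeDeadline({ query_timeout_ms: 1.5 }), // non-integers are rejected
];

for (const outcome of outcomes) {
  console.log(outcome);
}
```

One consequence of the whole-seconds message format: since the text is built with `Math.round(deadlineMs / 1000)`, an override below 500 ms would read as `query exceeded 0s`.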
@ -3,8 +3,9 @@ import { request as httpRequest } from 'node:http';
|
|||
import { request as httpsRequest } from 'node:https';
|
||||
import { URL } from 'node:url';
|
||||
import type { KtxProjectConnectionConfig } from '../../../project/config.js';
|
||||
import { isKtxScanWarningCode } from '../../../scan/local-structural-artifacts.js';
|
||||
import { tableRefFromKey } from '../../../scan/table-ref.js';
|
||||
import type { KtxSchemaColumn, KtxSchemaForeignKey, KtxSchemaSnapshot, KtxSchemaTable } from '../../../scan/types.js';
|
||||
import type { KtxScanWarning, KtxSchemaColumn, KtxSchemaForeignKey, KtxSchemaSnapshot, KtxSchemaTable } from '../../../scan/types.js';
|
||||
import { inferKtxDimensionType, normalizeKtxNativeType } from '../../../scan/type-normalization.js';
|
||||
import type { LiveDatabaseIntrospectionOptions, LiveDatabaseIntrospectionPort } from './types.js';
|
||||
|
||||
|
|
@ -206,10 +207,32 @@ function mapTable(raw: Record<string, unknown>): KtxSchemaTable {
|
|||
};
|
||||
}
|
||||
|
||||
function mapWarning(raw: Record<string, unknown>): KtxScanWarning | null {
|
||||
const code = optionalString(raw.code);
|
||||
// Drop codes Node cannot render, keeping the daemon and Node warning catalogs
|
||||
// in parity rather than surfacing an unknown code downstream.
|
||||
if (!code || !isKtxScanWarningCode(code)) return null;
|
||||
const table = optionalString(raw.table);
|
||||
const column = optionalString(raw.column);
|
||||
return {
|
||||
code,
|
||||
message: requiredString(raw.message, 'warnings[].message'),
|
||||
recoverable: raw.recoverable !== false,
|
||||
...(table ? { table } : {}),
|
||||
...(column ? { column } : {}),
|
||||
...(raw.metadata && typeof raw.metadata === 'object' && !Array.isArray(raw.metadata)
|
||||
? { metadata: recordValue(raw.metadata) }
|
||||
: {}),
|
||||
};
|
||||
}
|
||||
|
||||
function mapDaemonSnapshot(
|
||||
raw: Record<string, unknown>,
|
||||
input: { connectionId: string; extractedAt: string; schemas: string[] },
|
||||
): KtxSchemaSnapshot {
|
||||
const warnings = recordArray(raw.warnings)
|
||||
.map(mapWarning)
|
||||
.filter((warning): warning is KtxScanWarning => warning !== null);
|
||||
return {
|
||||
connectionId: requiredString(raw.connection_id, 'connection_id') || input.connectionId,
|
||||
driver: 'postgres',
|
||||
|
|
@ -217,6 +240,7 @@ function mapDaemonSnapshot(
|
|||
scope: { schemas: input.schemas },
|
||||
metadata: recordValue(raw.metadata),
|
||||
tables: recordArray(raw.tables).map(mapTable),
|
||||
...(warnings.length > 0 ? { warnings } : {}),
|
||||
};
|
||||
}
|
||||
|
||||
|
|
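The `mapWarning` hunk above drops any daemon warning whose code the Node side cannot render, then narrows the result with a type-guard filter. A standalone sketch of that pattern follows, under loud assumptions: `KNOWN_WARNING_CODES` is a hypothetical one-entry catalog (the real list behind `isKtxScanWarningCode` is not shown in this diff), `code_unknown_to_node` is an invented code, and the message handling is simplified relative to `requiredString`:

```typescript
// Hypothetical catalog: only 'object_introspection_failed' appears in this diff.
const KNOWN_WARNING_CODES = ['object_introspection_failed'] as const;
type KnownWarningCode = (typeof KNOWN_WARNING_CODES)[number];

interface ScanWarning {
  code: KnownWarningCode;
  message: string;
  recoverable: boolean;
  table?: string;
}

function isKnownWarningCode(code: unknown): code is KnownWarningCode {
  return typeof code === 'string' && (KNOWN_WARNING_CODES as readonly string[]).indexOf(code) !== -1;
}

// Returns null for codes outside the catalog so the caller can filter them out.
function toScanWarning(raw: Record<string, unknown>): ScanWarning | null {
  const code = raw.code;
  if (!isKnownWarningCode(code)) return null;
  const table = raw.table;
  return {
    code,
    // Simplification: the diff throws on a missing message; this falls back to ''.
    message: typeof raw.message === 'string' ? raw.message : '',
    // Anything other than an explicit false counts as recoverable.
    recoverable: raw.recoverable !== false,
    ...(typeof table === 'string' && table ? { table } : {}),
  };
}

const daemonWarnings: Array<Record<string, unknown>> = [
  { code: 'object_introspection_failed', message: 'permission denied', table: 'public.orders' },
  { code: 'code_unknown_to_node', message: 'dropped before it reaches the snapshot' },
  { code: 'object_introspection_failed', message: 'view is invalid', recoverable: false },
];

const keptWarnings = daemonWarnings
  .map((raw) => toScanWarning(raw))
  .filter((warning): warning is ScanWarning => warning !== null);

console.log(keptWarnings);
```

The type-predicate filter is what lets the resulting array be typed as `ScanWarning[]` rather than `(ScanWarning | null)[]`.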
|
|||
|
|
@ -0,0 +1,48 @@
import { readFile } from 'node:fs/promises';
import { join } from 'node:path';
import type { SourceFetchReport } from '../../types.js';
import { LIVE_DATABASE_WARNINGS_FILE } from './stage.js';

const OBJECT_SKIP_CODE = 'object_introspection_failed';

interface RawWarning {
  code?: unknown;
  message?: unknown;
  table?: unknown;
}

/**
 * Derives the fetch report from the staged `warnings.json`: objects that failed
 * introspection become `skipped` entries so the run report, ingest summary, and
 * `ktx status` can surface them. Returns null when nothing was skipped, keeping
 * clean ingests free of an empty report.
 */
export async function readLiveDatabaseFetchReport(stagedDir: string): Promise<SourceFetchReport | null> {
  let parsed: unknown;
  try {
    parsed = JSON.parse(await readFile(join(stagedDir, LIVE_DATABASE_WARNINGS_FILE), 'utf8'));
  } catch {
    return null;
  }
  const warnings =
    parsed && typeof parsed === 'object' && Array.isArray((parsed as { warnings?: unknown }).warnings)
      ? ((parsed as { warnings: RawWarning[] }).warnings)
      : [];

  const skipped = warnings
    .filter((warning) => warning.code === OBJECT_SKIP_CODE)
    .map((warning) => ({
      rawPath: '',
      entityType: 'database_object',
      entityId: typeof warning.table === 'string' ? warning.table : null,
      severity: 'warning' as const,
      statusCode: null,
      message: typeof warning.message === 'string' ? warning.message : 'introspection failed',
      retryRecommended: false,
    }));

  if (skipped.length === 0) {
    return null;
  }
  return { status: 'partial', retryRecommended: false, skipped, warnings: [] };
}
|
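The derivation described in the doc comment above can be illustrated without touching the file system. This is a reduced sketch, assuming the staged file parses to an object with a `warnings` array as the code above expects; `skippedObjectsFrom`, the trimmed `SkippedObject` shape, and the `unrelated_code` sample are hypothetical:

```typescript
const OBJECT_SKIP_CODE = 'object_introspection_failed';

// Trimmed-down entry; the real report carries more fields (rawPath, severity, ...).
interface SkippedObject {
  entityType: 'database_object';
  entityId: string | null;
  message: string;
}

// Pure, synchronous core of the derivation: parsed warnings.json in, skipped
// entries out. File reading and the remaining report fields are left out.
function skippedObjectsFrom(parsed: unknown): SkippedObject[] {
  const candidate = parsed && typeof parsed === 'object' ? (parsed as { warnings?: unknown }).warnings : undefined;
  const warnings: Array<Record<string, unknown>> = Array.isArray(candidate) ? candidate : [];
  return warnings
    .filter((warning) => warning.code === OBJECT_SKIP_CODE)
    .map((warning) => ({
      entityType: 'database_object' as const,
      entityId: typeof warning.table === 'string' ? warning.table : null,
      message: typeof warning.message === 'string' ? warning.message : 'introspection failed',
    }));
}

const stagedWarnings: unknown = {
  warnings: [
    { code: 'object_introspection_failed', table: 'public.orders', message: 'permission denied' },
    { code: 'unrelated_code', message: 'not an object skip, so it is ignored here' },
    { code: 'object_introspection_failed' }, // no table or message: fallbacks apply
  ],
};

const skippedObjects = skippedObjectsFrom(stagedWarnings);
const cleanRun = skippedObjectsFrom({ warnings: [] });
const malformed = skippedObjectsFrom('not an object');

console.log(skippedObjects);
console.log(cleanRun.length, malformed.length);
```

An empty result here corresponds to the case where the function in the diff returns null instead of an empty report.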
@ -1,5 +1,7 @@
|
|||
import type { ChunkResult, DiffSet, FetchContext, SourceAdapter } from '../../types.js';
|
||||
import type { ChunkResult, DiffSet, FetchContext, SourceAdapter, SourceFetchReport } from '../../types.js';
|
||||
import { chunkLiveDatabaseStagedDir } from './chunk.js';
|
||||
import { readLiveDatabaseFetchReport } from './fetch-report.js';
|
||||
import { assertLiveDatabaseScanOutcome } from './scan-outcome.js';
|
||||
import { detectLiveDatabaseStagedDir, writeLiveDatabaseSnapshot } from './stage.js';
|
||||
import type { LiveDatabaseSourceAdapterDeps } from './types.js';
|
||||
|
||||
|
|
@ -13,14 +15,20 @@ export class LiveDatabaseSourceAdapter implements SourceAdapter {
|
|||
return detectLiveDatabaseStagedDir(stagedDir);
|
||||
}
|
||||
|
||||
readFetchReport(stagedDir: string): Promise<SourceFetchReport | null> {
|
||||
return readLiveDatabaseFetchReport(stagedDir);
|
||||
}
|
||||
|
||||
async fetch(_pullConfig: unknown, stagedDir: string, ctx: FetchContext): Promise<void> {
|
||||
const tableScope = ctx.tableScope;
|
||||
const snapshot = await this.deps.introspection.extractSchema(ctx.connectionId, { tableScope });
|
||||
await writeLiveDatabaseSnapshot(stagedDir, {
|
||||
const finalized = {
|
||||
...snapshot,
|
||||
connectionId: ctx.connectionId,
|
||||
extractedAt: snapshot.extractedAt ?? (this.deps.now ?? (() => new Date()))().toISOString(),
|
||||
});
|
||||
};
|
||||
assertLiveDatabaseScanOutcome({ connectionId: ctx.connectionId, scope: tableScope, snapshot: finalized });
|
||||
await writeLiveDatabaseSnapshot(stagedDir, finalized);
|
||||
}
|
||||
|
||||
chunk(stagedDir: string, diffSet?: DiffSet): Promise<ChunkResult> {
|
||||
|
|
|
|||
|
|
@ -162,7 +162,8 @@ function getShardKey(connectionType: string, catalog: string | null, db: string
|
|||
}
|
||||
}
|
||||
|
||||
function buildTableRef(name: string, catalog: string | null, db: string | null): string {
|
||||
/** @internal */
|
||||
export function buildTableRef(name: string, catalog: string | null, db: string | null): string {
|
||||
const parts: string[] = [];
|
||||
if (catalog) {
|
||||
parts.push(catalog);
|
||||
|
|
@ -273,7 +274,10 @@ export function buildLiveDatabaseManifestShards(
|
|||
for (const table of input.tables) {
|
||||
const shardKey = getShardKey(input.connectionType, table.catalog, table.db);
|
||||
const shard = shards.get(shardKey) ?? { tables: {} };
|
||||
const existingDescriptions = input.existingDescriptions?.get(table.name);
|
||||
// Existing descriptions/usage are keyed by the fully-qualified ref so two
|
||||
// same-named tables in different schemas never share an entry.
|
||||
const fullRef = buildTableRef(table.name, table.catalog, table.db);
|
||||
const existingDescriptions = input.existingDescriptions?.get(fullRef);
|
||||
|
||||
const columns: LiveDatabaseManifestColumn[] = table.columns.map((column) => {
|
||||
const manifestColumn: LiveDatabaseManifestColumn = {
|
||||
|
|
@ -297,7 +301,7 @@ export function buildLiveDatabaseManifestShards(
|
|||
});
|
||||
|
||||
const entry: LiveDatabaseManifestTableEntry = {
|
||||
table: buildTableRef(table.name, table.catalog, table.db),
|
||||
table: fullRef,
|
||||
columns,
|
||||
};
|
||||
|
||||
|
|
@ -306,7 +310,7 @@ export function buildLiveDatabaseManifestShards(
|
|||
entry.descriptions = tableDescriptions;
|
||||
}
|
||||
|
||||
const usage = mergeUsagePreservingExternal(input.existingUsage?.get(table.name), table.usage);
|
||||
const usage = mergeUsagePreservingExternal(input.existingUsage?.get(fullRef), table.usage);
|
||||
if (usage) {
|
||||
entry.usage = usage;
|
||||
}
|
||||
|
|
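The hunks above switch the `existingDescriptions` and `existingUsage` lookups from the bare table name to the fully-qualified ref. A small sketch of the collision this avoids, assuming a dot-joined ref format: `qualifiedRef` is a hypothetical stand-in for `buildTableRef` (whose full body is not visible in this diff), and the sample tables are invented:

```typescript
// Hypothetical stand-in for buildTableRef; the dot-joined format is an assumption.
function qualifiedRef(name: string, catalog: string | null, db: string | null): string {
  return [catalog, db, name].filter((part): part is string => Boolean(part)).join('.');
}

interface SampleTable {
  name: string;
  catalog: string | null;
  db: string | null;
  description: string;
}

const sampleTables: SampleTable[] = [
  { name: 'orders', catalog: null, db: 'sales', description: 'Live order headers' },
  { name: 'orders', catalog: null, db: 'archive', description: 'Orders older than two years' },
];

const byBareName = new Map<string, string>();
const byFullRef = new Map<string, string>();
for (const table of sampleTables) {
  // Keyed by bare name: the second 'orders' silently overwrites the first.
  byBareName.set(table.name, table.description);
  // Keyed by qualified ref: each schema keeps its own entry.
  byFullRef.set(qualifiedRef(table.name, table.catalog, table.db), table.description);
}

console.log(byBareName.size, byFullRef.size);
console.log(byFullRef.get('sales.orders'), byFullRef.get('archive.orders'));
```

Keying by the qualified ref means whatever populates those maps would need to use the same ref format, which is presumably why `buildTableRef` is now exported.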
|
|||
|
|
@ -0,0 +1,55 @@
import { KtxExpectedError } from '../../../../errors.js';
import { tableRefFromKey, type KtxTableRefKey } from '../../../scan/table-ref.js';
import type { KtxSchemaSnapshot } from '../../../scan/types.js';

const OBJECT_SKIP_CODE = 'object_introspection_failed';

function formatScopeEntry(key: KtxTableRefKey): string {
  const ref = tableRefFromKey(key);
  return [ref.catalog, ref.db, ref.name].filter((part): part is string => Boolean(part)).join('.');
}

function discoveredObjectNames(snapshot: KtxSchemaSnapshot): string[] {
  const raw = (snapshot.metadata as Record<string, unknown>).discovered_object_names;
  return Array.isArray(raw) ? raw.filter((value): value is string => typeof value === 'string') : [];
}

/**
 * Enforces the partial-vs-total outcome rules for a live-database snapshot,
 * uniformly for every connector. Outcomes follow from object counts, not a
 * mode: a connection with at least one ingested object succeeds (any broken
 * objects ride along as warnings); a connection where every introspected object
 * failed, or a non-empty enabled_tables scope that matched nothing, raises a
 * clear connection error instead of staging an empty layer that would later
 * surface as the generic "did not recognize" message. A legitimately empty
 * database (no scope, no objects) succeeds with an empty layer.
 */
export function assertLiveDatabaseScanOutcome(input: {
  connectionId: string;
  scope: ReadonlySet<KtxTableRefKey> | undefined;
  snapshot: KtxSchemaSnapshot;
}): void {
  const { connectionId, scope, snapshot } = input;
  if (snapshot.tables.length > 0) {
    return;
  }

  const skipped = (snapshot.warnings ?? []).filter((warning) => warning.code === OBJECT_SKIP_CODE);
  if (skipped.length > 0) {
    const detail = skipped.map((warning) => `${warning.table ?? 'object'}: ${warning.message}`).join('; ');
    throw new KtxExpectedError(
      `Connection "${connectionId}" produced no semantic layer: all ${skipped.length} introspected ` +
        `${skipped.length === 1 ? 'object' : 'objects'} failed (${detail}).`,
    );
  }

  if (scope && scope.size > 0) {
    const requested = [...scope].map(formatScopeEntry).sort();
    const available = discoveredObjectNames(snapshot);
    const availableClause = available.length > 0 ? ` Available objects: ${available.join(', ')}.` : '';
    throw new KtxExpectedError(
      `enabled_tables for connection "${connectionId}" matched no objects ` +
        `(looked for: ${requested.join(', ')}).${availableClause}`,
    );
  }
}
|
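The outcome rules in the doc comment above reduce to a short decision ladder over three counts. The sketch below restates that ladder as a pure classifier so the cases can be read side by side; `classifyScanOutcome`, `ScanCounts`, and the outcome labels are hypothetical names, and the real function throws instead of returning a label:

```typescript
type ScanOutcome = 'ok' | 'all_objects_failed' | 'scope_matched_nothing';

interface ScanCounts {
  ingestedTables: number; // snapshot.tables.length
  skippedObjects: number; // warnings with the object-skip code
  scopeEntries: number; // size of the enabled_tables scope, 0 when unset
}

// Same check order as the diff: tables first, then skips, then scope.
function classifyScanOutcome(counts: ScanCounts): ScanOutcome {
  if (counts.ingestedTables > 0) return 'ok';
  if (counts.skippedObjects > 0) return 'all_objects_failed';
  if (counts.scopeEntries > 0) return 'scope_matched_nothing';
  return 'ok';
}

const scenarios: Array<[string, ScanCounts]> = [
  ['partial success', { ingestedTables: 3, skippedObjects: 2, scopeEntries: 0 }],
  ['every object failed', { ingestedTables: 0, skippedObjects: 4, scopeEntries: 0 }],
  ['scope matched nothing', { ingestedTables: 0, skippedObjects: 0, scopeEntries: 2 }],
  ['failed objects inside a scope', { ingestedTables: 0, skippedObjects: 1, scopeEntries: 2 }],
  ['empty database', { ingestedTables: 0, skippedObjects: 0, scopeEntries: 0 }],
];

const classified = scenarios.map(([label, counts]) => `${label}: ${classifyScanOutcome(counts)}`);
for (const line of classified) {
  console.log(line);
}
```

Because the skip check runs before the scope check, a scoped run whose objects all failed reports the introspection failures rather than a scope mismatch.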
@ -136,13 +136,13 @@ export async function readLiveDatabaseTableFiles(stagedDir: string): Promise<Liv
|
|||
}
|
||||
|
||||
export async function detectLiveDatabaseStagedDir(stagedDir: string): Promise<boolean> {
|
||||
// A valid live-database staging is identified by its connection.json marker.
|
||||
// An empty table set is a legitimate outcome (an empty database), so the
|
||||
// presence of table files is not required — the total-vs-partial decision is
|
||||
// made earlier by assertLiveDatabaseScanOutcome, before staging.
|
||||
try {
|
||||
const meta = JSON.parse(await readFile(join(stagedDir, LIVE_DATABASE_META_FILE), 'utf8')) as unknown;
|
||||
if (!meta || typeof meta !== 'object' || Array.isArray(meta)) {
|
||||
return false;
|
||||
}
|
||||
const files = await readLiveDatabaseTableFiles(stagedDir);
|
||||
return files.length > 0;
|
||||
return Boolean(meta) && typeof meta === 'object' && !Array.isArray(meta);
|
||||
} catch {
|
||||
return false;
|
||||
}
|
||||
|
|
|
|||
|
|
@ -3,7 +3,7 @@ import { z } from 'zod';
|
|||
const metabaseSyncModeSchema = z.enum(['ALL', 'ONLY', 'EXCEPT']);
|
||||
export type MetabaseSyncMode = z.infer<typeof metabaseSyncModeSchema>;
|
||||
|
||||
const metabaseLocalConnectionIdSchema = z.string().regex(/^[a-zA-Z0-9][a-zA-Z0-9_-]*$/);
|
||||
const metabaseLocalConnectionIdSchema = z.string().regex(/^[a-zA-Z0-9_][a-zA-Z0-9_-]*$/);
|
||||
|
||||
/**
|
||||
* The lean config the adapter needs at `fetch()` time. Lives in the BullMQ payload's
|
||||
|
|
|
|||
|
|
@ -1081,6 +1081,7 @@ export class IngestBundleRunner {
|
|||
skillsPrompt: input.skillsPrompt,
|
||||
syncId: input.syncId,
|
||||
sourceKey: input.job.sourceKey,
|
||||
connectionId: input.job.connectionId,
|
||||
canonicalPins: input.canonicalPins,
|
||||
});
|
||||
|
||||
|
|
|
|||
|
|
@ -478,11 +478,11 @@ function parseKnowledgeIndexPath(file: string): { scope: 'GLOBAL' | 'USER'; page
|
|||
const segments = file.split('/');
|
||||
if (segments.length === 2 && segments[0] === 'global') {
|
||||
const pageKey = segments[1].replace(/\.md$/, '');
|
||||
return /^[a-zA-Z0-9][a-zA-Z0-9_-]*$/.test(pageKey) ? { scope: 'GLOBAL', pageKey } : null;
|
||||
return /^[a-zA-Z0-9_][a-zA-Z0-9_-]*$/.test(pageKey) ? { scope: 'GLOBAL', pageKey } : null;
|
||||
}
|
||||
if (segments.length === 3 && segments[0] === 'user') {
|
||||
const pageKey = segments[2].replace(/\.md$/, '');
|
||||
return /^[a-zA-Z0-9][a-zA-Z0-9_-]*$/.test(pageKey) ? { scope: 'USER', pageKey } : null;
|
||||
return /^[a-zA-Z0-9_][a-zA-Z0-9_-]*$/.test(pageKey) ? { scope: 'USER', pageKey } : null;
|
||||
}
|
||||
return null;
|
||||
}
|
||||
|
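Several hunks in this diff make the same one-character change to the identifier pattern: the first character may now also be an underscore. A quick sketch comparing the two patterns on a few sample ids, to show what the relaxation admits and what stays rejected. The constant names and the sample ids are hypothetical; the two regular expressions are copied from the diff:

```typescript
const PATTERN_BEFORE = /^[a-zA-Z0-9][a-zA-Z0-9_-]*$/;
const PATTERN_AFTER = /^[a-zA-Z0-9_][a-zA-Z0-9_-]*$/;

const sampleIds = ['warehouse', '_staging', '-dash', 'a/b', '..', ''];

const comparison = sampleIds.map((id) => ({
  id,
  before: PATTERN_BEFORE.test(id),
  after: PATTERN_AFTER.test(id),
}));

for (const row of comparison) {
  console.log(JSON.stringify(row));
}
```

Only the leading-underscore id changes status. A leading hyphen, path separators, dots, and the empty string are rejected by both patterns, so the relaxed pattern still admits only letters, digits, underscores, and hyphens.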
|
|
|||
|
|
@ -104,7 +104,7 @@ class LocalIngestPhase implements IngestJobPhase {
|
|||
}
|
||||
|
||||
function safeSegment(kind: string, value: string): string {
|
||||
if (!/^[a-zA-Z0-9][a-zA-Z0-9_-]*$/.test(value)) {
|
||||
if (!/^[a-zA-Z0-9_][a-zA-Z0-9_-]*$/.test(value)) {
|
||||
throw new Error(`Unsafe ${kind}: ${value}`);
|
||||
}
|
||||
return value;
|
||||
|
|
|
|||
|
|
@ -10,7 +10,7 @@ import type { MemoryFlowEventSink, MemoryFlowPlannedWorkUnit } from './memory-fl
|
|||
import { buildSyncId } from './raw-sources-paths.js';
|
||||
import { SqliteLocalIngestStore } from './sqlite-local-ingest-store.js';
|
||||
import type { KtxTableRefKey } from '../scan/table-ref.js';
|
||||
import type { IngestTrigger, SourceAdapter, WorkUnit } from './types.js';
|
||||
import type { IngestTrigger, SourceAdapter, SourceFetchReport, WorkUnit } from './types.js';
|
||||
|
||||
type LocalIngestStatus = 'running' | 'done' | 'error';
|
||||
|
||||
|
|
@ -46,6 +46,8 @@ export interface LocalIngestRunRecord {
|
|||
workUnits: Array<Pick<WorkUnit, 'unitKey' | 'rawFiles' | 'peerFileIndex' | 'dependencyPaths'>>;
|
||||
evictionDeletedRawPaths: string[];
|
||||
errors: string[];
|
||||
/** Fetch-phase outcome (e.g. objects skipped during introspection). */
|
||||
fetch?: SourceFetchReport;
|
||||
}
|
||||
|
||||
export type LocalIngestReport = LocalIngestRunRecord & {
|
||||
|
|
@ -70,7 +72,7 @@ const LOCAL_AUTHOR = 'ktx';
|
|||
const LOCAL_AUTHOR_EMAIL = 'ktx@example.com';
|
||||
|
||||
function safeSegment(kind: string, value: string): string {
|
||||
if (!/^[a-zA-Z0-9][a-zA-Z0-9_-]*$/.test(value)) {
|
||||
if (!/^[a-zA-Z0-9_][a-zA-Z0-9_-]*$/.test(value)) {
|
||||
throw new Error(`Unsafe ${kind}: ${value}`);
|
||||
}
|
||||
return value;
|
||||
|
|
@ -291,6 +293,8 @@ async function runLocalStageOnlyIngestInner(options: RunLocalStageOnlyIngestOpti
|
|||
throw new Error(`Adapter "${adapter.source}" did not recognize ${sourceDir ?? 'fetched source output'}`);
|
||||
}
|
||||
|
||||
const fetchReport = adapter.readFetchReport ? await adapter.readFetchReport(stagedDir) : null;
|
||||
|
||||
const relativeFiles = await walkFiles(stagedDir);
|
||||
options.memoryFlow?.update({ sourceDir });
|
||||
options.memoryFlow?.emit({
|
||||
|
|
@ -405,6 +409,7 @@ async function runLocalStageOnlyIngestInner(options: RunLocalStageOnlyIngestOpti
|
|||
})),
|
||||
evictionDeletedRawPaths: chunkResult.eviction?.deletedRawPaths ?? [],
|
||||
errors: [],
|
||||
...(fetchReport ? { fetch: fetchReport } : {}),
|
||||
};
|
||||
|
||||
if (!options.dryRun) {
|
||||
|
|
|
|||
|
|
@ -26,14 +26,16 @@ export function buildWuSystemPrompt(params: {
|
|||
skillsPrompt: string;
|
||||
syncId: string;
|
||||
sourceKey: string;
|
||||
connectionId?: string;
|
||||
canonicalPins?: CanonicalPin[];
|
||||
}): string {
|
||||
const connectionLine = params.connectionId ? `\nconnectionId: ${params.connectionId}` : '';
|
||||
const parts = [
|
||||
params.baseFraming.trimEnd(),
|
||||
VERIFICATION_LEDGER_PROMPT,
|
||||
params.skillsPrompt.trimEnd(),
|
||||
buildCanonicalPinsPromptBlock(params.canonicalPins ?? []),
|
||||
`\n<context>\nsyncId: ${params.syncId}\nsource: ${params.sourceKey}\n</context>`,
|
||||
`\n<context>\nsyncId: ${params.syncId}\nsource: ${params.sourceKey}${connectionLine}\n</context>`,
|
||||
];
|
||||
return parts.filter(Boolean).join('\n');
|
||||
}
|
||||
|
|
|
|||
|
|
@ -4,7 +4,7 @@ import { BaseTool, type ToolContext, type ToolOutput } from '../../../../context
|
|||
|
||||
const discoverDataInputSchema = z.object({
|
||||
query: z.string().optional(),
|
||||
connectionId: z.string().regex(/^[a-zA-Z0-9][a-zA-Z0-9_-]*$/).optional(),
|
||||
connectionId: z.string().regex(/^[a-zA-Z0-9_][a-zA-Z0-9_-]*$/).optional(),
|
||||
limit: z.number().int().positive().max(50).optional().default(10),
|
||||
sourceName: z.string().optional(),
|
||||
}).strict();
|
||||
|
|
|
|||
|
|
@ -14,7 +14,7 @@ const targetSchema = z.union([
|
|||
]);
|
||||
|
||||
const entityDetailsInputSchema = z.object({
|
||||
connectionId: z.string().regex(/^[a-zA-Z0-9][a-zA-Z0-9_-]*$/),
|
||||
connectionId: z.string().regex(/^[a-zA-Z0-9_][a-zA-Z0-9_-]*$/),
|
||||
targets: z.array(targetSchema).min(1).max(50),
|
||||
}).strict();
|
||||
|
||||
|
|
|
|||
|
|
@ -6,7 +6,7 @@ import type { SqlAnalysisPort } from '../../../../context/sql-analysis/ports.js'
|
|||
import { BaseTool, type ToolContext, type ToolOutput } from '../../../../context/tools/base-tool.js';
|
||||
|
||||
const sqlExecutionInputSchema = z.object({
|
||||
connectionId: z.string().regex(/^[a-zA-Z0-9][a-zA-Z0-9_-]*$/),
|
||||
connectionId: z.string().regex(/^[a-zA-Z0-9_][a-zA-Z0-9_-]*$/),
|
||||
sql: z.string().min(1),
|
||||
rowLimit: z.number().int().positive().max(1000).optional().default(100),
|
||||
}).strict();
|
||||
|
|
|
|||
|
|
@ -172,6 +172,12 @@ export class AiSdkKtxLlmRuntime implements KtxLlmRuntimePort {
|
|||
this.logger = deps.logger ?? noopLogger;
|
||||
}
|
||||
|
||||
// HTTP backend: abortSignal cancels the underlying fetch natively, so there is
|
||||
// no SDK-owned child to tree-kill.
|
||||
subprocessForkSpec(): null {
|
||||
return null;
|
||||
}
|
||||
|
||||
private async generateTextWithRateLimitRetry<T>(
|
||||
provider: RateLimitProvider,
|
||||
abortSignal: AbortSignal | undefined,
|
||||
|
|
|
|||
|
|
@ -6,6 +6,7 @@ import {
|
|||
type SDKResultMessage,
|
||||
} from '@anthropic-ai/claude-agent-sdk';
|
||||
import { z } from 'zod';
|
||||
import type { KtxModelRole } from '../../llm/types.js';
|
||||
import { createAbortError, isAbortError, throwIfAborted } from '../core/abort.js';
|
||||
import { createKtxClaudeCodeEnv } from './claude-code-env.js';
|
||||
import { resolveClaudeCodeModel } from './claude-code-models.js';
|
||||
|
|
@ -13,6 +14,7 @@ import type { RateLimitGovernor, RateLimitSignal } from './rate-limit-governor.j
|
|||
import { createClaudeSdkTools, mcpToolIds } from './runtime-tools.js';
|
||||
import type {
|
||||
KtxGenerateObjectInput,
|
||||
KtxGenerateStructuredJsonInput,
|
||||
KtxGenerateTextInput,
|
||||
KtxLlmRuntimePort,
|
||||
KtxRuntimeToolSet,
|
||||
|
|
@ -20,6 +22,7 @@ import type {
|
|||
RunLoopParams,
|
||||
RunLoopResult,
|
||||
RunLoopStopReason,
|
||||
SubprocessRuntimeForkSpec,
|
||||
} from './runtime-port.js';
|
||||
|
||||
type QueryResult = AsyncIterable<SDKMessage> & {
|
||||
|
|
@ -389,9 +392,15 @@ export class ClaudeCodeKtxLlmRuntime implements KtxLlmRuntimePort {
|
|||
return result.result;
|
||||
}
|
||||
|
||||
async generateObject<TOutput, TSchema extends z.ZodType<TOutput>>(
|
||||
input: KtxGenerateObjectInput<TOutput, TSchema>,
|
||||
): Promise<TOutput> {
|
||||
// Structured generation has no tools, so generateObject and
|
||||
// generateStructuredJson (the kill-boundary child path) share this one query.
|
||||
private async runStructuredQuery(input: {
|
||||
role: KtxModelRole;
|
||||
prompt: string;
|
||||
system?: string;
|
||||
jsonSchema: Record<string, unknown>;
|
||||
abortSignal?: AbortSignal;
|
||||
}): Promise<SDKResultMessage> {
|
||||
const options = {
|
||||
...baseOptions({
|
||||
projectDir: this.deps.projectDir,
|
||||
|
|
@ -403,19 +412,30 @@ export class ClaudeCodeKtxLlmRuntime implements KtxLlmRuntimePort {
|
|||
// 5 leaves headroom without enabling unbounded loops; the json_schema
|
||||
// constraint still forces the final answer to be the schema.
|
||||
maxTurns: 5,
|
||||
tools: input.tools,
|
||||
}),
|
||||
outputFormat: { type: 'json_schema' as const, schema: jsonSchema(input.schema as z.ZodType) },
|
||||
outputFormat: { type: 'json_schema' as const, schema: input.jsonSchema },
|
||||
};
|
||||
const startedAt = Date.now();
|
||||
const result = await collectResultWithRateLimitRetry({
|
||||
return collectResultWithRateLimitRetry({
|
||||
query: this.runQuery,
|
||||
prompt: [input.system, input.prompt].filter(Boolean).join('\n\n'),
|
||||
options,
|
||||
allowedToolIds: new Set([...mcpToolIds(input.tools ?? {}), STRUCTURED_OUTPUT_TOOL_NAME]),
|
||||
expectedMcpServerNames: expectedMcpServerNames(input.tools),
|
||||
allowedToolIds: new Set([STRUCTURED_OUTPUT_TOOL_NAME]),
|
||||
expectedMcpServerNames: new Set(),
|
||||
rateLimitGovernor: this.deps.rateLimitGovernor,
|
||||
abortSignal: input.abortSignal,
|
||||
...(input.abortSignal ? { abortSignal: input.abortSignal } : {}),
|
||||
});
|
||||
}
|
||||
|
||||
async generateObject<TOutput, TSchema extends z.ZodType<TOutput>>(
|
||||
input: KtxGenerateObjectInput<TOutput, TSchema>,
|
||||
): Promise<TOutput> {
|
||||
const startedAt = Date.now();
|
||||
const result = await this.runStructuredQuery({
|
||||
role: input.role,
|
||||
prompt: input.prompt,
|
||||
...(input.system !== undefined ? { system: input.system } : {}),
|
||||
jsonSchema: jsonSchema(input.schema as z.ZodType),
|
||||
...(input.abortSignal ? { abortSignal: input.abortSignal } : {}),
|
||||
});
|
||||
input.onMetrics?.({ totalMs: Date.now() - startedAt, usage: claudeTokenUsage(result) });
|
||||
const error = resultError(result);
|
||||
|
|
@ -428,6 +448,28 @@ export class ClaudeCodeKtxLlmRuntime implements KtxLlmRuntimePort {
|
|||
return (input.schema as z.ZodType<TOutput>).parse(result.structured_output);
|
||||
}
|
||||
|
||||
async generateStructuredJson(input: KtxGenerateStructuredJsonInput): Promise<unknown> {
|
||||
const result = await this.runStructuredQuery({
|
||||
role: input.role,
|
||||
prompt: input.prompt,
|
||||
...(input.system !== undefined ? { system: input.system } : {}),
|
||||
jsonSchema: input.jsonSchema,
|
||||
...(input.abortSignal ? { abortSignal: input.abortSignal } : {}),
|
||||
});
|
||||
const error = resultError(result);
|
||||
if (error) {
|
||||
throw error;
|
||||
}
|
||||
if (result.subtype !== 'success') {
|
||||
throw new Error(`Claude Code query failed (${result.subtype})`);
|
||||
}
|
||||
return result.structured_output;
|
||||
}
|
||||
|
||||
subprocessForkSpec(): SubprocessRuntimeForkSpec {
|
||||
return { backend: 'claude-code', projectDir: this.deps.projectDir, modelSlots: this.deps.modelSlots };
|
||||
}
|
||||
|
||||
async runAgentLoop(params: RunLoopParams): Promise<RunLoopResult> {
|
||||
const startedAt = Date.now();
|
||||
try {
|
||||
|
|
|
|||
|
|
@ -9,14 +9,17 @@ import { resolveCodexModel } from './codex-models.js';
|
|||
import { buildCodexRuntimeConfig } from './codex-runtime-config.js';
|
||||
import { CodexSdkCliRunner, type CodexSdkRunner } from './codex-sdk-runner.js';
|
||||
import type { RateLimitGovernor } from './rate-limit-governor.js';
|
||||
import type { KtxModelRole } from '../../llm/types.js';
|
||||
import type {
|
||||
KtxGenerateObjectInput,
|
||||
KtxGenerateStructuredJsonInput,
|
||||
KtxGenerateTextInput,
|
||||
KtxLlmRuntimePort,
|
||||
KtxRuntimeToolSet,
|
||||
LlmTokenUsage,
|
||||
RunLoopParams,
|
||||
RunLoopResult,
|
||||
SubprocessRuntimeForkSpec,
|
||||
} from './runtime-port.js';
|
||||
|
||||
export interface CodexKtxLlmRuntimeDeps {
|
||||
|
|
@ -249,56 +252,78 @@ export class CodexKtxLlmRuntime implements KtxLlmRuntimePort {
|
|||
}
|
||||
}
|
||||
|
||||
// Structured generation has no tools, so it skips the MCP server that
|
||||
// generateText/runAgentLoop need; generateObject and generateStructuredJson
|
||||
// (the kill-boundary child path) share this one streaming implementation.
|
||||
private async streamStructuredText(input: {
|
||||
role: KtxModelRole;
|
||||
prompt: string;
|
||||
system?: string;
|
||||
jsonSchema: Record<string, unknown>;
|
||||
abortSignal?: AbortSignal;
|
||||
}): Promise<{ text: string; summary: CodexExecEventSummary; startedAt: number }> {
|
||||
const startedAt = Date.now();
|
||||
const model = modelForRole(this.deps.modelSlots, input.role);
|
||||
const config = buildCodexRuntimeConfig({ model });
|
||||
const result = await this.runWithRateLimitRetry(
|
||||
input.abortSignal,
|
||||
async () => {
|
||||
const collected = await collectEvents(
|
||||
await this.runner.runStreamed({
|
||||
projectDir: this.deps.projectDir,
|
||||
model,
|
||||
prompt: promptWithSystem(input.system, input.prompt),
|
||||
configOverrides: config.configOverrides,
|
||||
env: config.env,
|
||||
outputSchema: input.jsonSchema,
|
||||
...(input.abortSignal ? { signal: input.abortSignal } : {}),
|
||||
}),
|
||||
);
|
||||
const summary = summarizeCodexExecEvents(collected.events, { startedAt });
|
||||
return { collected, summary };
|
||||
},
|
||||
({ collected, summary }) => summaryError(summary, collected.streamError),
|
||||
);
|
||||
return {
|
||||
text: assertSuccessfulText(result.summary, result.collected.streamError),
|
||||
summary: result.summary,
|
||||
startedAt,
|
||||
};
|
||||
}
|
||||
|
||||
async generateObject<TOutput, TSchema extends z.ZodType<TOutput>>(
|
||||
input: KtxGenerateObjectInput<TOutput, TSchema>,
|
||||
): Promise<TOutput> {
|
||||
const startedAt = Date.now();
|
||||
const model = modelForRole(this.deps.modelSlots, input.role);
|
||||
const mcp = await mcpForTools({
|
||||
projectDir: this.deps.projectDir,
|
||||
toolSet: input.tools,
|
||||
startMcpServer: this.deps.startMcpServer,
|
||||
const { text, summary, startedAt } = await this.streamStructuredText({
|
||||
role: input.role,
|
||||
prompt: input.prompt,
|
||||
...(input.system !== undefined ? { system: input.system } : {}),
|
||||
jsonSchema: z.toJSONSchema(input.schema, { target: 'draft-7' }) as Record<string, unknown>,
|
||||
...(input.abortSignal ? { abortSignal: input.abortSignal } : {}),
|
||||
});
|
||||
input.onMetrics?.(metrics(summary, startedAt));
|
||||
return parseStructuredOutput(input.schema, text);
|
||||
}
|
||||
|
||||
async generateStructuredJson(input: KtxGenerateStructuredJsonInput): Promise<unknown> {
|
||||
const { text } = await this.streamStructuredText({
|
||||
role: input.role,
|
||||
prompt: input.prompt,
|
||||
...(input.system !== undefined ? { system: input.system } : {}),
|
||||
jsonSchema: input.jsonSchema,
|
||||
...(input.abortSignal ? { abortSignal: input.abortSignal } : {}),
|
||||
});
|
||||
try {
|
||||
const config = buildCodexRuntimeConfig({
|
||||
model,
|
||||
...(mcp
|
||||
? {
|
||||
mcp: {
|
||||
url: mcp.url,
|
||||
bearerTokenEnvVar: mcp.bearerTokenEnvVar,
|
||||
bearerToken: mcp.bearerToken,
|
||||
toolNames: runtimeToolNames(input.tools),
|
||||
},
|
||||
}
|
||||
: {}),
|
||||
});
|
||||
const result = await this.runWithRateLimitRetry(
|
||||
input.abortSignal,
|
||||
async () => {
|
||||
const collected = await collectEvents(
|
||||
await this.runner.runStreamed({
|
||||
projectDir: this.deps.projectDir,
|
||||
model,
|
||||
prompt: promptWithSystem(input.system, input.prompt),
|
||||
configOverrides: config.configOverrides,
|
||||
env: config.env,
|
||||
outputSchema: z.toJSONSchema(input.schema, { target: 'draft-7' }) as Record<string, unknown>,
|
||||
...(input.abortSignal ? { signal: input.abortSignal } : {}),
|
||||
}),
|
||||
);
|
||||
const summary = summarizeCodexExecEvents(collected.events, { startedAt });
|
||||
return { collected, summary };
|
||||
},
|
||||
({ collected, summary }) => summaryError(summary, collected.streamError),
|
||||
);
|
||||
input.onMetrics?.(metrics(result.summary, startedAt));
|
||||
return parseStructuredOutput(input.schema, assertSuccessfulText(result.summary, result.collected.streamError));
|
||||
} finally {
|
||||
await mcp?.close();
|
||||
return JSON.parse(text);
|
||||
} catch (error) {
|
||||
throw new Error(`Codex structured output is not valid JSON: ${error instanceof Error ? error.message : String(error)}`);
|
||||
}
|
||||
}
|
||||
|
||||
subprocessForkSpec(): SubprocessRuntimeForkSpec {
|
||||
return { backend: 'codex', projectDir: this.deps.projectDir, modelSlots: this.deps.modelSlots };
|
||||
}
|
||||
|
||||
async runAgentLoop(params: RunLoopParams): Promise<RunLoopResult> {
|
||||
const startedAt = Date.now();
|
||||
const model = modelForRole(this.deps.modelSlots, params.modelRole);
|
||||
|
|
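Note on the Codex `generateStructuredJson` path above: it returns the parsed value without Zod validation and rethrows parse failures under a stable message prefix. A minimal sketch of that wrapping, assuming a free-standing helper (`parseStructuredJson` is a hypothetical name, not a function in the diff):

```typescript
// Hypothetical helper mirroring the try/catch in generateStructuredJson.
// It only checks that the text is JSON; schema validation stays with the caller.
function parseStructuredJson(text: string): unknown {
  try {
    return JSON.parse(text);
  } catch (error) {
    const detail = error instanceof Error ? error.message : String(error);
    throw new Error(`Codex structured output is not valid JSON: ${detail}`);
  }
}

// Valid JSON passes through as an unknown value.
const parsedExample = parseStructuredJson('{"a":1}') as { a: number };

// Invalid JSON surfaces with the prefixed message.
let failureMessage = '';
try {
  parseStructuredJson('{not json');
} catch (error) {
  failureMessage = error instanceof Error ? error.message : String(error);
}
```

Keeping the return type `unknown` forces the caller to run the real schema before using the value.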
|
|||
|
|
@ -72,12 +72,38 @@ export interface KtxGenerateObjectInput<TOutput, TSchema extends z.ZodType<TOutp
|
|||
abortSignal?: AbortSignal;
|
||||
}
|
||||
|
||||
/** Structured generation keyed by a raw JSON Schema instead of a Zod schema, so
|
||||
* the request can cross a process boundary; the caller validates the returned
|
||||
* value against the real Zod schema. */
|
||||
export interface KtxGenerateStructuredJsonInput {
|
||||
role: KtxModelRole;
|
||||
prompt: string;
|
||||
system?: string;
|
||||
jsonSchema: Record<string, unknown>;
|
||||
abortSignal?: AbortSignal;
|
||||
}
|
||||
|
||||
/** Serializable recipe to rebuild a subprocess-backed runtime inside a ktx-owned
|
||||
* child the parent can tree-kill. Returned by {@link KtxLlmRuntimePort.subprocessForkSpec}. */
|
||||
export interface SubprocessRuntimeForkSpec {
|
||||
backend: 'codex' | 'claude-code';
|
||||
projectDir: string;
|
||||
modelSlots: { default: string } & Partial<Record<string, string>>;
|
||||
}
|
||||
|
||||
export interface KtxLlmRuntimePort {
|
||||
generateText(input: KtxGenerateTextInput): Promise<string>;
|
||||
generateObject<TOutput, TSchema extends z.ZodType<TOutput>>(
|
||||
input: KtxGenerateObjectInput<TOutput, TSchema>,
|
||||
): Promise<TOutput>;
|
||||
runAgentLoop(params: RunLoopParams): Promise<RunLoopResult>;
|
||||
/**
|
||||
* Non-null when this runtime drives an SDK-owned child process that ktx cannot
|
||||
* cancel by abort alone (codex/claude-code spawn a binary the SDK owns and only
|
||||
* SIGTERM on abort). ktx routes such calls through a tree-killable boundary.
|
||||
* Null for HTTP backends, whose native fetch abort already settles promptly.
|
||||
*/
|
||||
subprocessForkSpec(): SubprocessRuntimeForkSpec | null;
|
||||
}
|
||||
|
||||
export interface AgentRunnerPort {
|
||||
|
|
|
|||
|
|
@ -0,0 +1,39 @@
|
|||
import { ClaudeCodeKtxLlmRuntime } from './claude-code-runtime.js';
|
||||
import { CodexKtxLlmRuntime } from './codex-runtime.js';
|
||||
import type { SubprocessRuntimeForkSpec } from './runtime-port.js';
|
||||
import type { SubprocessGenerateObjectRequest, SubprocessGenerateObjectResponse } from './subprocess-generate-object.js';
|
||||
|
||||
// Forked by the parent as a process-group leader it can SIGKILL as a tree. Hosts
|
||||
// one structured LLM call for a subprocess-backed runtime (codex/claude-code);
|
||||
// the SDK spawns the model binary as this process's own child, so a parent
|
||||
// tree-kill reaps the wedged model too. Credentials flow via inherited env — the
|
||||
// runtimes re-derive their allowlisted env from process.env — never over IPC.
|
||||
|
||||
function buildRuntime(forkSpec: SubprocessRuntimeForkSpec): CodexKtxLlmRuntime | ClaudeCodeKtxLlmRuntime {
|
||||
if (forkSpec.backend === 'codex') {
|
||||
return new CodexKtxLlmRuntime({ projectDir: forkSpec.projectDir, modelSlots: forkSpec.modelSlots });
|
||||
}
|
||||
return new ClaudeCodeKtxLlmRuntime({ projectDir: forkSpec.projectDir, modelSlots: forkSpec.modelSlots });
|
||||
}
|
||||
|
||||
// The parent owns this process's lifecycle. If the parent dies its IPC channel
|
||||
// drops; exit rather than linger as an orphan holding a provider connection.
|
||||
process.once('disconnect', () => process.exit(0));
|
||||
|
||||
process.once('message', (request: SubprocessGenerateObjectRequest) => {
|
||||
void (async () => {
|
||||
let response: SubprocessGenerateObjectResponse;
|
||||
try {
|
||||
const output = await buildRuntime(request.forkSpec).generateStructuredJson({
|
||||
role: request.role,
|
||||
prompt: request.prompt,
|
||||
...(request.system !== undefined ? { system: request.system } : {}),
|
||||
jsonSchema: request.jsonSchema,
|
||||
});
|
||||
response = { ok: true, output };
|
||||
} catch (error) {
|
||||
response = { ok: false, message: error instanceof Error ? error.message : String(error) };
|
||||
}
|
||||
process.send?.(response, () => process.exit(0));
|
||||
})();
|
||||
});
|
||||
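Note on the child protocol above: the child replies with a two-case union, and the parent validates `output` itself because only a raw JSON Schema crossed the process boundary. A minimal sketch of consuming that union, assuming a hand-written validator in place of the Zod schema (`unwrapChildResponse` and `parseRowCount` are hypothetical names):

```typescript
// Same shape as SubprocessGenerateObjectResponse in the diff.
type ChildResponse = { ok: true; output: unknown } | { ok: false; message: string };

// Hypothetical helper: turn the union into a value or a thrown Error.
function unwrapChildResponse<T>(response: ChildResponse, validate: (value: unknown) => T): T {
  if (response.ok === false) {
    throw new Error(response.message);
  }
  return validate(response.output);
}

// Hypothetical validator standing in for `schema.parse` on the parent side.
function parseRowCount(value: unknown): { rowCount: number } {
  if (typeof value === 'object' && value !== null) {
    const rowCount = (value as { rowCount?: unknown }).rowCount;
    if (typeof rowCount === 'number') {
      return { rowCount };
    }
  }
  throw new Error('expected { rowCount: number }');
}

const okValue = unwrapChildResponse({ ok: true, output: { rowCount: 3 } }, parseRowCount);

let childFailure = '';
try {
  unwrapChildResponse({ ok: false, message: 'boom' }, parseRowCount);
} catch (error) {
  childFailure = error instanceof Error ? error.message : String(error);
}
```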
152
packages/cli/src/context/llm/subprocess-generate-object.ts
Normal file
|
|
@ -0,0 +1,152 @@
|
|||
import { fork, spawn, type ChildProcess } from 'node:child_process';
|
||||
import { existsSync } from 'node:fs';
|
||||
import { fileURLToPath } from 'node:url';
|
||||
import type { z } from 'zod';
|
||||
import type { KtxModelRole } from '../../llm/types.js';
|
||||
import { createAbortError } from '../core/abort.js';
|
||||
import type { SubprocessRuntimeForkSpec } from './runtime-port.js';
|
||||
|
||||
export interface SubprocessGenerateObjectRequest {
|
||||
forkSpec: SubprocessRuntimeForkSpec;
|
||||
role: KtxModelRole;
|
||||
prompt: string;
|
||||
system?: string;
|
||||
jsonSchema: Record<string, unknown>;
|
||||
}
|
||||
|
||||
export type SubprocessGenerateObjectResponse = { ok: true; output: unknown } | { ok: false; message: string };
|
||||
|
||||
// In dist, this file and the child are siblings; under vitest the compiled .js is
|
||||
// absent and Node strips types from the .ts. The real child imports the codex /
|
||||
// claude SDKs (which use constructor parameter properties), so it only runs as
|
||||
// built .js — tests inject a fake child via the spawnChild seam.
|
||||
function childUrl(): URL {
|
||||
const builtChild = new URL('./subprocess-generate-object-child.js', import.meta.url);
|
||||
return existsSync(fileURLToPath(builtChild)) ? builtChild : new URL('./subprocess-generate-object-child.ts', import.meta.url);
|
||||
}
|
||||
|
||||
function forkSubprocessGenerateObjectChild(): ChildProcess {
|
||||
// detached: the child becomes a process-group leader so the SDK's grandchild
|
||||
// (the codex/claude binary) inherits its group and a negative-pid SIGKILL reaps
|
||||
// the whole tree. Empty execArgv keeps it a clean Node process.
|
||||
return fork(childUrl(), {
|
||||
execArgv: [],
|
||||
serialization: 'advanced',
|
||||
detached: true,
|
||||
stdio: ['ignore', 'ignore', 'inherit', 'ipc'],
|
||||
});
|
||||
}
|
||||
|
||||
/** A per-table enrichment subprocess that did not return before its deadline. */
|
||||
export class KtxSubprocessDeadlineError extends Error {
|
||||
constructor(public readonly deadlineMs: number) {
|
||||
super(`enrichment subprocess exceeded ${Math.round(deadlineMs / 1000)}s`);
|
||||
this.name = 'KtxSubprocessDeadlineError';
|
||||
}
|
||||
}
|
||||
|
||||
// SIGTERM is too gentle for a child wedged on a hung provider socket; the SDK
|
||||
// grandchild ignores it and survives. Kill the whole tree: the detached process
|
||||
// group on POSIX, the process tree via taskkill /T on Windows.
|
||||
function killProcessTree(child: ChildProcess): void {
|
||||
if (child.pid === undefined) {
|
||||
return;
|
||||
}
|
||||
if (process.platform === 'win32') {
|
||||
spawn('taskkill', ['/pid', String(child.pid), '/T', '/F'], { stdio: 'ignore' }).on('error', () => undefined);
|
||||
return;
|
||||
}
|
||||
try {
|
||||
process.kill(-child.pid, 'SIGKILL');
|
||||
} catch {
|
||||
try {
|
||||
child.kill('SIGKILL');
|
||||
} catch {
|
||||
// Already exited.
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
export interface RunGenerateObjectInSubprocessInput<TOutput, TSchema extends z.ZodType<TOutput>> {
|
||||
forkSpec: SubprocessRuntimeForkSpec;
|
||||
role: KtxModelRole;
|
||||
prompt: string;
|
||||
system?: string;
|
||||
schema: TSchema;
|
||||
jsonSchema: Record<string, unknown>;
|
||||
deadlineMs: number;
|
||||
signal?: AbortSignal;
|
||||
/** @internal Test seam: spawn the child so tests can observe its lifecycle. */
|
||||
spawnChild?: () => ChildProcess;
|
||||
}
|
||||
|
||||
/**
|
||||
* Run one structured LLM call for a subprocess-backed runtime behind a boundary
|
||||
* ktx can hard-kill. On the deadline or an external abort, the whole process
|
||||
* group/tree is SIGKILLed (reaping the SDK's wedged model child) and the promise
|
||||
* settles promptly; on success the raw output is validated against the Zod schema.
|
||||
*/
|
||||
export function runGenerateObjectInSubprocess<TOutput, TSchema extends z.ZodType<TOutput>>(
|
||||
input: RunGenerateObjectInSubprocessInput<TOutput, TSchema>,
|
||||
): Promise<TOutput> {
|
||||
return new Promise<TOutput>((resolvePromise, rejectPromise) => {
|
||||
const child = (input.spawnChild ?? forkSubprocessGenerateObjectChild)();
|
||||
let settled = false;
|
||||
const onDeadline = () => settle(() => rejectPromise(new KtxSubprocessDeadlineError(input.deadlineMs)));
|
||||
const onAbort = () => settle(() => rejectPromise(createAbortError()));
|
||||
const timer = setTimeout(onDeadline, input.deadlineMs);
|
||||
function settle(finish: () => void): void {
|
||||
if (settled) {
|
||||
return;
|
||||
}
|
||||
settled = true;
|
||||
clearTimeout(timer);
|
||||
input.signal?.removeEventListener('abort', onAbort);
|
||||
if (child.exitCode === null && child.signalCode === null) {
|
||||
killProcessTree(child);
|
||||
}
|
||||
finish();
|
||||
}
|
||||
child.on('message', (message: SubprocessGenerateObjectResponse) => {
|
||||
if (message.ok) {
|
||||
let parsed: TOutput;
|
||||
try {
|
||||
parsed = input.schema.parse(message.output);
|
||||
} catch (error) {
|
||||
settle(() => rejectPromise(error instanceof Error ? error : new Error(String(error))));
|
||||
return;
|
||||
}
|
||||
settle(() => resolvePromise(parsed));
|
||||
} else {
|
||||
settle(() => rejectPromise(new Error(message.message)));
|
||||
}
|
||||
});
|
||||
child.on('error', (error) => settle(() => rejectPromise(error)));
|
||||
child.on('exit', (code, processSignal) => {
|
||||
if (!settled) {
|
||||
settle(() =>
|
||||
rejectPromise(
|
||||
new Error(`enrichment subprocess exited before returning a result (code ${code}, signal ${processSignal}).`),
|
||||
),
|
||||
);
|
||||
}
|
||||
});
|
||||
if (input.signal?.aborted) {
|
||||
onAbort();
|
||||
return;
|
||||
}
|
||||
input.signal?.addEventListener('abort', onAbort, { once: true });
|
||||
try {
|
||||
const request: SubprocessGenerateObjectRequest = {
|
||||
forkSpec: input.forkSpec,
|
||||
role: input.role,
|
||||
prompt: input.prompt,
|
||||
...(input.system !== undefined ? { system: input.system } : {}),
|
||||
jsonSchema: input.jsonSchema,
|
||||
};
|
||||
child.send(request);
|
||||
} catch (error) {
|
||||
settle(() => rejectPromise(error instanceof Error ? error : new Error(String(error))));
|
||||
}
|
||||
});
|
||||
}
|
||||
|
|
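Note on `runGenerateObjectInSubprocess` above: the message, error, exit, deadline, and abort paths all funnel through one `settle` guard, so cleanup runs once and later events are ignored. A minimal sketch of that guard in isolation, assuming a hypothetical factory `createSettleOnce` and a string log in place of the real timer and tree-kill:

```typescript
type Finish = () => void;

// Hypothetical factory: returns a settle function that runs cleanup and the
// first finisher exactly once, and reports whether the call won the race.
function createSettleOnce(cleanup: () => void): (finish: Finish) => boolean {
  let settled = false;
  return (finish: Finish): boolean => {
    if (settled) {
      return false; // a later event (e.g. 'exit' after 'message') is dropped
    }
    settled = true;
    cleanup(); // in the diff: clearTimeout, removeEventListener, killProcessTree
    finish();
    return true;
  };
}

const settleLog: string[] = [];
const settle = createSettleOnce(() => {
  settleLog.push('cleanup');
});
const firstWon = settle(() => {
  settleLog.push('resolved');
});
const secondWon = settle(() => {
  settleLog.push('deadline');
});
```

Running cleanup before the finisher matches the order in the diff, where the tree-kill happens before the promise settles.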
@ -11,6 +11,7 @@ import {
|
|||
} from '../../telemetry/index.js';
|
||||
import { collectTelemetryRedactionSecrets } from '../../telemetry/redaction-secrets.js';
|
||||
import { formatErrorDetail, scrubErrorClass } from '../../telemetry/scrubber.js';
|
||||
import { mcpSlowToolMs, serializeMcpError, type KtxMcpLogger } from './logger.js';
|
||||
import type {
|
||||
KtxMcpClientInfo,
|
||||
KtxMcpContextPorts,
|
||||
|
|
@ -29,6 +30,7 @@ export interface RegisterKtxContextToolsDeps {
|
|||
userContext: KtxMcpUserContext;
|
||||
projectDir?: string;
|
||||
io?: KtxCliIo;
|
||||
logger?: KtxMcpLogger;
|
||||
getClientInfo?: () => KtxMcpClientInfo | undefined;
|
||||
}
|
||||
|
||||
|
|
@ -50,6 +52,7 @@ const toolAnnotations = {
|
|||
sl_read_source: { title: 'Semantic Layer Read Source', readOnlyHint: true, idempotentHint: true, openWorldHint: false },
|
||||
sl_query: { title: 'Semantic Layer Query', readOnlyHint: true, openWorldHint: false },
|
||||
sql_execution: { title: 'SQL Execution', readOnlyHint: true, openWorldHint: false },
|
||||
sql_dialect_notes: { title: 'SQL Dialect Notes', readOnlyHint: true, idempotentHint: true, openWorldHint: false },
|
||||
memory_ingest: { title: 'Memory Ingest', destructiveHint: true, openWorldHint: false },
|
||||
memory_ingest_status: { title: 'Memory Ingest Status', readOnlyHint: true, openWorldHint: false },
|
||||
} satisfies Record<string, ToolAnnotations>;
|
||||
|
|
@ -60,7 +63,7 @@ const toolDescriptions = {
|
|||
discover_data:
|
||||
'Search across ktx wiki pages, semantic-layer sources, measures, dimensions, raw tables, and columns. Example: discover_data({ query: "monthly orders by customer", connectionId: "warehouse", kinds: ["sl_source", "table"] }).',
|
||||
wiki_search:
|
||||
'Search ktx wiki pages for reusable business context. Example: wiki_search({ query: "revenue recognition", limit: 5 }).',
|
||||
'Search ktx wiki pages for reusable business context. Pass connectionId to scope results to one warehouse (unscoped pages plus pages tagged with that connection) when a concept name collides across databases. Example: wiki_search({ query: "revenue recognition", connectionId: "warehouse", limit: 5 }).',
|
||||
wiki_read: 'Read a ktx wiki page by key returned from wiki_search. Example: wiki_read({ key: "global/revenue" }).',
|
||||
entity_details:
|
||||
'Read table and column metadata from the latest live-database scan snapshot. Example: entity_details({ connectionId: "warehouse", entities: [{ table: { catalog: null, db: "public", name: "orders" }, columns: ["id"] }] }).',
|
||||
|
|
@ -72,6 +75,8 @@ const toolDescriptions = {
|
|||
'Execute a semantic-layer query and return headers, rows, and total row count, plus correctness notes (e.g. compile-only or fan-out) when relevant. The generated SQL and full query plan are omitted by default; request them with include: ["sql"] and/or include: ["plan"]. Example: sl_query({ connectionId: "warehouse", measures: ["orders.order_count"], dimensions: [{ field: "orders.created_at", granularity: "month" }], include: ["sql"] }).',
|
||||
sql_execution:
|
||||
'Execute one parser-validated read-only SQL query against a configured ktx connection. Example: sql_execution({ connectionId: "warehouse", sql: "select count(*) from public.orders", maxRows: 100 }).',
|
||||
sql_dialect_notes:
|
||||
'Return the SQL syntax conventions for the dialect of a ktx connection: fully-qualified table-name form, identifier quoting and case-folding, date/time functions, top-N / window-filtering idiom, and JSON access. Call this before writing raw sql_execution SQL against a connection so the SQL matches that engine. Example: sql_dialect_notes({ connectionId: "warehouse" }).',
|
||||
memory_ingest:
|
||||
'Ingest free-form markdown knowledge into durable ktx memory. Use this for business rules, metric definitions, schema gotchas, recurring findings, or explicit user requests to remember something. Example: memory_ingest({ connectionId: "warehouse", content: "ARR is reported in cents in this warehouse." }).',
|
||||
memory_ingest_status:
|
||||
|
|
@ -83,6 +88,11 @@ const connectionListSchema = z.object({});
|
|||
const knowledgeSearchSchema = z.object({
|
||||
query: z.string().min(1).describe('Natural-language wiki search query, e.g. "revenue recognition policy".'),
|
||||
limit: z.number().int().min(1).max(50).default(10).describe('Maximum wiki pages to return.'),
|
||||
connectionId: connectionIdSchema
|
||||
.optional()
|
||||
.describe(
|
||||
'Scope results to one connection: returns unscoped pages plus pages tagged with this connection. Omit to search all pages.',
|
||||
),
|
||||
});
|
||||
|
||||
const knowledgeReadSchema = z.object({
|
||||
|
|
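Note on the `connectionId` parameter above: the commit message describes scoped search as unscoped pages plus pages tagged with the requested connection, with an absent or empty `connections` field meaning unscoped. A minimal sketch of that rule, assuming a simplified page shape (`WikiPageStub` and `pagesInScope` are hypothetical names, not the ktx loader):

```typescript
// Hypothetical, simplified page record; real pages carry more fields.
interface WikiPageStub {
  key: string;
  connections?: string[];
}

// No connectionId: search everything. Otherwise keep unscoped pages and
// pages whose connections list contains the requested id.
function pagesInScope(pages: WikiPageStub[], connectionId?: string): WikiPageStub[] {
  if (connectionId === undefined) {
    return pages;
  }
  return pages.filter((page) => {
    const scopes = page.connections;
    if (scopes === undefined || scopes.length === 0) {
      return true; // absent or empty means unscoped
    }
    return scopes.indexOf(connectionId) !== -1;
  });
}

const stubPages: WikiPageStub[] = [
  { key: 'global/revenue' },
  { key: 'wh/revenue', connections: ['warehouse'] },
  { key: 'crm/revenue', connections: ['crm'] },
];
const warehouseKeys = pagesInScope(stubPages, 'warehouse').map((page) => page.key);
const allKeys = pagesInScope(stubPages).map((page) => page.key);
```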
@ -203,6 +213,10 @@ const sqlExecutionSchema = z.object({
|
|||
maxRows: z.number().int().min(1).max(10_000).default(1000).optional().describe('Maximum rows to return.'),
|
||||
});
|
||||
|
||||
const sqlDialectNotesSchema = z.object({
|
||||
connectionId: connectionIdSchema.describe('Connection id whose engine dialect conventions to return.'),
|
||||
});
|
||||
|
||||
const memoryIngestSchema = z.object({
|
||||
content: z
|
||||
.string()
|
||||
|
|
@ -405,6 +419,12 @@ const sqlExecutionOutputSchema = z.object({
|
|||
rowCount: z.number(),
|
||||
});
|
||||
|
||||
const sqlDialectNotesOutputSchema = z.object({
|
||||
connectionId: z.string(),
|
||||
dialect: z.string(),
|
||||
notes: z.string(),
|
||||
});
|
||||
|
||||
const memoryIngestOutputSchema = z.object({
|
||||
runId: z.string(),
|
||||
});
|
||||
|
|
@ -566,6 +586,63 @@ function clientTelemetryFields(
|
|||
};
|
||||
}
|
||||
|
||||
function toolResultIsError(result: unknown): boolean {
|
||||
return (
|
||||
typeof result === 'object' && result !== null && 'isError' in result && (result as { isError?: unknown }).isError === true
|
||||
);
|
||||
}
|
||||
|
||||
/** Tool-agnostic size: byte length of the serialized text content the client reads. */
|
||||
function toolResultSize(result: unknown): number {
|
||||
if (typeof result !== 'object' || result === null || !('content' in result)) {
|
||||
return 0;
|
||||
}
|
||||
const content = (result as { content?: unknown }).content;
|
||||
if (!Array.isArray(content)) {
|
||||
return 0;
|
||||
}
|
||||
let size = 0;
|
||||
for (const item of content) {
|
||||
if (item && typeof item === 'object' && (item as { type?: unknown }).type === 'text') {
|
||||
const text = (item as { text?: unknown }).text;
|
||||
if (typeof text === 'string') {
|
||||
size += Buffer.byteLength(text, 'utf8');
|
||||
}
|
||||
}
|
||||
}
|
||||
return size;
|
||||
}
|
||||
|
||||
function toolResultErrorText(result: unknown): string {
|
||||
if (typeof result === 'object' && result !== null && 'content' in result) {
|
||||
const content = (result as { content?: unknown }).content;
|
||||
if (Array.isArray(content)) {
|
||||
const text = content
|
||||
.filter(
|
||||
(item): item is { type: 'text'; text: string } =>
|
||||
!!item &&
|
||||
typeof item === 'object' &&
|
||||
(item as { type?: unknown }).type === 'text' &&
|
||||
typeof (item as { text?: unknown }).text === 'string',
|
||||
)
|
||||
.map((item) => item.text)
|
||||
.join('\n');
|
||||
if (text.length > 0) {
|
||||
return text;
|
||||
}
|
||||
}
|
||||
}
|
||||
return 'Tool returned an error result.';
|
||||
}
|
||||
|
||||
interface InstrumentMcpServerDeps {
|
||||
projectDir?: string;
|
||||
io?: KtxCliIo;
|
||||
logger?: KtxMcpLogger;
|
||||
slowToolMs: number;
|
||||
getClientInfo?: () => KtxMcpClientInfo | undefined;
|
||||
}
|
||||
|
||||
// Tools registered via registerParsedTool catch their own errors and return an
|
||||
// isError result, so the telemetry layer never sees the thrown Error. Recover
|
||||
// the failure message from the result's text content (the same string the agent
|
||||
|
|
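Note on `toolResultSize` above: it counts UTF-8 bytes of the text items only, so non-text content contributes nothing and multi-byte characters count for more than one. A minimal sketch of the same measurement, assuming a hand-rolled byte counter in place of `Buffer.byteLength` so the example carries no Node dependency (`utf8ByteLength` and `textContentBytes` are hypothetical names):

```typescript
// Hypothetical stand-in for Buffer.byteLength(text, 'utf8').
function utf8ByteLength(text: string): number {
  let bytes = 0;
  for (let i = 0; i < text.length; i += 1) {
    const code = text.charCodeAt(i);
    if (code < 0x80) {
      bytes += 1;
    } else if (code < 0x800) {
      bytes += 2;
    } else if (code >= 0xd800 && code <= 0xdbff) {
      const next = i + 1 < text.length ? text.charCodeAt(i + 1) : 0;
      if (next >= 0xdc00 && next <= 0xdfff) {
        bytes += 4; // valid surrogate pair encodes as one 4-byte sequence
        i += 1;
      } else {
        bytes += 3; // lone surrogate is replaced by U+FFFD (3 bytes)
      }
    } else {
      bytes += 3;
    }
  }
  return bytes;
}

// Sums the byte length of every { type: 'text', text: string } item.
function textContentBytes(result: unknown): number {
  if (typeof result !== 'object' || result === null) {
    return 0;
  }
  const content = (result as { content?: unknown }).content;
  if (!Array.isArray(content)) {
    return 0;
  }
  let size = 0;
  for (const item of content as unknown[]) {
    if (typeof item === 'object' && item !== null && (item as { type?: unknown }).type === 'text') {
      const text = (item as { text?: unknown }).text;
      if (typeof text === 'string') {
        size += utf8ByteLength(text);
      }
    }
  }
  return size;
}

const sampleSize = textContentBytes({
  content: [
    { type: 'text', text: 'héllo' },
    { type: 'image', data: 'ignored' },
    { type: 'text', text: 'ok' },
  ],
});
```

Here `héllo` is five characters but six bytes, which is why the logged `resultSize` can exceed the character count.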
@ -588,68 +665,91 @@ function mcpErrorResultDetail(result: unknown): string | undefined {
|
|||
return formatErrorDetail(text);
|
||||
}
|
||||
|
||||
function instrumentMcpServer(
|
||||
server: KtxMcpServerLike,
|
||||
telemetry: { projectDir?: string; io?: KtxCliIo; getClientInfo?: () => KtxMcpClientInfo | undefined },
|
||||
): KtxMcpServerLike {
|
||||
function instrumentMcpServer(server: KtxMcpServerLike, deps: InstrumentMcpServerDeps): KtxMcpServerLike {
|
||||
return {
|
||||
registerTool(name, config, handler) {
|
||||
server.registerTool(name, config, async (input, context) => {
|
||||
const callId = randomUUID();
|
||||
const callLogger = deps.logger?.child({
|
||||
tool: name,
|
||||
callId,
|
||||
...(context?.sessionId ? { sessionId: context.sessionId } : {}),
|
||||
});
|
||||
const startedAt = performance.now();
|
||||
// Synchronous, before the (possibly blocking) handler: a runaway query that never
|
||||
// returns still leaves this start line — with its exact params — on disk.
|
||||
callLogger?.info({ params: input }, 'tool.start');
|
||||
try {
|
||||
const result = await handler(input, context);
|
||||
if (telemetry.io && telemetry.projectDir && shouldEmitMcpTelemetry()) {
|
||||
const isError =
|
||||
typeof result === 'object' && result !== null && 'isError' in result && result.isError === true;
|
||||
const durationMs = Math.max(0, performance.now() - startedAt);
|
||||
const isError = toolResultIsError(result);
|
||||
if (deps.io && deps.projectDir && shouldEmitMcpTelemetry()) {
|
||||
const errorDetail = isError ? mcpErrorResultDetail(result) : undefined;
|
||||
await emitTelemetryEvent({
|
||||
name: 'mcp_request_completed',
|
||||
projectDir: telemetry.projectDir,
|
||||
io: telemetry.io,
|
||||
projectDir: deps.projectDir,
|
||||
io: deps.io,
|
||||
fields: {
|
||||
toolName: name,
|
||||
outcome: isError ? 'error' : 'ok',
|
||||
durationMs: Math.max(0, performance.now() - startedAt),
|
||||
durationMs,
|
||||
sampleRate: mcpTelemetrySampleRate(),
|
||||
...(errorDetail ? { errorDetail } : {}),
|
||||
...clientTelemetryFields(telemetry.getClientInfo),
|
||||
...clientTelemetryFields(deps.getClientInfo),
|
||||
},
|
||||
});
|
||||
}
|
||||
if (callLogger) {
|
||||
if (isError) {
|
||||
callLogger.error(
|
||||
{ durationMs, outcome: 'error', err: serializeMcpError(toolResultErrorText(result)) },
|
||||
'tool.end',
|
||||
);
|
||||
} else {
|
||||
const fields = { durationMs, outcome: 'ok' as const, resultSize: toolResultSize(result) };
|
||||
if (durationMs > deps.slowToolMs) {
|
||||
callLogger.warn(fields, 'tool.end');
|
||||
} else {
|
||||
callLogger.info(fields, 'tool.end');
|
||||
}
|
||||
}
|
||||
}
|
||||
return result;
|
||||
} catch (error) {
|
||||
if (telemetry.io) {
|
||||
const durationMs = Math.max(0, performance.now() - startedAt);
|
||||
if (deps.io) {
|
||||
await reportException({
|
||||
error,
|
||||
context: { source: `mcp:${name}`, handled: true, fatal: false },
|
||||
projectDir: telemetry.projectDir,
|
||||
io: telemetry.io,
|
||||
projectDir: deps.projectDir,
|
||||
io: deps.io,
|
||||
redactionSecrets: await collectTelemetryRedactionSecrets({
|
||||
projectDir: telemetry.projectDir,
|
||||
projectDir: deps.projectDir,
|
||||
includeLlm: true,
|
||||
includeEmbeddings: true,
|
||||
env: process.env,
|
||||
}),
|
||||
});
|
||||
}
|
||||
if (telemetry.io && telemetry.projectDir && shouldEmitMcpTelemetry()) {
|
||||
if (deps.io && deps.projectDir && shouldEmitMcpTelemetry()) {
|
||||
const errorClass = scrubErrorClass(error);
|
||||
const errorDetail = formatErrorDetail(error);
|
||||
await emitTelemetryEvent({
|
||||
name: 'mcp_request_completed',
|
||||
projectDir: telemetry.projectDir,
|
||||
io: telemetry.io,
|
||||
projectDir: deps.projectDir,
|
||||
io: deps.io,
|
||||
fields: {
|
||||
toolName: name,
|
||||
outcome: 'error',
|
||||
...(errorClass ? { errorClass } : {}),
|
||||
...(errorDetail ? { errorDetail } : {}),
|
||||
durationMs: Math.max(0, performance.now() - startedAt),
|
||||
durationMs,
|
||||
sampleRate: mcpTelemetrySampleRate(),
|
||||
...clientTelemetryFields(telemetry.getClientInfo),
|
||||
...clientTelemetryFields(deps.getClientInfo),
|
||||
},
|
||||
});
|
||||
}
|
||||
callLogger?.error({ durationMs, outcome: 'error', err: serializeMcpError(error) }, 'tool.end');
|
||||
throw error;
|
||||
}
|
||||
});
|
||||
|
|
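Note on the `tool.end` logging above: an error result logs at error level, a successful call slower than `slowToolMs` logs at warn, and everything else logs at info. A minimal sketch of that decision as a pure function (`toolEndLevel` is a hypothetical name; the diff inlines this logic):

```typescript
type ToolEndLevel = 'error' | 'warn' | 'info';

// Mirrors the branch order in instrumentMcpServer: errors first, then the
// strict "slower than threshold" check.
function toolEndLevel(isError: boolean, durationMs: number, slowToolMs: number): ToolEndLevel {
  if (isError) {
    return 'error';
  }
  return durationMs > slowToolMs ? 'warn' : 'info';
}

const slowCall = toolEndLevel(false, 5000, 1000);
const atThreshold = toolEndLevel(false, 1000, 1000);
const failedCall = toolEndLevel(true, 10, 1000);
```

The comparison is strict, so a call that takes exactly `slowToolMs` still logs at info.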
@ -663,6 +763,8 @@ export function registerKtxContextTools(deps: RegisterKtxContextToolsDeps): void
|
|||
const server = instrumentMcpServer(deps.server, {
|
||||
projectDir: deps.projectDir,
|
||||
io: deps.io,
|
||||
logger: deps.logger,
|
||||
slowToolMs: mcpSlowToolMs(),
|
||||
getClientInfo: deps.getClientInfo,
|
||||
});
|
||||
|
||||
|
|
@ -703,6 +805,7 @@ export function registerKtxContextTools(deps: RegisterKtxContextToolsDeps): void
|
|||
userId: userContext.userId,
|
||||
query: input.query,
|
||||
limit: input.limit,
|
||||
...(input.connectionId !== undefined ? { connectionId: input.connectionId } : {}),
|
||||
}),
|
||||
),
|
||||
toolTelemetry,
|
||||
|
|
@ -867,6 +970,24 @@ export function registerKtxContextTools(deps: RegisterKtxContextToolsDeps): void
|
|||
);
|
||||
}
|
||||
|
||||
if (ports.dialectNotes) {
|
||||
const dialectNotes = ports.dialectNotes;
|
||||
registerParsedTool(
|
||||
server,
|
||||
'sql_dialect_notes',
|
||||
{
|
||||
title: toolAnnotations.sql_dialect_notes.title!,
|
||||
description: toolDescriptions.sql_dialect_notes,
|
||||
inputSchema: sqlDialectNotesSchema.shape,
|
||||
outputSchema: sqlDialectNotesOutputSchema,
|
||||
annotations: toolAnnotations.sql_dialect_notes,
|
||||
},
|
||||
sqlDialectNotesSchema,
|
||||
async (input) => jsonToolResult(await dialectNotes.read(input)),
|
||||
toolTelemetry,
|
||||
);
|
||||
}
|
||||
|
||||
if (ports.memoryIngest) {
|
||||
const memoryIngest = ports.memoryIngest;
|
||||
registerParsedTool(
|
||||
|
|
|
|||
|
|
@ -1,5 +1,8 @@
|
|||
import type { KtxSqlQueryExecutorPort } from '../../context/connections/query-executor.js';
|
||||
import { KtxExpectedError, KtxQueryError, isNativeProgrammingFault } from '../../errors.js';
|
||||
import { isDatabaseDriver, normalizeConnectionDriver } from '../../connection-drivers.js';
|
||||
import { sqlDialectNotes } from '../../context/sql-analysis/dialect-notes.js';
|
||||
import type { KtxProjectConnectionConfig } from '../../context/project/config.js';
|
||||
import { executeProjectReadOnlySql } from '../../context/connections/project-sql-executor.js';
|
||||
import { FEDERATED_CONNECTION_ID, federatedConnectionListing } from '../../context/connections/federation.js';
|
||||
import { assertSqlQueryableConnection } from '../../context/connections/dialects.js';
|
||||
|
|
@ -20,6 +23,7 @@ import { compileLocalSlQuery } from '../../context/sl/local-query.js';
|
|||
import { createKtxDictionarySearchService } from '../../context/sl/dictionary-search.js';
|
||||
import { readLocalSlSource } from '../../context/sl/local-sl.js';
|
||||
import { assertSafeConnectionId } from '../../context/sl/source-files.js';
|
||||
import { assertConfiguredConnectionId } from '../../context/connections/configured-connections.js';
|
||||
import { readLocalKnowledgePage, searchLocalKnowledgePages } from '../wiki/local-knowledge.js';
|
||||
import type { KtxMcpContextPorts, KtxMcpProgressCallback, KtxSqlExecutionResponse } from './types.js';
|
||||
|
||||
|
|
@ -94,6 +98,24 @@ async function executeValidatedReadOnlySql(
|
|||
return response;
|
||||
}
|
||||
|
||||
/** @internal Resolves a connection's dialect SQL notes; throws KtxExpectedError for an unknown or non-SQL-warehouse connection. */
|
||||
export function resolveDialectNotesForConnection(
|
||||
connectionId: string,
|
||||
connection: KtxProjectConnectionConfig | undefined,
|
||||
): { connectionId: string; dialect: string; notes: string } {
|
||||
if (!connection) {
|
||||
throw new KtxExpectedError(`Connection "${connectionId}" is not configured in ktx.yaml`);
|
||||
}
|
||||
const driver = normalizeConnectionDriver(connection);
|
||||
if (!isDatabaseDriver(driver)) {
|
||||
throw new KtxExpectedError(
|
||||
`Connection "${connectionId}" uses the "${driver}" context source, not a SQL warehouse; sql_dialect_notes applies only to SQL database connections.`,
|
||||
);
|
||||
}
|
||||
const dialect = sqlAnalysisDialectForDriver(driver);
|
||||
return { connectionId, dialect, notes: sqlDialectNotes(dialect) };
|
||||
}
|
||||
|
||||
export function createLocalProjectMcpContextPorts(
|
||||
project: KtxLocalProject,
|
||||
options: CreateLocalProjectMcpContextPortsOptions,
|
||||
|
|
@ -121,11 +143,16 @@ export function createLocalProjectMcpContextPorts(
|
|||
},
|
||||
knowledge: {
|
||||
async search(input) {
|
||||
const connectionId =
|
||||
input.connectionId === undefined
|
||||
? undefined
|
||||
: assertConfiguredConnectionId(project.config.connections, input.connectionId);
|
||||
const results = await searchLocalKnowledgePages(project, {
|
||||
query: input.query,
|
||||
userId: input.userId,
|
||||
limit: input.limit,
|
||||
embeddingService,
|
||||
...(connectionId !== undefined ? { connectionId } : {}),
|
||||
});
|
||||
return {
|
||||
results: results.slice(0, input.limit).map((result) => ({
|
||||
|
|
@ -196,6 +223,12 @@ export function createLocalProjectMcpContextPorts(
|
|||
return createKtxDiscoverDataService(project, { userId: 'local', embeddingService }).search(input);
|
||||
},
|
||||
},
|
||||
dialectNotes: {
|
||||
async read(input) {
|
||||
const connectionId = assertSafeConnectionId(input.connectionId);
|
||||
return resolveDialectNotesForConnection(connectionId, project.config.connections[connectionId]);
|
||||
},
|
||||
},
|
||||
};
|
||||
|
||||
if (options.sqlAnalysis && options.localScan?.createConnector) {
|
||||
|
|
|
|||
58
packages/cli/src/context/mcp/logger.ts
Normal file
|
|
@ -0,0 +1,58 @@
|
|||
import { Writable } from 'node:stream';
|
||||
import pino, { type DestinationStream, type Logger } from 'pino';
|
||||
import PinoPretty from 'pino-pretty';
|
||||
import type { KtxCliIo } from '../../cli-runtime.js';
|
||||
|
||||
export type KtxMcpLogger = Logger;
|
||||
|
||||
const LOG_LEVELS = new Set(['trace', 'debug', 'info', 'warn', 'error', 'fatal', 'silent']);
|
||||
|
||||
const DEFAULT_LEVEL = 'info';
|
||||
const DEFAULT_SLOW_TOOL_MS = 10_000;
|
||||
|
||||
/** @internal */
|
||||
export function mcpLogLevel(env: NodeJS.ProcessEnv = process.env): string {
|
||||
const raw = env.KTX_MCP_LOG_LEVEL?.trim().toLowerCase();
|
||||
return raw && LOG_LEVELS.has(raw) ? raw : DEFAULT_LEVEL;
|
||||
}
|
||||
|
||||
/** @internal */
|
||||
export function mcpSlowToolMs(env: NodeJS.ProcessEnv = process.env): number {
|
||||
const raw = Number(env.KTX_MCP_SLOW_TOOL_MS);
|
||||
return Number.isFinite(raw) && raw >= 0 ? raw : DEFAULT_SLOW_TOOL_MS;
|
||||
}
|
||||
|
||||
/**
|
||||
* Serialize an error for a structured `err` field. Genuine `Error`s get pino's
|
||||
* standard serializer (type + message + stack); everything else is reduced to a
|
||||
* message — the in-band tool-error path has already lost the original stack.
|
||||
*/
|
||||
export function serializeMcpError(error: unknown): Record<string, unknown> {
|
||||
if (error instanceof Error) {
|
||||
return { ...pino.stdSerializers.err(error) };
|
||||
}
|
||||
return { message: typeof error === 'string' ? error : String(error) };
|
||||
}
|
||||
|
||||
/**
|
||||
* One synchronous pino logger per MCP server process, written to the `io.stderr`
|
||||
* sink. stderr is the only universally-correct sink: the stdio transport reserves
|
||||
* stdout for JSON-RPC, and the HTTP daemon redirects stderr into `.ktx/logs/mcp.log`.
|
||||
* Synchronous writes are load-bearing — a `tool.start` line must reach the fd before
|
||||
* a blocking handler runs, so a runaway query still leaves its start record on disk.
|
||||
* Format follows the terminal, not a flag: pretty for a TTY, plain JSON otherwise.
|
||||
*/
|
||||
export function createMcpLogger(io: KtxCliIo, options: { isTTY?: boolean } = {}): KtxMcpLogger {
|
||||
const level = mcpLogLevel();
|
||||
const isTTY = options.isTTY ?? process.stderr.isTTY === true;
|
||||
if (isTTY) {
|
||||
const sink = new Writable({
|
||||
write(chunk: Buffer | string, _encoding, callback) {
|
||||
io.stderr.write(typeof chunk === 'string' ? chunk : chunk.toString('utf8'));
|
||||
callback();
|
||||
},
|
||||
});
|
||||
return pino({ level }, PinoPretty({ colorize: true, sync: true, destination: sink }));
|
||||
}
|
||||
return pino({ level }, io.stderr as DestinationStream);
|
||||
}
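The two environment readers in `logger.ts` above can be exercised in isolation. The following is a minimal sketch, not the project's code: the `Env` type is a stand-in for `NodeJS.ProcessEnv` so the snippet runs without Node typings, and the functions take `env` as a required argument instead of defaulting to `process.env`. It shows how invalid values fall back to the defaults.

```typescript
// Stand-in for NodeJS.ProcessEnv (assumption, keeps the sketch dependency-free).
type Env = Record<string, string | undefined>;

const LOG_LEVELS = new Set(['trace', 'debug', 'info', 'warn', 'error', 'fatal', 'silent']);
const DEFAULT_LEVEL = 'info';
const DEFAULT_SLOW_TOOL_MS = 10_000;

// Trim and lowercase the requested level; anything unrecognized becomes the default.
function mcpLogLevel(env: Env): string {
  const raw = env.KTX_MCP_LOG_LEVEL?.trim().toLowerCase();
  return raw && LOG_LEVELS.has(raw) ? raw : DEFAULT_LEVEL;
}

// Accept any finite, non-negative number; everything else becomes the default.
// Note: Number('') is 0, so an empty-string variable yields a threshold of 0.
function mcpSlowToolMs(env: Env): number {
  const raw = Number(env.KTX_MCP_SLOW_TOOL_MS);
  return Number.isFinite(raw) && raw >= 0 ? raw : DEFAULT_SLOW_TOOL_MS;
}

console.log(mcpLogLevel({ KTX_MCP_LOG_LEVEL: ' DEBUG ' })); // normalized level
console.log(mcpLogLevel({ KTX_MCP_LOG_LEVEL: 'verbose' })); // unknown level, falls back
console.log(mcpSlowToolMs({ KTX_MCP_SLOW_TOOL_MS: '250' })); // explicit threshold
console.log(mcpSlowToolMs({})); // unset, falls back
```

Passing the environment explicitly mirrors the `env` parameter in the diff, which is what makes these readers testable without mutating `process.env`.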
|
||||
|
|
@ -11,6 +11,7 @@ export function createKtxMcpServer(deps: KtxMcpServerDeps): KtxMcpServerDeps['se
|
|||
userContext: deps.userContext,
|
||||
projectDir: deps.projectDir,
|
||||
io: deps.io,
|
||||
logger: deps.logger,
|
||||
getClientInfo: deps.getClientInfo,
|
||||
});
|
||||
}
|
||||
|
|
@ -31,6 +32,7 @@ export function createDefaultKtxMcpServer(
|
|||
contextTools: deps.contextTools,
|
||||
projectDir: deps.projectDir,
|
||||
io: deps.io,
|
||||
logger: deps.logger,
|
||||
// The SDK populates the client identity after the initialize handshake, so
|
||||
// read it lazily at emit time rather than at registration (undefined here).
|
||||
getClientInfo: () => server.server.getClientVersion(),
|
||||
|
|
|
|||
|
|
@ -1,5 +1,6 @@
|
|||
import type { MemoryIngestService } from '../../context/memory/memory-runs.js';
|
||||
import type { KtxCliIo } from '../../cli-runtime.js';
|
||||
import type { KtxMcpLogger } from './logger.js';
|
||||
import type { KtxEntityDetailsInput, KtxEntityDetailsResponse } from '../scan/entity-details.js';
|
||||
import type { KtxDiscoverDataInput, KtxDiscoverDataResponse } from '../../context/search/discover.js';
|
||||
import type { KtxDictionarySearchInput, KtxDictionarySearchResponse } from '../../context/sl/dictionary-search.js';
|
||||
|
|
@ -28,6 +29,8 @@ interface KtxMcpProgressEvent {
|
|||
export type KtxMcpProgressCallback = (event: KtxMcpProgressEvent) => void | Promise<void>;
|
||||
|
||||
export interface KtxMcpToolHandlerContext {
|
||||
/** Present for the HTTP StreamableHTTP transport (one per session); absent for stdio. */
|
||||
sessionId?: string;
|
||||
_meta?: { progressToken?: string | number; [key: string]: unknown };
|
||||
sendNotification?: (notification: {
|
||||
method: 'notifications/progress';
|
||||
|
|
@ -113,7 +116,12 @@ interface KtxKnowledgePage {
|
|||
|
||||
/** @internal */
|
||||
export interface KtxKnowledgeMcpPort {
|
||||
search(input: { userId: string; query: string; limit: number }): Promise<KtxKnowledgeSearchResponse>;
|
||||
search(input: {
|
||||
userId: string;
|
||||
query: string;
|
||||
limit: number;
|
||||
connectionId?: string;
|
||||
}): Promise<KtxKnowledgeSearchResponse>;
|
||||
read(input: { userId: string; key: string }): Promise<KtxKnowledgePage | null>;
|
||||
}
|
||||
|
||||
|
|
@ -172,6 +180,11 @@ export interface KtxSqlExecutionMcpPort {
|
|||
): Promise<KtxSqlExecutionResponse>;
|
||||
}
|
||||
|
||||
/** @internal */
|
||||
export interface KtxDialectNotesMcpPort {
|
||||
read(input: { connectionId: string }): Promise<{ connectionId: string; dialect: string; notes: string }>;
|
||||
}
|
||||
|
||||
export interface KtxMcpContextPorts {
|
||||
connections?: KtxConnectionsMcpPort;
|
||||
knowledge?: KtxKnowledgeMcpPort;
|
||||
|
|
@ -180,6 +193,7 @@ export interface KtxMcpContextPorts {
|
|||
dictionarySearch?: KtxDictionarySearchMcpPort;
|
||||
discover?: KtxDiscoverDataMcpPort;
|
||||
sqlExecution?: KtxSqlExecutionMcpPort;
|
||||
dialectNotes?: KtxDialectNotesMcpPort;
|
||||
memoryIngest?: MemoryIngestPort;
|
||||
}
|
||||
|
||||
|
|
@ -189,6 +203,8 @@ export interface KtxMcpServerDeps {
|
|||
contextTools?: KtxMcpContextPorts;
|
||||
projectDir?: string;
|
||||
io?: KtxCliIo;
|
||||
/** Shared per-process logger for tool-call observability; tool-call logging is off when absent. */
|
||||
logger?: KtxMcpLogger;
|
||||
/** Reads the connected client's identity once the initialize handshake completes. */
|
||||
getClientInfo?: () => KtxMcpClientInfo | undefined;
|
||||
}
|
||||
|
|
|
|||
|
|
@ -168,7 +168,7 @@ export class MemoryAgentService {
|
|||
: '';
|
||||
const prompt = [
|
||||
`# Wiki Index\n\n${wikiIndex}`,
|
||||
hasSL ? `\n# Semantic Layer Sources\n\n${slIndex}` : '',
|
||||
hasSL ? `\n# Semantic Layer Sources (connectionId: ${input.connectionId})\n\n${slIndex}` : '',
|
||||
'\n---\n',
|
||||
assistantSection,
|
||||
`\n## User Message\n\n${input.userMessage.trim()}`,
|
||||
|
|
|
|||
|
|
@ -209,6 +209,11 @@ const scanRelationshipsSchema = z
|
|||
.union([z.literal('all'), z.int().nonnegative()])
|
||||
.optional()
|
||||
.describe('Cap on validation queries per scan run. Use "all" for unlimited, an integer for a hard cap, or omit for the runtime default.'),
|
||||
detectionBudgetMs: z
|
||||
.int()
|
||||
.positive()
|
||||
.default(600_000)
|
||||
.describe('Wall-clock budget (ms) for the whole relationship-detection stage. Checked at table-profile, LLM-proposal, candidate-validation, and composite-probe boundaries; above the per-query deadline. On exhaustion the stage stops scheduling new work and returns the relationships found so far, marked partial. Raise it to trigger a fresher, fuller run.'),
|
||||
})
|
||||
.describe('Schema-scan relationship discovery and validation tunables.');
|
||||
|
||||
|
|
|
|||
|
|
@ -30,7 +30,15 @@ function warehouseConnectionSchema<const Driver extends WarehouseDriver>(driver:
|
|||
.array(z.string().min(1))
|
||||
.optional()
|
||||
.describe(
|
||||
'Optional allowlist of fully-qualified table names ("schema.table") to ingest. When set, live-database ingest discards any table whose schema-qualified name is not in this list. Useful for smoke-testing ingest on a single table.',
|
||||
'Optional allowlist of object names to ingest. Accepted forms: "catalog.db.name", "db.name" (schema-qualified), or bare "name". When set, live-database ingest restricts the scan to the listed objects and fails with a clear error if none match. For SQLite, "main.<name>" and the bare "<name>" are equivalent (SQLite exposes a single "main" schema). Useful for smoke-testing ingest on a single table.',
|
||||
),
|
||||
query_timeout_ms: z
|
||||
.number()
|
||||
.int()
|
||||
.positive()
|
||||
.optional()
|
||||
.describe(
|
||||
'Maximum execution time for a single read-only query, in milliseconds (default 30000). Enforced as a server-side statement timeout for remote engines and by SIGKILL-ing a forked query subprocess for in-process SQLite. A query exceeding it is cancelled and returns a "query exceeded Ns" error so the agent can revise.',
|
||||
),
|
||||
})
|
||||
.describe(
|
||||
|
|
|
|||
|
|
@ -37,7 +37,7 @@ export interface InitKtxProjectResult extends KtxLocalProject {
|
|||
const TRACKED_SCAFFOLD_FILES: Array<{ path: string; content: string }> = [
|
||||
{
|
||||
path: '.ktx/.gitignore',
|
||||
content: 'cache/\ndb.sqlite\ndb.sqlite-*\ningest-transcripts/\nsecrets/\nsetup/\nagents/\n',
|
||||
content: 'cache/\ndb.sqlite\ndb.sqlite-*\ningest-transcripts/\nlogs/\nsecrets/\nsetup/\nagents/\n',
|
||||
},
|
||||
{ path: '.ktx/prompts/.gitkeep', content: '' },
|
||||
{ path: '.ktx/skills/.gitkeep', content: '' },
|
||||
|
|
|
|||
|
|
@ -24,6 +24,7 @@ const SETUP_GITIGNORE_ENTRIES = [
|
|||
'db.sqlite',
|
||||
'db.sqlite-*',
|
||||
'ingest-transcripts/',
|
||||
'logs/',
|
||||
'secrets/',
|
||||
'setup/',
|
||||
'agents/',
|
||||
|
|
|
|||
|
|
@ -1,5 +1,10 @@
|
|||
import type { KtxLlmRuntimePort } from '../../context/llm/runtime-port.js';
|
||||
import type { ChildProcess } from 'node:child_process';
|
||||
import { z } from 'zod';
|
||||
import type { KtxLlmRuntimePort } from '../../context/llm/runtime-port.js';
|
||||
import {
|
||||
KtxSubprocessDeadlineError,
|
||||
runGenerateObjectInSubprocess,
|
||||
} from '../../context/llm/subprocess-generate-object.js';
|
||||
import type {
|
||||
KtxColumnSampleInput,
|
||||
KtxColumnSampleResult,
|
||||
|
|
@ -145,6 +150,8 @@ export interface KtxDescriptionGeneratorOptions {
|
|||
logger?: KtxScanLoggerPort;
|
||||
onWarning?: (warning: KtxScanWarning) => void;
|
||||
settings: KtxDescriptionGenerationSettings;
|
||||
/** @internal Test seam: spawn the kill-boundary child for subprocess backends. */
|
||||
spawnSubprocessGenerateChild?: () => ChildProcess;
|
||||
}
|
||||
|
||||
interface ColumnTaskResult {
|
||||
|
|
@ -510,12 +517,14 @@ export class KtxDescriptionGenerator {
|
|||
private readonly logger?: KtxScanLoggerPort;
|
||||
private readonly onWarning?: (warning: KtxScanWarning) => void;
|
||||
private readonly settings: ResolvedKtxDescriptionGenerationSettings;
|
||||
private readonly spawnSubprocessGenerateChild?: () => ChildProcess;
|
||||
|
||||
constructor(options: KtxDescriptionGeneratorOptions) {
|
||||
this.llmRuntime = options.llmRuntime;
|
||||
this.cache = options.cache;
|
||||
this.logger = options.logger;
|
||||
this.onWarning = options.onWarning;
|
||||
this.spawnSubprocessGenerateChild = options.spawnSubprocessGenerateChild;
|
||||
this.settings = {
|
||||
columnMaxWords: options.settings.columnMaxWords,
|
||||
tableMaxWords: options.settings.tableMaxWords,
|
||||
|
|
@ -757,6 +766,21 @@ export class KtxDescriptionGenerator {
|
|||
let tableDescription: string | null = null;
|
||||
let structuredGenerationSucceeded = false;
|
||||
|
||||
// Bound + retry the per-table enrichment LLM call. A transient backend error
|
||||
// (e.g. an "overloaded"/burst rejection when many tables enrich concurrently)
|
||||
// otherwise nulls a whole table's descriptions on the FIRST failure — sampleTable
|
||||
// already retries, this call did not, so transient errors silently dropped most
|
||||
// tables of a db. retryAsync gives it the same 3-attempt backoff. A FRESH timeout
|
||||
// per attempt still bounds a wedged wide table (it never returns a result message);
|
||||
// a timeout is surfaced as KtxAbortedError so retryAsync does NOT retry it (one
|
||||
// wedge stays one timeout, not 3×). Tune via KTX_ENRICH_LLM_TIMEOUT_MS (default
|
||||
// 120s) and KTX_ENRICH_LLM_ATTEMPTS (default 3).
|
||||
const rawEnrichTimeoutMs = Number(process.env.KTX_ENRICH_LLM_TIMEOUT_MS);
|
||||
const enrichTimeoutMs = Number.isFinite(rawEnrichTimeoutMs) && rawEnrichTimeoutMs > 0 ? rawEnrichTimeoutMs : 120_000;
|
||||
const enrichAttempts = Math.max(1, Number(process.env.KTX_ENRICH_LLM_ATTEMPTS ?? 3) || 3);
|
||||
let llmStartedAt = 0;
|
||||
let lastTimedOut = false;
|
||||
|
||||
try {
|
||||
const prompt = batchedPrompt({
|
||||
table: input.table,
|
||||
|
|
@ -765,15 +789,91 @@ export class KtxDescriptionGenerator {
|
|||
tableMaxWords: this.settings.tableMaxWords,
|
||||
columnMaxWords: this.settings.columnMaxWords,
|
||||
});
|
||||
const generated = await this.llmRuntime.generateObject<
|
||||
BatchedTableDescriptionOutput,
|
||||
typeof batchedTableDescriptionSchema
|
||||
>({
|
||||
role: 'candidateExtraction',
|
||||
system: prompt.system,
|
||||
prompt: prompt.user,
|
||||
schema: batchedTableDescriptionSchema,
|
||||
temperature: this.settings.temperature,
|
||||
llmStartedAt = Date.now();
|
||||
this.logger?.info(
|
||||
`[enrich] llm:start table=${input.table.name} cols=${input.table.columns.length} promptChars=${prompt.user.length} timeoutMs=${enrichTimeoutMs} attempts=${enrichAttempts}`,
|
||||
{ connectorId: input.connector.id, table: input.table.name, columns: input.table.columns.length },
|
||||
);
|
||||
// Subprocess backends (codex/claude-code) own an SDK child that ignores the
|
||||
// in-process abort, so each attempt runs behind a tree-killable boundary;
|
||||
// HTTP backends keep the native abortSignal -> fetch cancellation.
|
||||
const enrichForkSpec = this.llmRuntime.subprocessForkSpec();
|
||||
const enrichJsonSchema = enrichForkSpec
|
||||
? (z.toJSONSchema(batchedTableDescriptionSchema, { target: 'draft-7' }) as Record<string, unknown>)
|
||||
: null;
|
||||
const generated = await retryAsync(
|
||||
async () => {
|
||||
if (enrichForkSpec && enrichJsonSchema) {
|
||||
try {
|
||||
return await runGenerateObjectInSubprocess<
|
||||
BatchedTableDescriptionOutput,
|
||||
typeof batchedTableDescriptionSchema
|
||||
>({
|
||||
forkSpec: enrichForkSpec,
|
||||
role: 'candidateExtraction',
|
||||
system: prompt.system,
|
||||
prompt: prompt.user,
|
||||
schema: batchedTableDescriptionSchema,
|
||||
jsonSchema: enrichJsonSchema,
|
||||
deadlineMs: enrichTimeoutMs,
|
||||
...(input.context.signal ? { signal: input.context.signal } : {}),
|
||||
...(this.spawnSubprocessGenerateChild
|
||||
? { spawnChild: this.spawnSubprocessGenerateChild }
|
||||
: {}),
|
||||
});
|
||||
} catch (error) {
|
||||
// The boundary tree-kills the wedged child on deadline; a per-table
|
||||
// timeout is not worth retrying (it would just time out again), so
|
||||
// surface it as KtxAbortedError so retryAsync stops immediately.
|
||||
if (error instanceof KtxSubprocessDeadlineError && !input.context.signal?.aborted) {
|
||||
lastTimedOut = true;
|
||||
throw new KtxAbortedError();
|
||||
}
|
||||
throw error;
|
||||
}
|
||||
}
|
||||
const enrichTimeout = AbortSignal.timeout(enrichTimeoutMs);
|
||||
const abortSignal = input.context.signal
|
||||
? AbortSignal.any([enrichTimeout, input.context.signal])
|
||||
: enrichTimeout;
|
||||
try {
|
||||
return await this.llmRuntime.generateObject<
|
||||
BatchedTableDescriptionOutput,
|
||||
typeof batchedTableDescriptionSchema
|
||||
>({
|
||||
role: 'candidateExtraction',
|
||||
system: prompt.system,
|
||||
prompt: prompt.user,
|
||||
schema: batchedTableDescriptionSchema,
|
||||
temperature: this.settings.temperature,
|
||||
abortSignal,
|
||||
});
|
||||
} catch (error) {
|
||||
// A per-table timeout is not worth retrying (it would just time out
|
||||
// again); surface it as KtxAbortedError so retryAsync stops immediately.
|
||||
// A genuine context cancellation is handled by retryAsync's own signal check.
|
||||
if (enrichTimeout.aborted && !input.context.signal?.aborted) {
|
||||
lastTimedOut = true;
|
||||
throw new KtxAbortedError();
|
||||
}
|
||||
throw error;
|
||||
}
|
||||
},
|
||||
{
|
||||
attempts: enrichAttempts,
|
||||
baseDelayMs: 500,
|
||||
...(input.context.signal ? { signal: input.context.signal } : {}),
|
||||
onAttemptFailure: (error, attempt) => {
|
||||
this.logger?.warn(
|
||||
`[enrich] llm:retry table=${input.table.name} attempt=${attempt}: ${errorMessage(error)}`,
|
||||
{ connectorId: input.connector.id, table: input.table.name, attempt },
|
||||
);
|
||||
},
|
||||
},
|
||||
);
|
||||
this.logger?.info(`[enrich] llm:done table=${input.table.name} ms=${Date.now() - llmStartedAt}`, {
|
||||
connectorId: input.connector.id,
|
||||
table: input.table.name,
|
||||
});
|
||||
structuredGenerationSucceeded = true;
|
||||
tableDescription = generated.tableDescription.trim() || null;
|
||||
|
|
@ -794,16 +894,25 @@ export class KtxDescriptionGenerator {
|
|||
});
|
||||
}
|
||||
} catch (error) {
|
||||
this.logger?.warn(`Batched table description failed for ${input.table.name}: ${errorMessage(error)}`, {
|
||||
connectorId: input.connector.id,
|
||||
table: input.table.name,
|
||||
});
|
||||
// A genuine cancellation propagates so the stage fails and resumes; a
|
||||
// per-table timeout (context.signal not aborted) still degrades to null.
|
||||
if (input.context.signal?.aborted) {
|
||||
throw error;
|
||||
}
|
||||
const elapsedMs = llmStartedAt ? Date.now() - llmStartedAt : 0;
|
||||
const timedOut = lastTimedOut;
|
||||
this.logger?.warn(
|
||||
`[enrich] llm:${timedOut ? 'TIMEOUT' : 'fail'} table=${input.table.name} cols=${input.table.columns.length} ms=${elapsedMs}: ${errorMessage(error)}`,
|
||||
{ connectorId: input.connector.id, table: input.table.name, timedOut, elapsedMs },
|
||||
);
|
||||
this.onWarning?.({
|
||||
code: 'enrichment_failed',
|
||||
message: `Failed to generate batched description for table ${input.table.name}: ${errorMessage(error)}`,
|
||||
code: timedOut ? 'enrichment_timeout' : 'enrichment_failed',
|
||||
message: `${
|
||||
timedOut ? `Timed out after ${elapsedMs}ms generating` : 'Failed to generate'
|
||||
} batched description for table ${input.table.name}: ${errorMessage(error)}`,
|
||||
table: input.table.name,
|
||||
recoverable: true,
|
||||
metadata: { connectorId: input.connector.id },
|
||||
metadata: { connectorId: input.connector.id, ...(timedOut ? { timeoutMs: enrichTimeoutMs } : {}) },
|
||||
});
|
||||
}
|
||||
|
||||
|
|
|
|||
|
|
@ -10,21 +10,34 @@ import type { KtxTableRef } from './types.js';
|
|||
* "catalog.db.name" — fully qualified
|
||||
* "db.name" — schema-qualified (catalog = null)
|
||||
* "name" — bare (catalog = db = null; SQLite-shape)
|
||||
*
|
||||
* SQLite exposes a single schema named `main` but the connector emits objects
|
||||
* with `db: null`, so the `"main.<name>"` form is normalized to the bare shape
|
||||
* to match. Both `"main.customers"` and `"customers"` therefore select the same
|
||||
* object.
|
||||
*/
|
||||
export function resolveEnabledTables(
|
||||
connection: Record<string, unknown> | undefined,
|
||||
): ReadonlySet<KtxTableRefKey> | null {
|
||||
const raw = connection?.enabled_tables;
|
||||
if (!Array.isArray(raw) || raw.length === 0) return null;
|
||||
const driver = typeof connection?.driver === 'string' ? connection.driver : undefined;
|
||||
const refs: KtxTableRef[] = [];
|
||||
for (const value of raw) {
|
||||
const parsed = parseEnabledTableEntry(value);
|
||||
if (parsed) refs.push(parsed);
|
||||
if (parsed) refs.push(normalizeRefForDriver(parsed, driver));
|
||||
}
|
||||
if (refs.length === 0) return null;
|
||||
return tableRefSet(refs);
|
||||
}
|
||||
|
||||
function normalizeRefForDriver(ref: KtxTableRef, driver: string | undefined): KtxTableRef {
|
||||
if (driver === 'sqlite' && ref.catalog === null && ref.db === 'main') {
|
||||
return { catalog: null, db: null, name: ref.name };
|
||||
}
|
||||
return ref;
|
||||
}
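The SQLite normalization above can be illustrated with a self-contained sketch. `normalizeRefForDriver` follows the diff; `TableRef`, `parseDotted`, and `refKey` are hypothetical simplifications standing in for `KtxTableRef`, `parseDottedTableEntry`, and `tableRefKey`, whose real implementations are not shown in this hunk.

```typescript
interface TableRef {
  catalog: string | null;
  db: string | null;
  name: string;
}

// Hypothetical parser: splits "catalog.db.name", "db.name", or "name" on dots.
function parseDotted(entry: string): TableRef | null {
  const parts = entry.split('.').map((part) => part.trim());
  if (parts.length > 3 || parts.some((part) => part.length === 0)) return null;
  if (parts.length === 3) return { catalog: parts[0]!, db: parts[1]!, name: parts[2]! };
  if (parts.length === 2) return { catalog: null, db: parts[0]!, name: parts[1]! };
  return { catalog: null, db: null, name: parts[0]! };
}

// As in the diff: SQLite's "main.<name>" collapses to the bare shape.
function normalizeRefForDriver(ref: TableRef, driver: string | undefined): TableRef {
  if (driver === 'sqlite' && ref.catalog === null && ref.db === 'main') {
    return { catalog: null, db: null, name: ref.name };
  }
  return ref;
}

// Hypothetical set key; the real key format is an assumption here.
function refKey(ref: TableRef): string {
  return `${ref.catalog ?? ''}|${ref.db ?? ''}|${ref.name}`;
}

function keyFor(entry: string, driver: string): string | null {
  const parsed = parseDotted(entry);
  return parsed ? refKey(normalizeRefForDriver(parsed, driver)) : null;
}

// Under SQLite both spellings select the same object; under other drivers they differ.
console.log(keyFor('main.customers', 'sqlite') === keyFor('customers', 'sqlite'));
console.log(keyFor('main.customers', 'postgres') === keyFor('customers', 'postgres'));
```

Normalizing at allowlist-resolution time, instead of at match time, keeps the comparison a plain set lookup against the refs the connector emits.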
|
||||
|
||||
function parseEnabledTableEntry(value: unknown): KtxTableRef | null {
|
||||
if (typeof value === 'string') {
|
||||
return parseDottedTableEntry(value);
|
||||
|
|
|
|||
|
|
@ -1,14 +1,19 @@
|
|||
import { createHash } from 'node:crypto';
|
||||
import type { KtxScanRelationshipConfig } from '../project/config.js';
|
||||
import type { KtxScanEnrichmentStage, KtxScanEnrichmentStateSummary, KtxScanMode, KtxSchemaSnapshot } from './types.js';
|
||||
|
||||
const KTX_SCAN_ENRICHMENT_STAGES: readonly KtxScanEnrichmentStage[] = [
|
||||
/**
|
||||
* Canonical enrichment-stage registry. The `--stages` CLI parser validates
|
||||
* against this list, and stage selection / iteration derives its order here.
|
||||
*/
|
||||
export const KTX_SCAN_ENRICHMENT_STAGES: readonly KtxScanEnrichmentStage[] = [
|
||||
'descriptions',
|
||||
'embeddings',
|
||||
'relationships',
|
||||
] as const;
|
||||
|
||||
export interface KtxScanEnrichmentStageLookup {
|
||||
runId: string;
|
||||
connectionId: string;
|
||||
stage: KtxScanEnrichmentStage;
|
||||
inputHash: string;
|
||||
}
|
||||
|
|
@ -47,6 +52,15 @@ export interface KtxScanEnrichmentStateStore {
|
|||
findCompletedStage<TOutput = unknown>(
|
||||
input: KtxScanEnrichmentStageLookup,
|
||||
): Promise<KtxScanEnrichmentCompletedStage<TOutput> | null>;
|
||||
/**
|
||||
* The most recently completed row for a (connection, stage) pair regardless of
|
||||
* input hash. Used by the staleness check to compare a stage's stored hash
|
||||
* against its freshly recomputed one (D4).
|
||||
*/
|
||||
findLatestCompletedStage(input: {
|
||||
connectionId: string;
|
||||
stage: KtxScanEnrichmentStage;
|
||||
}): Promise<KtxScanEnrichmentCompletedStage | null>;
|
||||
saveCompletedStage<TOutput = unknown>(
|
||||
input: Omit<KtxScanEnrichmentCompletedStage<TOutput>, 'status' | 'errorMessage'>,
|
||||
): Promise<void>;
|
||||
|
|
@ -54,12 +68,35 @@ export interface KtxScanEnrichmentStateStore {
|
|||
listRunStages(runId: string): Promise<KtxScanEnrichmentStageRecord[]>;
|
||||
}
|
||||
|
||||
export interface ComputeKtxScanEnrichmentInputHashInput {
|
||||
/** Description-LLM identity: the inputs that change a description's content. */
|
||||
export interface KtxScanLlmIdentity {
|
||||
model: string | null;
|
||||
baseUrlConfigured: boolean;
|
||||
}
|
||||
|
||||
/** Embedding-model identity: the inputs that change an embedding vector. */
|
||||
export interface KtxScanEmbeddingIdentity {
|
||||
model: string | null;
|
||||
dimensions: number | null;
|
||||
batchSize: number | null;
|
||||
}
|
||||
|
||||
export interface KtxDescriptionsStageHashInput {
|
||||
snapshot: KtxSchemaSnapshot;
|
||||
mode: KtxScanMode;
|
||||
detectRelationships: boolean;
|
||||
providerIdentity: Record<string, unknown>;
|
||||
relationshipSettings?: unknown;
|
||||
llmIdentity: KtxScanLlmIdentity;
|
||||
}
|
||||
|
||||
export interface KtxEmbeddingsStageHashInput {
|
||||
snapshot: KtxSchemaSnapshot;
|
||||
embeddingIdentity: KtxScanEmbeddingIdentity;
|
||||
/** Digest of the resolved description text the embeddings consume (see {@link computeKtxScanDescriptionDigest}). */
|
||||
descriptionDigest: string;
|
||||
}
|
||||
|
||||
export interface KtxRelationshipsStageHashInput {
|
||||
snapshot: KtxSchemaSnapshot;
|
||||
relationshipSettings: KtxScanRelationshipConfig;
|
||||
llmIdentity: KtxScanLlmIdentity;
|
||||
}
|
||||
|
||||
function stableJson(value: unknown): string {
|
||||
|
|
@ -75,8 +112,38 @@ function stableJson(value: unknown): string {
|
|||
return JSON.stringify(value);
|
||||
}
|
||||
|
||||
export function computeKtxScanEnrichmentInputHash(input: ComputeKtxScanEnrichmentInputHashInput): string {
|
||||
return createHash('sha256').update(stableJson(input)).digest('hex');
|
||||
function sha256(value: unknown): string {
|
||||
return createHash('sha256').update(stableJson(value)).digest('hex');
|
||||
}
|
||||
|
||||
export function computeKtxDescriptionsStageHash(input: KtxDescriptionsStageHashInput): string {
|
||||
return sha256({ snapshot: input.snapshot, llmIdentity: input.llmIdentity });
|
||||
}
|
||||
|
||||
export function computeKtxEmbeddingsStageHash(input: KtxEmbeddingsStageHashInput): string {
|
||||
return sha256({
|
||||
snapshot: input.snapshot,
|
||||
embeddingIdentity: input.embeddingIdentity,
|
||||
descriptionDigest: input.descriptionDigest,
|
||||
});
|
||||
}
|
||||
|
||||
export function computeKtxRelationshipsStageHash(input: KtxRelationshipsStageHashInput): string {
|
||||
return sha256({
|
||||
snapshot: input.snapshot,
|
||||
relationshipSettings: input.relationshipSettings,
|
||||
llmIdentity: input.llmIdentity,
|
||||
});
|
||||
}
|
||||
|
||||
/**
|
||||
* Content digest of the resolved per-column description text the embeddings
|
||||
* stage consumes. Folding it into the embeddings hash content-addresses
|
||||
* embeddings on their real upstream, so re-describing busts only the embeddings
|
||||
* that depend on the changed text (D4 self-healing).
|
||||
*/
|
||||
export function computeKtxScanDescriptionDigest(texts: readonly string[]): string {
|
||||
return sha256(texts);
|
||||
}
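The per-stage hashes above rest on a canonical JSON encoding, so that the same inputs always produce the same hash. The sketch below is illustrative only: this `stableJson` is a hypothetical key-sorting canonicalizer (the diff shows only the tail of the real one), and `Snapshot` is a placeholder for `KtxSchemaSnapshot`. It shows why folding the description digest into the embeddings hash invalidates embeddings when description text changes.

```typescript
import { createHash } from 'node:crypto';

// Hypothetical canonicalizer: recursively sorts object keys so key order never affects the hash.
function stableJson(value: unknown): string {
  if (Array.isArray(value)) return `[${value.map(stableJson).join(',')}]`;
  if (value !== null && typeof value === 'object') {
    const entries = Object.entries(value as Record<string, unknown>)
      .filter(([, v]) => v !== undefined)
      .sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0));
    return `{${entries.map(([k, v]) => `${JSON.stringify(k)}:${stableJson(v)}`).join(',')}}`;
  }
  return JSON.stringify(value) ?? 'null';
}

function sha256(value: unknown): string {
  return createHash('sha256').update(stableJson(value)).digest('hex');
}

// Placeholder for KtxSchemaSnapshot (assumption).
type Snapshot = { tables: string[] };

interface EmbeddingIdentity {
  model: string | null;
  dimensions: number | null;
  batchSize: number | null;
}

function descriptionDigest(texts: readonly string[]): string {
  return sha256(texts);
}

function embeddingsStageHash(input: {
  snapshot: Snapshot;
  embeddingIdentity: EmbeddingIdentity;
  descriptionDigest: string;
}): string {
  return sha256(input);
}

const snapshot: Snapshot = { tables: ['orders'] };
const identity: EmbeddingIdentity = { model: 'example-model', dimensions: 256, batchSize: 32 };

const before = embeddingsStageHash({
  snapshot,
  embeddingIdentity: identity,
  descriptionDigest: descriptionDigest(['Order id']),
});
const after = embeddingsStageHash({
  snapshot,
  embeddingIdentity: identity,
  descriptionDigest: descriptionDigest(['Unique order identifier']),
});

console.log(before === after); // differing description text yields differing hashes
```

Because each stage hashes only the inputs it consumes, a change to relationship settings in this scheme would leave the embeddings hash untouched.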
|
||||
|
||||
function uniqueStages(stages: KtxScanEnrichmentStage[]): KtxScanEnrichmentStage[] {
|
||||
|
|
|
|||
|
|
@ -1,10 +1,11 @@
|
|||
import YAML from 'yaml';
|
||||
import { buildLiveDatabaseManifestShards, type LiveDatabaseManifestExistingDescriptions, type LiveDatabaseManifestJoinData, type LiveDatabaseManifestJoinEntry, type LiveDatabaseManifestShard, type LiveDatabaseManifestTableData } from '../../context/ingest/adapters/live-database/manifest.js';
|
||||
import { buildLiveDatabaseManifestShards, buildTableRef, type LiveDatabaseManifestExistingDescriptions, type LiveDatabaseManifestJoinData, type LiveDatabaseManifestJoinEntry, type LiveDatabaseManifestShard, type LiveDatabaseManifestTableData } from '../../context/ingest/adapters/live-database/manifest.js';
|
||||
import type { TableUsageOutput } from '../../context/ingest/adapters/historic-sql/skill-schemas.js';
|
||||
import type { KtxScanRelationshipConfig } from '../project/config.js';
|
||||
import type { KtxLocalProject } from '../../context/project/project.js';
|
||||
import { isSlYamlPath } from '../../context/sl/source-files.js';
|
||||
import { deriveFederatedConnection } from '../connections/federation.js';
|
||||
import { tableRefKey } from './table-ref.js';
|
||||
import type { KtxLocalScanEnrichmentResult } from './local-enrichment.js';
|
||||
import {
|
||||
buildKtxRelationshipArtifacts,
|
||||
|
|
@ -28,6 +29,12 @@ export interface WriteLocalScanManifestShardsInput {
|
|||
dryRun: boolean;
|
||||
descriptionUpdates?: KtxLocalScanEnrichmentResult['descriptionUpdates'];
|
||||
relationshipUpdate?: KtxLocalScanEnrichmentResult['relationshipUpdate'];
|
||||
/**
|
||||
* When set, write only the shards that contain one of these tables. All shards
|
||||
* are still built (so merging preserves prior content); the unlisted shards are
|
||||
* left untouched on disk. Used by the incremental flush to bound git commits.
|
||||
*/
|
||||
onlyChangedTableNames?: ReadonlySet<string>;
|
||||
}
|
||||
|
||||
export interface WriteLocalScanManifestShardsResult {
|
||||
|
|
@ -75,9 +82,8 @@ function schemaDir(connectionId: string): string {
|
|||
|
||||
function tableDescription(
|
||||
table: KtxSchemaTable,
|
||||
descriptionUpdates: LocalDescriptionUpdates = [],
|
||||
update: LocalDescriptionUpdates[number] | undefined,
|
||||
): Record<string, string> | undefined {
|
||||
const update = descriptionUpdates.find((candidate) => candidate.table.name === table.name);
|
||||
const descriptions: Record<string, string> = {};
|
||||
if (table.comment) {
|
||||
descriptions.db = table.comment;
|
||||
|
|
@ -89,11 +95,9 @@ function tableDescription(
|
|||
}
|
||||
|
||||
function columnDescription(
|
||||
table: KtxSchemaTable,
|
||||
column: KtxSchemaColumn,
|
||||
descriptionUpdates: LocalDescriptionUpdates = [],
|
||||
update: LocalDescriptionUpdates[number] | undefined,
|
||||
): Record<string, string> | undefined {
|
||||
const update = descriptionUpdates.find((candidate) => candidate.table.name === table.name);
|
||||
const aiDescription = update?.columnDescriptions[column.name] ?? null;
|
||||
const descriptions: Record<string, string> = {};
|
||||
if (column.comment) {
|
||||
|
|
@ -109,19 +113,25 @@ function snapshotTablesToManifestData(
|
|||
snapshot: KtxSchemaSnapshot,
|
||||
descriptionUpdates: LocalDescriptionUpdates = [],
|
||||
): LiveDatabaseManifestTableData[] {
|
||||
return snapshot.tables.map((table) => ({
|
||||
name: table.name,
|
||||
catalog: table.catalog,
|
||||
db: table.db,
|
||||
descriptions: tableDescription(table, descriptionUpdates),
|
||||
columns: table.columns.map((column) => ({
|
||||
name: column.name,
|
||||
type: column.dimensionType,
|
||||
...(column.primaryKey ? { pk: true } : {}),
|
||||
...(column.nullable === false ? { nullable: false } : {}),
|
||||
descriptions: columnDescription(table, column, descriptionUpdates),
|
||||
})),
|
||||
}));
|
||||
// Resolve a table's descriptions by full identity: two same-named tables in
|
||||
// different schemas must not collapse onto one update.
|
||||
const updateByRef = new Map(descriptionUpdates.map((update) => [tableRefKey(update.table), update]));
|
||||
return snapshot.tables.map((table) => {
|
||||
const update = updateByRef.get(tableRefKey({ catalog: table.catalog, db: table.db, name: table.name }));
|
||||
return {
|
||||
name: table.name,
|
||||
catalog: table.catalog,
|
||||
db: table.db,
|
||||
descriptions: tableDescription(table, update),
|
||||
columns: table.columns.map((column) => ({
|
||||
name: column.name,
|
||||
type: column.dimensionType,
|
||||
...(column.primaryKey ? { pk: true } : {}),
|
||||
...(column.nullable === false ? { nullable: false } : {}),
|
||||
descriptions: columnDescription(column, update),
|
||||
})),
|
||||
};
|
||||
});
|
||||
}
|
||||
|
||||
function formalJoins(snapshot: KtxSchemaSnapshot): LiveDatabaseManifestJoinData[] {
|
||||
|
|
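The hunk above replaces a bare-name `find` with a lookup keyed on the full table reference. Below is a minimal sketch of why that matters. The `refKey` helper and its key format are hypothetical; the real `tableRefKey` is imported from `./table-ref.js` and its implementation is not shown in this diff.

```typescript
// Hypothetical stand-ins for the types in this diff; only the shape matters here.
interface TableRef {
  catalog: string | null;
  db: string | null;
  name: string;
}
interface DescriptionUpdate {
  table: TableRef;
  tableDescription: string | null;
}

// Assumed key format (illustrative only): JSON avoids ambiguity if a part contains a separator.
function refKey(ref: TableRef): string {
  return JSON.stringify([ref.catalog, ref.db, ref.name]);
}

const updates: DescriptionUpdate[] = [
  { table: { catalog: null, db: 'sales', name: 'orders' }, tableDescription: 'Customer orders' },
  { table: { catalog: null, db: 'archive', name: 'orders' }, tableDescription: 'Archived orders' },
];

// Bare-name lookup: both tables are named "orders", so find() always returns the first one.
const byName = updates.find((u) => u.table.name === 'orders');

// Full-identity lookup: each schema's table resolves to its own update.
const byRef = new Map<string, DescriptionUpdate>(
  updates.map((u): [string, DescriptionUpdate] => [refKey(u.table), u]),
);
const archived = byRef.get(refKey({ catalog: null, db: 'archive', name: 'orders' }));
```

With the bare-name lookup, `archive.orders` would silently inherit the description of `sales.orders`; the keyed map keeps the two apart and also avoids a linear scan per table.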
@ -256,7 +266,10 @@ async function loadExistingManifestState(
|
|||
if (!validTableNames.has(tableName)) {
|
||||
continue;
|
||||
}
|
||||
descriptions.set(tableName, {
|
||||
// Descriptions/usage key on the fully-qualified `entry.table` ref so two
|
||||
// same-named tables across schemas stay distinct; joins remain keyed by
|
||||
// bare name to match the bare-name join graph.
|
||||
descriptions.set(entry.table, {
|
||||
table: entry.descriptions ? { ...entry.descriptions } : undefined,
|
||||
columns: new Map(
|
||||
(entry.columns ?? []).flatMap((column) =>
|
||||
|
|
@ -265,7 +278,7 @@ async function loadExistingManifestState(
|
|||
),
|
||||
});
|
||||
if (entry.usage) {
|
||||
usage.set(tableName, { ...entry.usage });
|
||||
usage.set(entry.table, { ...entry.usage });
|
||||
}
|
||||
const joins = (entry.joins ?? []).filter((join) => {
|
||||
return (
|
||||
|
|
@ -286,6 +299,108 @@ async function loadExistingManifestState(
|
|||
return { descriptions, preservedJoins, usage };
|
||||
}
|
||||
|
||||
/**
|
||||
* Reconstructs the descriptions already persisted in the on-disk `_schema` as
|
||||
* the in-memory `descriptionUpdates` shape, so a stage-selective run that skips
|
||||
* the descriptions stage (e.g. `--stages relationships`/`--stages embeddings`)
|
||||
* can still feed embeddings + relationships the prior AI descriptions. Tables or
|
||||
* columns with no AI description carry `null`.
|
||||
*/
|
||||
export async function loadOnDiskDescriptionUpdates(
|
||||
project: KtxLocalProject,
|
||||
connectionId: string,
|
||||
snapshot: KtxSchemaSnapshot,
|
||||
): Promise<LocalDescriptionUpdates> {
|
||||
const siblingTargets = await federatedSiblingTargets(project, connectionId);
|
||||
const existing = await loadExistingManifestState(project, connectionId, snapshot, siblingTargets);
|
||||
return snapshot.tables.map((table) => {
|
||||
const entry = existing.descriptions.get(buildTableRef(table.name, table.catalog, table.db));
|
||||
const columnDescriptions: Record<string, string | null> = {};
|
||||
for (const column of table.columns) {
|
||||
columnDescriptions[column.name] = entry?.columns.get(column.name)?.ai ?? null;
|
||||
}
|
||||
return {
|
||||
table: { catalog: table.catalog, db: table.db, name: table.name },
|
||||
tableDescription: entry?.table?.ai ?? null,
|
||||
columnDescriptions,
|
||||
};
|
||||
});
|
||||
}
|
||||
|
||||
// The incremental descriptions resume record. It lives at a stable, NON-syncId
|
||||
// path: a from-scratch interruption gets a fresh syncId on the next run, so a
|
||||
// syncId-scoped record would be unreachable on resume. The manifest already lives
|
||||
// at the same stable per-connection scope.
|
||||
function descriptionsProgressPath(connectionId: string): string {
|
||||
return `raw-sources/${connectionId}/${LIVE_DATABASE_ADAPTER}/enrichment-progress/descriptions.json`;
|
||||
}
|
||||
|
||||
interface DescriptionsProgressRecord {
|
||||
inputHash: string;
|
||||
descriptions: LocalDescriptionUpdates;
|
||||
}
|
||||
|
||||
export interface KtxScanDescriptionResumeStore {
|
||||
/** Prior enriched descriptions when the durable record matches `inputHash`, else null. */
|
||||
load(inputHash: string): Promise<LocalDescriptionUpdates | null>;
|
||||
/** Persist the descriptions so far + the manifest shards that gained a table this batch. */
|
||||
flush(input: {
|
||||
inputHash: string;
|
||||
snapshot: KtxSchemaSnapshot;
|
||||
descriptionUpdates: LocalDescriptionUpdates;
|
||||
changedTableNames: ReadonlySet<string>;
|
||||
}): Promise<void>;
|
||||
}
|
||||
|
||||
export function createKtxScanDescriptionResumeStore(deps: {
|
||||
project: KtxLocalProject;
|
||||
connectionId: string;
|
||||
syncId: string;
|
||||
driver: KtxConnectionDriver;
|
||||
}): KtxScanDescriptionResumeStore {
|
||||
const path = descriptionsProgressPath(deps.connectionId);
|
||||
return {
|
||||
async load(inputHash) {
|
||||
let content: string;
|
||||
try {
|
||||
({ content } = await deps.project.fileStore.readFile(path));
|
||||
} catch {
|
||||
return null;
|
||||
}
|
||||
try {
|
||||
const record = JSON.parse(content) as DescriptionsProgressRecord | null;
|
||||
// A changed inputHash (schema or enrichment settings changed) ignores the
|
||||
// prior record and recomputes — spec-19's inputHash-gated resume semantics.
|
||||
if (!record || record.inputHash !== inputHash || !Array.isArray(record.descriptions)) {
|
||||
return null;
|
||||
}
|
||||
return record.descriptions;
|
||||
} catch {
|
||||
return null;
|
||||
}
|
||||
},
|
||||
async flush({ inputHash, snapshot, descriptionUpdates, changedTableNames }) {
|
||||
const record: DescriptionsProgressRecord = { inputHash, descriptions: descriptionUpdates };
|
||||
await writeJsonArtifact(
|
||||
deps.project,
|
||||
path,
|
||||
record,
|
||||
`scan(${LIVE_DATABASE_ADAPTER}): flush enrichment descriptions progress syncId=${deps.syncId}`,
|
||||
);
|
||||
await writeLocalScanManifestShards({
|
||||
project: deps.project,
|
||||
connectionId: deps.connectionId,
|
||||
syncId: deps.syncId,
|
||||
driver: deps.driver,
|
||||
snapshot,
|
||||
descriptionUpdates,
|
||||
dryRun: false,
|
||||
onlyChangedTableNames: changedTableNames,
|
||||
});
|
||||
},
|
||||
};
|
||||
}
|
||||
|
||||
async function writeJsonArtifact(
|
||||
project: KtxLocalProject,
|
||||
path: string,
|
||||
|
|
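The `load` method in the hunk above returns prior descriptions only when the stored record matches the current `inputHash`. The gate can be sketched as a pure function, under the assumption that the file content arrives as a string (or `null` when the file is missing). `ProgressRecord` and `parseProgressRecord` are illustrative names, not part of the codebase.

```typescript
// Hypothetical record shape mirroring DescriptionsProgressRecord in the diff.
interface ProgressRecord {
  inputHash: string;
  descriptions: unknown[];
}

// Any unreadable, malformed, or stale record yields null, so the caller recomputes.
function parseProgressRecord(content: string | null, inputHash: string): unknown[] | null {
  if (content === null) {
    return null; // file missing
  }
  try {
    const record = JSON.parse(content) as ProgressRecord | null;
    if (!record || record.inputHash !== inputHash || !Array.isArray(record.descriptions)) {
      return null; // stale hash or wrong shape
    }
    return record.descriptions;
  } catch {
    return null; // corrupt JSON
  }
}

const stored = JSON.stringify({ inputHash: 'abc', descriptions: [{ table: 'orders' }] });
const hit = parseProgressRecord(stored, 'abc');
const stale = parseProgressRecord(stored, 'def');
const corrupt = parseProgressRecord('{not json', 'abc');
const missing = parseProgressRecord(null, 'abc');
```

Treating every failure mode as "no record" keeps resume strictly an optimization: the worst case is a full recompute, never a crash or a reuse of descriptions generated for a different schema or model.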
@ -331,6 +446,9 @@ export async function writeLocalScanManifestShards(
|
|||
|
||||
const manifestShards: string[] = [];
|
||||
for (const [shardKey, shard] of [...shards.entries()].sort(([left], [right]) => left.localeCompare(right))) {
|
||||
if (input.onlyChangedTableNames && !Object.keys(shard.tables).some((table) => input.onlyChangedTableNames!.has(table))) {
|
||||
continue;
|
||||
}
|
||||
const path = `${schemaDir(input.connectionId)}/${shardKey}.yaml`;
|
||||
await input.project.fileStore.writeFile(
|
||||
path,
|
||||
|
|
@ -348,23 +466,14 @@ export async function writeLocalScanManifestShards(
|
|||
};
|
||||
}
|
||||
|
||||
export async function writeLocalScanEnrichmentArtifacts(
|
||||
input: WriteLocalScanEnrichmentArtifactsInput,
|
||||
): Promise<WriteLocalScanEnrichmentArtifactsResult> {
|
||||
if (input.dryRun) {
|
||||
return {
|
||||
enrichmentArtifacts: [],
|
||||
manifestShards: [],
|
||||
manifestShardsWritten: 0,
|
||||
};
|
||||
}
|
||||
|
||||
const enrichmentRoot = artifactDir(input.connectionId, input.syncId);
|
||||
const descriptionsArtifact = `${enrichmentRoot}/descriptions.json`;
|
||||
const embeddingsArtifact = `${enrichmentRoot}/embeddings.json`;
|
||||
const relationshipsArtifact = `${enrichmentRoot}/relationships.json`;
|
||||
const relationshipProfileArtifact = `${enrichmentRoot}/relationship-profile.json`;
|
||||
const relationshipDiagnosticsArtifact = `${enrichmentRoot}/relationship-diagnostics.json`;
|
||||
async function writeEnrichmentDescriptionArtifacts(input: {
|
||||
project: KtxLocalProject;
|
||||
enrichmentRoot: string;
|
||||
syncId: string;
|
||||
enrichment: KtxLocalScanEnrichmentResult;
|
||||
}): Promise<string[]> {
|
||||
const descriptionsArtifact = `${input.enrichmentRoot}/descriptions.json`;
|
||||
const embeddingsArtifact = `${input.enrichmentRoot}/embeddings.json`;
|
||||
const enrichmentArtifacts: string[] = [];
|
||||
|
||||
if (
|
||||
|
|
@ -388,6 +497,67 @@ export async function writeLocalScanEnrichmentArtifacts(
|
|||
`scan(${LIVE_DATABASE_ADAPTER}): write enrichment embeddings syncId=${input.syncId}`,
|
||||
);
|
||||
}
|
||||
return enrichmentArtifacts;
|
||||
}
|
||||
|
||||
/**
|
||||
* Promote the descriptions + embeddings into the queryable `_schema` manifest
|
||||
* (and the raw enrichment artifacts) before relationship detection runs. The
|
||||
* generated joins and the relationship diagnostics are deliberately left to the
|
||||
* final write, so an interrupted relationship stage never loses the paid LLM
|
||||
* enrichment and never emits empty relationship diagnostics.
|
||||
*/
|
||||
export async function writeLocalScanEnrichmentCheckpoint(
|
||||
input: WriteLocalScanEnrichmentArtifactsInput,
|
||||
): Promise<WriteLocalScanEnrichmentArtifactsResult> {
|
||||
if (input.dryRun) {
|
||||
return { enrichmentArtifacts: [], manifestShards: [], manifestShardsWritten: 0 };
|
||||
}
|
||||
|
||||
const enrichmentArtifacts = await writeEnrichmentDescriptionArtifacts({
|
||||
project: input.project,
|
||||
enrichmentRoot: artifactDir(input.connectionId, input.syncId),
|
||||
syncId: input.syncId,
|
||||
enrichment: input.enrichment,
|
||||
});
|
||||
const manifestResult = await writeLocalScanManifestShards({
|
||||
project: input.project,
|
||||
connectionId: input.connectionId,
|
||||
syncId: input.syncId,
|
||||
driver: input.driver,
|
||||
snapshot: input.enrichment.snapshot,
|
||||
descriptionUpdates: input.enrichment.descriptionUpdates,
|
||||
dryRun: false,
|
||||
});
|
||||
|
||||
return {
|
||||
enrichmentArtifacts,
|
||||
manifestShards: manifestResult.manifestShards,
|
||||
manifestShardsWritten: manifestResult.manifestShardsWritten,
|
||||
};
|
||||
}
|
||||
|
||||
export async function writeLocalScanEnrichmentArtifacts(
|
||||
input: WriteLocalScanEnrichmentArtifactsInput,
|
||||
): Promise<WriteLocalScanEnrichmentArtifactsResult> {
|
||||
if (input.dryRun) {
|
||||
return {
|
||||
enrichmentArtifacts: [],
|
||||
manifestShards: [],
|
||||
manifestShardsWritten: 0,
|
||||
};
|
||||
}
|
||||
|
||||
const enrichmentRoot = artifactDir(input.connectionId, input.syncId);
|
||||
const relationshipsArtifact = `${enrichmentRoot}/relationships.json`;
|
||||
const relationshipProfileArtifact = `${enrichmentRoot}/relationship-profile.json`;
|
||||
const relationshipDiagnosticsArtifact = `${enrichmentRoot}/relationship-diagnostics.json`;
|
||||
const enrichmentArtifacts = await writeEnrichmentDescriptionArtifacts({
|
||||
project: input.project,
|
||||
enrichmentRoot,
|
||||
syncId: input.syncId,
|
||||
enrichment: input.enrichment,
|
||||
});
|
||||
enrichmentArtifacts.push(relationshipsArtifact, relationshipProfileArtifact, relationshipDiagnosticsArtifact);
|
||||
const hasResolvedRelationships = input.enrichment.resolvedRelationships !== null;
|
||||
const relationshipArtifacts = buildKtxRelationshipArtifacts({
|
||||
|
|
@ -413,6 +583,7 @@ export async function writeLocalScanEnrichmentArtifacts(
|
|||
artifacts: relationshipArtifacts,
|
||||
profile: relationshipProfile,
|
||||
warnings: input.enrichment.warnings,
|
||||
partial: input.enrichment.relationshipPartial,
|
||||
thresholds: input.relationshipSettings
|
||||
? {
|
||||
acceptThreshold: input.relationshipSettings.acceptThreshold,
|
||||
|
|
|
|||
|
|
@ -6,11 +6,19 @@ import { KtxDescriptionGenerator } from './description-generation.js';
|
|||
import { buildKtxColumnEmbeddingText } from './embedding-text.js';
|
||||
import {
|
||||
completedKtxScanEnrichmentStateSummary,
|
||||
computeKtxScanEnrichmentInputHash,
|
||||
computeKtxDescriptionsStageHash,
|
||||
computeKtxEmbeddingsStageHash,
|
||||
computeKtxRelationshipsStageHash,
|
||||
computeKtxScanDescriptionDigest,
|
||||
KTX_SCAN_ENRICHMENT_STAGES,
|
||||
type KtxScanEmbeddingIdentity,
|
||||
type KtxScanEnrichmentStateStore,
|
||||
type KtxScanLlmIdentity,
|
||||
summarizeKtxScanEnrichmentState,
|
||||
} from './enrichment-state.js';
|
||||
import { skippedKtxScanEnrichmentSummary } from './enrichment-summary.js';
|
||||
import type { KtxScanDescriptionResumeStore } from './local-enrichment-artifacts.js';
|
||||
import { tableRefKey } from './table-ref.js';
|
||||
import type {
|
||||
KtxEmbeddingUpdate,
|
||||
KtxEnrichedColumn,
|
||||
|
|
@ -21,6 +29,7 @@ import type {
|
|||
KtxRelationshipUpdate,
|
||||
} from './enrichment-types.js';
|
||||
import type { KtxCompositeRelationshipCandidate } from './relationship-composite-candidates.js';
|
||||
import type { KtxRelationshipDetectionStopReason } from './relationship-detection-budget.js';
|
||||
import type { KtxResolvedRelationshipDiscoveryCandidate } from './relationship-graph-resolver.js';
|
||||
import { discoverKtxRelationships } from './relationship-discovery.js';
|
||||
import type { KtxRelationshipProfileArtifact } from './relationship-profiling.js';
|
||||
|
|
@ -42,7 +51,13 @@ import type {
|
|||
KtxTableRef,
|
||||
} from './types.js';
|
||||
|
||||
const DESCRIPTION_TABLE_CONCURRENCY = 4;
|
||||
// Parallel per-table description generations. Default 4; raise via
|
||||
// KTX_ENRICH_TABLE_CONCURRENCY for large schemas (the rate-limit governor still
|
||||
// throttles if the provider pushes back, so a higher cap is safe headroom).
|
||||
const DESCRIPTION_TABLE_CONCURRENCY = (() => {
|
||||
const raw = Number(process.env.KTX_ENRICH_TABLE_CONCURRENCY);
|
||||
return Number.isInteger(raw) && raw >= 1 && raw <= 64 ? raw : 4;
|
||||
})();
|
||||
|
||||
export interface KtxLocalScanEnrichmentProviders {
|
||||
llmRuntime: KtxLlmRuntimePort;
|
||||
|
|
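The `KTX_ENRICH_TABLE_CONCURRENCY` parsing in the hunk above accepts only integers from 1 to 64 and otherwise falls back to 4. A small sketch of the same check as a reusable function follows; `boundedIntegerSetting` is a hypothetical name used only for illustration.

```typescript
// Parse a bounded positive-integer setting, falling back to a default.
// Same rule as the diff: must be an integer within [min, max]; anything else uses the fallback.
function boundedIntegerSetting(raw: string | undefined, fallback: number, min = 1, max = 64): number {
  const value = Number(raw); // Number(undefined) is NaN, Number('') is 0
  return Number.isInteger(value) && value >= min && value <= max ? value : fallback;
}

const fromValid = boundedIntegerSetting('8', 4);
const fromUnset = boundedIntegerSetting(undefined, 4);
const fromEmpty = boundedIntegerSetting('', 4);
const fromTooLarge = boundedIntegerSetting('65', 4);
const fromFraction = boundedIntegerSetting('2.5', 4);
const fromGarbage = boundedIntegerSetting('abc', 4);
```

Note that an empty string converts to `0`, which is an integer but fails the lower bound, so it falls back like an unset variable.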
@ -53,15 +68,45 @@ export interface KtxLocalScanEnrichmentInput {
|
|||
connectionId: string;
|
||||
mode: KtxScanMode;
|
||||
detectRelationships?: boolean;
|
||||
/**
|
||||
* Enrichment stages to (re)run this invocation. Undefined runs every eligible
|
||||
* stage and respects the completed-stage short-circuit (spec-19 resume). When
|
||||
* present, only the named stages run — each force-recomputes (bypassing the
|
||||
* short-circuit) while unselected stages are left untouched on disk (D3).
|
||||
*/
|
||||
stages?: KtxScanEnrichmentStage[];
|
||||
connector: KtxScanConnector;
|
||||
snapshot?: KtxSchemaSnapshot;
|
||||
context: KtxScanContext;
|
||||
providers: KtxLocalScanEnrichmentProviders | null;
|
||||
stateStore?: KtxScanEnrichmentStateStore | null;
|
||||
/**
|
||||
* Durable per-batch resume record for the descriptions stage. When present, an
|
||||
* interrupted descriptions stage resumes by re-enriching only the tables not
|
||||
* already flushed (inputHash-gated). Null/undefined disables incremental flush.
|
||||
*/
|
||||
descriptionResumeStore?: KtxScanDescriptionResumeStore | null;
|
||||
/**
|
||||
* Lazily loads the descriptions already persisted in the on-disk _schema, used
|
||||
* to feed embeddings + relationships their description context when the
|
||||
* descriptions stage does not run this invocation (e.g. `--stages relationships`).
|
||||
* Called at most once and only when a downstream stage needs it, so a normal
|
||||
* full run never pays the read.
|
||||
*/
|
||||
loadPriorDescriptions?: (snapshot: KtxSchemaSnapshot) => Promise<KtxLocalScanEnrichmentResult['descriptionUpdates']>;
|
||||
syncId?: string;
|
||||
providerIdentity?: Record<string, unknown>;
|
||||
/** Description-LLM identity that keys the descriptions + relationships stage hashes. */
|
||||
llmIdentity?: KtxScanLlmIdentity;
|
||||
/** Embedding-model identity that keys the embeddings stage hash. */
|
||||
embeddingIdentity?: KtxScanEmbeddingIdentity;
|
||||
relationshipSettings?: KtxScanRelationshipConfig;
|
||||
now?: () => Date;
|
||||
/**
|
||||
* Invoked once the last non-relationship stage completes and before
|
||||
* relationship detection runs, so the descriptions + embeddings reach the
|
||||
* queryable layer even if the relationship stage is later interrupted.
|
||||
*/
|
||||
onCheckpoint?: (checkpoint: KtxLocalScanEnrichmentResult) => Promise<void>;
|
||||
}
|
||||
|
||||
export interface KtxLocalScanEnrichmentResult {
|
||||
|
|
@ -80,6 +125,7 @@ export interface KtxLocalScanEnrichmentResult {
|
|||
relationshipProfile: KtxRelationshipProfileArtifact | null;
|
||||
resolvedRelationships: KtxResolvedRelationshipDiscoveryCandidate[] | null;
|
||||
compositeRelationships: KtxCompositeRelationshipCandidate[] | null;
|
||||
relationshipPartial: { reason: KtxRelationshipDetectionStopReason } | null;
|
||||
}
|
||||
|
||||
function tableId(table: KtxSchemaTable): string {
|
||||
|
|
@ -182,6 +228,17 @@ function providerlessEnrichedWarning(relationshipDetection: boolean): KtxScanWar
|
|||
};
|
||||
}
|
||||
|
||||
function stagePrerequisiteReason(stage: KtxScanEnrichmentStage): string {
|
||||
switch (stage) {
|
||||
case 'descriptions':
|
||||
return 'LLM enrichment is not configured (set scan.enrichment.mode and an LLM provider)';
|
||||
case 'embeddings':
|
||||
return 'no embedding provider is configured (set scan.enrichment.embeddings)';
|
||||
case 'relationships':
|
||||
return 'relationship discovery is disabled (scan.relationships.enabled is false)';
|
||||
}
|
||||
}
|
||||
|
||||
export function createDeterministicLocalScanEnrichmentProviders(): KtxLocalScanEnrichmentProviders {
|
||||
return {
|
||||
llmRuntime: deterministicLlmRuntime(),
|
||||
|
|
@ -209,18 +266,25 @@ function deterministicLlmRuntime(): KtxLlmRuntimePort {
|
|||
async runAgentLoop() {
|
||||
return { stopReason: 'natural' };
|
||||
},
|
||||
subprocessForkSpec() {
|
||||
return null;
|
||||
},
|
||||
};
|
||||
}
|
||||
|
||||
export function snapshotToKtxEnrichedSchema(
|
||||
snapshot: KtxSchemaSnapshot,
|
||||
embeddingsByColumnId: ReadonlyMap<string, number[]> = new Map(),
|
||||
descriptions: KtxLocalScanEnrichmentResult['descriptionUpdates'] = [],
|
||||
): KtxEnrichedSchema {
|
||||
const descriptionByTable = new Map(descriptions.map((item) => [tableRefKey(item.table), item]));
|
||||
const tables: KtxEnrichedTable[] = snapshot.tables.map((table) => {
|
||||
const id = tableId(table);
|
||||
const ref = tableRef(table);
|
||||
const tableDescription = descriptionByTable.get(tableRefKey(ref));
|
||||
const columns: KtxEnrichedColumn[] = table.columns.map((column) => {
|
||||
const idForColumn = columnId(table, column);
|
||||
const aiColumnDescription = tableDescription?.columnDescriptions[column.name] ?? null;
|
||||
return {
|
||||
id: idForColumn,
|
||||
tableId: id,
|
||||
|
|
@ -234,6 +298,7 @@ export function snapshotToKtxEnrichedSchema(
|
|||
parentColumnId: null,
|
||||
descriptions: {
|
||||
...(column.comment ? { db: column.comment } : {}),
|
||||
...(aiColumnDescription ? { ai: aiColumnDescription } : {}),
|
||||
},
|
||||
embedding: embeddingsByColumnId.get(idForColumn) ?? null,
|
||||
sampleValues: null,
|
||||
|
|
@ -246,6 +311,7 @@ export function snapshotToKtxEnrichedSchema(
|
|||
enabled: true,
|
||||
descriptions: {
|
||||
...(table.comment ? { db: table.comment } : {}),
|
||||
...(tableDescription?.tableDescription ? { ai: tableDescription.tableDescription } : {}),
|
||||
},
|
||||
columns,
|
||||
};
|
||||
|
|
@ -262,11 +328,31 @@ function embeddingBatchSize(maxBatchSize: number): number {
|
|||
return Number.isInteger(maxBatchSize) && maxBatchSize > 0 ? maxBatchSize : 100;
|
||||
}
|
||||
|
||||
type KtxScanDescriptionUpdate = KtxLocalScanEnrichmentResult['descriptionUpdates'][number];
|
||||
|
||||
// Per-batch flush cadence: bounds the at-risk window (and the manifest-rewrite /
|
||||
// git-commit cost) to a small number of tables.
|
||||
const DESCRIPTION_FLUSH_EVERY = 10;
|
||||
|
||||
function isEnrichedDescriptionUpdate(update: KtxScanDescriptionUpdate): boolean {
|
||||
return update.tableDescription !== null || Object.values(update.columnDescriptions).some((value) => value !== null);
|
||||
}
|
||||
|
||||
function nullDescriptionUpdate(table: KtxSchemaTable): KtxScanDescriptionUpdate {
|
||||
return {
|
||||
table: tableRef(table),
|
||||
tableDescription: null,
|
||||
columnDescriptions: Object.fromEntries(table.columns.map((column) => [column.name, null])),
|
||||
};
|
||||
}
|
||||
|
||||
async function generateDescriptions(input: {
|
||||
snapshot: KtxSchemaSnapshot;
|
||||
connector: KtxScanConnector;
|
||||
context: KtxScanContext;
|
||||
providers: KtxLocalScanEnrichmentProviders;
|
||||
inputHash: string;
|
||||
resumeStore?: KtxScanDescriptionResumeStore | null;
|
||||
progress?: KtxProgressPort;
|
||||
warnings?: KtxScanWarning[];
|
||||
}): Promise<KtxLocalScanEnrichmentResult['descriptionUpdates']> {
|
||||
|
|
@ -289,67 +375,139 @@ async function generateDescriptions(input: {
|
|||
},
|
||||
});
|
||||
|
||||
const updates: KtxLocalScanEnrichmentResult['descriptionUpdates'] = [];
|
||||
const totalTables = input.snapshot.tables.length;
|
||||
if (totalTables === 0) {
|
||||
await input.progress?.update(1, 'No tables to describe');
|
||||
return updates;
|
||||
return [];
|
||||
}
|
||||
|
||||
// Resume: recover already-enriched tables (inputHash-gated) and re-issue LLM
|
||||
// calls only for the remainder. A failed/skipped table carries null descriptions
|
||||
// and is not recovered, so it is retried.
|
||||
const recovered = input.resumeStore ? ((await input.resumeStore.load(input.inputHash)) ?? []) : [];
|
||||
const enrichedById = new Map<string, KtxScanDescriptionUpdate>();
|
||||
for (const update of recovered) {
|
||||
if (isEnrichedDescriptionUpdate(update)) {
|
||||
enrichedById.set(tableRefKey(update.table), update);
|
||||
}
|
||||
}
|
||||
const remaining = input.snapshot.tables.filter((table) => !enrichedById.has(tableRefKey(tableRef(table))));
|
||||
const recoveredCount = enrichedById.size;
|
||||
if (recoveredCount > 0) {
|
||||
input.context.logger?.info(
|
||||
`[enrich] resume: recovered ${recoveredCount}/${totalTables} descriptions, enriching ${remaining.length}`,
|
||||
);
|
||||
}
|
||||
|
||||
const pendingChanged = new Set<string>();
|
||||
let sinceFlush = 0;
|
||||
let flushing = false;
|
||||
const flush = async (force: boolean): Promise<void> => {
|
||||
if (!input.resumeStore || flushing || pendingChanged.size === 0) {
|
||||
return;
|
||||
}
|
||||
if (!force && sinceFlush < DESCRIPTION_FLUSH_EVERY) {
|
||||
return;
|
||||
}
|
||||
flushing = true;
|
||||
const changedTableNames = new Set(pendingChanged);
|
||||
pendingChanged.clear();
|
||||
sinceFlush = 0;
|
||||
try {
|
||||
await input.resumeStore.flush({
|
||||
inputHash: input.inputHash,
|
||||
snapshot: input.snapshot,
|
||||
descriptionUpdates: [...enrichedById.values()],
|
||||
changedTableNames,
|
||||
});
|
||||
} finally {
|
||||
flushing = false;
|
||||
}
|
||||
};
|
||||
|
||||
const limitTable = pLimit(DESCRIPTION_TABLE_CONCURRENCY);
|
||||
const tableUpdates = await Promise.all(
|
||||
input.snapshot.tables.map((table, index) =>
|
||||
await Promise.all(
|
||||
remaining.map((table, index) =>
|
||||
limitTable(async () => {
|
||||
await input.progress?.update(
|
||||
(index + 1) / totalTables,
|
||||
`Generating descriptions ${index + 1}/${totalTables} tables`,
|
||||
(recoveredCount + index + 1) / totalTables,
|
||||
`Generating descriptions ${recoveredCount + index + 1}/${totalTables} (${table.name}, ${table.columns.length} cols)`,
|
||||
{
|
||||
transient: true,
|
||||
},
|
||||
);
|
||||
const batched = await generator.generateBatchedTableDescriptions({
|
||||
connectionId: input.snapshot.connectionId,
|
||||
connector: input.connector,
|
||||
context: input.context,
|
||||
dataSourceType: input.snapshot.driver,
|
||||
supportsNestedAnalysis: input.connector.capabilities.nestedAnalysis,
|
||||
table: {
|
||||
catalog: table.catalog,
|
||||
db: table.db,
|
||||
name: table.name,
|
||||
rawDescriptions: table.comment ? { db: table.comment } : {},
|
||||
columns: table.columns.map((column) => ({
|
||||
name: column.name,
|
||||
type: column.nativeType,
|
||||
...(column.comment ? { rawDescriptions: { db: column.comment } } : {}),
|
||||
})),
|
||||
},
|
||||
});
|
||||
return {
|
||||
table: tableRef(table),
|
||||
tableDescription: batched.tableDescription,
|
||||
columnDescriptions: Object.fromEntries(batched.columnDescriptions),
|
||||
};
|
||||
// Stage-level guarantee: a single table's failure costs one missing
|
||||
// description, never the whole stage's output. (generateBatchedTableDescriptions
|
||||
// already degrades its own failures to null descriptions; this backstop keeps
|
||||
// the guarantee at the fan-out even if a future path throws.) A genuine
|
||||
// cancellation still propagates so the stage fails and resumes.
|
||||
let update: KtxScanDescriptionUpdate;
|
||||
try {
|
||||
const batched = await generator.generateBatchedTableDescriptions({
|
||||
connectionId: input.snapshot.connectionId,
|
||||
connector: input.connector,
|
||||
context: input.context,
|
||||
dataSourceType: input.snapshot.driver,
|
||||
supportsNestedAnalysis: input.connector.capabilities.nestedAnalysis,
|
||||
table: {
|
||||
catalog: table.catalog,
|
||||
db: table.db,
|
||||
name: table.name,
|
||||
rawDescriptions: table.comment ? { db: table.comment } : {},
|
||||
columns: table.columns.map((column) => ({
|
||||
name: column.name,
|
||||
type: column.nativeType,
|
||||
...(column.comment ? { rawDescriptions: { db: column.comment } } : {}),
|
||||
})),
|
||||
},
|
||||
});
|
||||
update = {
|
||||
table: tableRef(table),
|
||||
tableDescription: batched.tableDescription,
|
||||
columnDescriptions: Object.fromEntries(batched.columnDescriptions),
|
||||
};
|
||||
} catch (error) {
|
||||
if (input.context.signal?.aborted) {
|
||||
throw error;
|
||||
}
|
||||
const message = error instanceof Error ? error.message : String(error);
|
||||
input.context.logger?.warn(`[enrich] table ${table.name} failed: ${message}`);
|
||||
warningSink?.push({
|
||||
code: 'enrichment_failed',
|
||||
message: `Failed to generate description for ${table.name}: ${message}`,
|
||||
table: table.name,
|
||||
recoverable: true,
|
||||
metadata: {},
|
||||
});
|
||||
update = nullDescriptionUpdate(table);
|
||||
}
|
||||
if (isEnrichedDescriptionUpdate(update)) {
|
||||
enrichedById.set(tableRefKey(tableRef(table)), update);
|
||||
pendingChanged.add(table.name);
|
||||
sinceFlush += 1;
|
||||
await flush(false);
|
||||
}
|
||||
}),
|
||||
),
|
||||
);
|
||||
updates.push(...tableUpdates);
|
||||
await flush(true);
|
||||
await input.progress?.update(1, `Generated descriptions for ${totalTables} tables`);
|
||||
return updates;
|
||||
// Full set in snapshot order: recovered + freshly enriched, null for any still-failed.
|
||||
return input.snapshot.tables.map((table) => enrichedById.get(tableRefKey(tableRef(table))) ?? nullDescriptionUpdate(table));
|
||||
}
|
||||
|
||||
async function buildEmbeddings(input: {
|
||||
snapshot: KtxSchemaSnapshot;
|
||||
embedding: KtxEmbeddingPort;
|
||||
descriptions: KtxLocalScanEnrichmentResult['descriptionUpdates'];
|
||||
progress?: KtxProgressPort;
|
||||
}): Promise<{ updates: KtxEmbeddingUpdate[]; byColumnId: Map<string, number[]> }> {
|
||||
const descriptionByTable = new Map(input.descriptions.map((item) => [item.table.name, item]));
|
||||
// The exact per-column text fed to the embedding model. Shared by the embeddings
|
||||
// stage and the descriptionDigest so the embeddings hash content-addresses the
|
||||
// real text the model sees (D4).
|
||||
function buildKtxColumnEmbeddingTexts(
|
||||
snapshot: KtxSchemaSnapshot,
|
||||
descriptions: KtxLocalScanEnrichmentResult['descriptionUpdates'],
|
||||
): Array<{ columnId: string; text: string }> {
|
||||
const descriptionByTable = new Map(descriptions.map((item) => [tableRefKey(item.table), item]));
|
||||
const texts: Array<{ columnId: string; text: string }> = [];
|
||||
|
||||
for (const table of input.snapshot.tables) {
|
||||
const tableDescriptions = descriptionByTable.get(table.name);
|
||||
for (const table of snapshot.tables) {
|
||||
const tableDescriptions = descriptionByTable.get(tableRefKey(tableRef(table)));
|
||||
for (const column of table.columns) {
|
||||
const id = columnId(table, column);
|
||||
const text = buildKtxColumnEmbeddingText({
|
||||
tableName: table.name,
|
||||
columnName: column.name,
|
||||
|
|
@ -364,9 +522,18 @@ async function buildEmbeddings(input: {
|
|||
incoming: [],
|
||||
},
|
||||
});
|
||||
texts.push({ columnId: id, text });
|
||||
texts.push({ columnId: columnId(table, column), text });
|
||||
}
|
||||
}
|
||||
return texts;
|
||||
}
|
||||
|
||||
async function buildEmbeddings(input: {
|
||||
embedding: KtxEmbeddingPort;
|
||||
texts: Array<{ columnId: string; text: string }>;
|
||||
progress?: KtxProgressPort;
|
||||
}): Promise<{ updates: KtxEmbeddingUpdate[]; byColumnId: Map<string, number[]> }> {
|
||||
const texts = input.texts;
|
||||
|
||||
const embeddings: number[][] = [];
|
||||
const maxBatchSize = embeddingBatchSize(input.embedding.maxBatchSize);
|
||||
|
|
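The flush cadence in `generateDescriptions` above (flush after every `DESCRIPTION_FLUSH_EVERY` enriched tables, then once more with `force` at the end) can be sketched in a simplified synchronous form. This is an illustration only: `createBatcher` is a hypothetical helper, the threshold is lowered to 3 for brevity, and the `flushing` re-entrancy guard from the diff is omitted because nothing here is asynchronous.

```typescript
const FLUSH_EVERY = 3; // the diff uses 10

function createBatcher(onFlush: (names: string[]) => void) {
  const pending = new Set<string>();
  let sinceFlush = 0;

  const flush = (force: boolean): void => {
    // Nothing to write, or not enough progress yet for a non-forced flush.
    if (pending.size === 0 || (!force && sinceFlush < FLUSH_EVERY)) {
      return;
    }
    const batch = Array.from(pending);
    pending.clear();
    sinceFlush = 0;
    onFlush(batch);
  };

  return {
    // Called once per successfully enriched table.
    record(name: string): void {
      pending.add(name);
      sinceFlush += 1;
      flush(false);
    },
    // Called after the fan-out completes to persist the tail.
    finish(): void {
      flush(true);
    },
  };
}

const flushed: string[][] = [];
const batcher = createBatcher((names) => {
  flushed.push(names);
});
for (const name of ['a', 'b', 'c', 'd']) {
  batcher.record(name);
}
batcher.finish();
```

Four tables with a threshold of three produce two flushes: one for the first three tables and a forced one for the remainder, so an interruption loses at most `FLUSH_EVERY - 1` completed tables.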
@ -416,17 +583,26 @@ async function runEnrichmentStage<TOutput>(input: {
|
|||
resumedStages: KtxScanEnrichmentStage[];
|
||||
completedStages: KtxScanEnrichmentStage[];
|
||||
failedStages: KtxScanEnrichmentStage[];
|
||||
/**
|
||||
* When true the stage re-enters compute() even if a completed row matches,
|
||||
* skipping the spec-19 short-circuit. The intent of naming a stage in
|
||||
* `--stages` is "recompute this" (D3); the inner compute() still honors the
|
||||
* spec-20 per-table resume record.
|
||||
*/
|
||||
forceRecompute?: boolean;
|
||||
compute: () => Promise<TOutput>;
|
||||
}): Promise<TOutput> {
|
||||
const existing = await input.stateStore?.findCompletedStage<TOutput>({
|
||||
runId: input.runId,
|
||||
stage: input.stage,
|
||||
inputHash: input.inputHash,
|
||||
});
|
||||
if (existing) {
|
||||
input.resumedStages.push(input.stage);
|
||||
input.completedStages.push(input.stage);
|
||||
return existing.output;
|
||||
if (!input.forceRecompute) {
|
||||
const existing = await input.stateStore?.findCompletedStage<TOutput>({
|
||||
connectionId: input.connectionId,
|
||||
stage: input.stage,
|
||||
inputHash: input.inputHash,
|
||||
});
|
||||
if (existing) {
|
||||
input.resumedStages.push(input.stage);
|
||||
input.completedStages.push(input.stage);
|
||||
return existing.output;
|
||||
}
|
||||
}
|
||||
|
||||
try {
|
||||
|
|
@ -493,17 +669,39 @@ export async function runLocalScanEnrichment(
|
|||
const state = completedKtxScanEnrichmentStateSummary();
|
||||
const syncId = input.syncId ?? input.context.runId;
|
||||
const relationshipSettings = input.relationshipSettings ?? buildDefaultKtxProjectConfig().scan.relationships;
|
||||
const inputHash = computeKtxScanEnrichmentInputHash({
|
||||
snapshot,
|
||||
mode: input.mode,
|
||||
detectRelationships: input.detectRelationships ?? false,
|
||||
providerIdentity: input.providerIdentity ?? {},
|
||||
relationshipSettings,
|
||||
});
|
||||
const llmIdentity: KtxScanLlmIdentity = input.llmIdentity ?? { model: null, baseUrlConfigured: false };
|
||||
const embeddingIdentity: KtxScanEmbeddingIdentity = input.embeddingIdentity ?? {
|
||||
model: null,
|
||||
dimensions: null,
|
||||
batchSize: null,
|
||||
};
|
||||
const descriptionsHash = computeKtxDescriptionsStageHash({ snapshot, llmIdentity });
|
||||
const relationshipsHash = computeKtxRelationshipsStageHash({ snapshot, relationshipSettings, llmIdentity });
|
||||
const warnings: KtxScanWarning[] = [];
|
||||
const selectedStages = input.stages;
|
||||
const runsStage = (stage: KtxScanEnrichmentStage): boolean =>
|
||||
selectedStages === undefined || selectedStages.includes(stage);
|
||||
const forcesStage = (stage: KtxScanEnrichmentStage): boolean =>
|
||||
selectedStages !== undefined && selectedStages.includes(stage);
|
||||
|
||||
let descriptions: KtxLocalScanEnrichmentResult['descriptionUpdates'] = [];
|
||||
let descriptionsRanThisInvocation = false;
|
||||
let priorDescriptions: KtxLocalScanEnrichmentResult['descriptionUpdates'] | null | undefined;
|
||||
// Best-available descriptions for the downstream stages (embeddings,
|
||||
// relationships): fresh ones when descriptions ran this invocation, else the
|
||||
// descriptions persisted in the on-disk _schema. Behavior follows the input
|
||||
// (did descriptions run?), not which stage subset the caller selected (D5).
|
||||
const resolveDownstreamDescriptions = async (): Promise<KtxLocalScanEnrichmentResult['descriptionUpdates']> => {
|
||||
if (descriptionsRanThisInvocation) {
|
||||
return descriptions;
|
||||
}
|
||||
if (priorDescriptions === undefined) {
|
||||
priorDescriptions = input.loadPriorDescriptions ? await input.loadPriorDescriptions(snapshot) : null;
|
||||
}
|
||||
return priorDescriptions ?? [];
|
||||
};
|
||||
|
||||
let embeddingUpdates: KtxEmbeddingUpdate[] = [];
|
||||
let schema = snapshotToKtxEnrichedSchema(snapshot);
|
||||
const summary: KtxScanEnrichmentSummary = { ...skippedKtxScanEnrichmentSummary };
|
||||
const relationshipDetectionEnabled = relationshipSettings.enabled;
|
||||
const shouldDetectRelationships =
|
||||
|
|
@ -514,38 +712,70 @@ export async function runLocalScanEnrichment(
|
|||
warnings.push(providerlessEnrichedWarning(shouldDetectRelationships));
|
||||
}
|
||||
|
||||
// A stage explicitly named in --stages whose prerequisite is missing must be
|
||||
// surfaced, never silently no-op (D2).
|
||||
if (selectedStages !== undefined) {
|
||||
const stageEligible: Record<KtxScanEnrichmentStage, boolean> = {
|
||||
descriptions: input.mode === 'enriched' && input.providers != null,
|
||||
embeddings: input.mode === 'enriched' && input.providers?.embedding != null,
|
||||
relationships: shouldDetectRelationships,
|
||||
};
|
||||
for (const stage of selectedStages) {
|
||||
if (!stageEligible[stage]) {
|
||||
warnings.push({
|
||||
code: 'enrichment_stage_skipped',
|
||||
message: `Requested --stages ${stage}, but it cannot run: ${stagePrerequisiteReason(stage)}.`,
|
||||
recoverable: true,
|
||||
metadata: { stage },
|
||||
});
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
if (input.mode === 'enriched' && input.providers) {
|
||||
const providers = input.providers;
|
||||
const descriptionProgress = progress?.startPhase(0.45);
|
||||
descriptions = await runEnrichmentStage({
|
||||
stateStore: input.stateStore,
|
||||
runId: input.context.runId,
|
||||
connectionId: input.connectionId,
|
||||
syncId,
|
||||
mode: input.mode,
|
||||
stage: 'descriptions',
|
||||
inputHash,
|
||||
now,
|
||||
resumedStages: state.resumedStages,
|
||||
completedStages: state.completedStages,
|
||||
failedStages: state.failedStages,
|
||||
compute: () =>
|
||||
generateDescriptions({
|
||||
snapshot,
|
||||
connector: input.connector,
|
||||
context: input.context,
|
||||
providers,
|
||||
progress: descriptionProgress,
|
||||
warnings,
|
||||
}),
|
||||
});
|
||||
summary.dataDictionary = input.connector.sampleColumn ? 'completed' : 'skipped';
|
||||
summary.tableDescriptions = 'completed';
|
||||
summary.columnDescriptions = 'completed';
|
||||
if (runsStage('descriptions')) {
|
||||
const descriptionProgress = progress?.startPhase(0.45);
|
||||
descriptions = await runEnrichmentStage({
|
||||
stateStore: input.stateStore,
|
||||
runId: input.context.runId,
|
||||
connectionId: input.connectionId,
|
||||
syncId,
|
||||
mode: input.mode,
|
||||
stage: 'descriptions',
|
||||
inputHash: descriptionsHash,
|
||||
now,
|
||||
forceRecompute: forcesStage('descriptions'),
|
||||
resumedStages: state.resumedStages,
|
||||
completedStages: state.completedStages,
|
||||
failedStages: state.failedStages,
|
||||
compute: () =>
|
||||
generateDescriptions({
|
||||
snapshot,
|
||||
connector: input.connector,
|
||||
context: input.context,
|
||||
providers,
|
||||
inputHash: descriptionsHash,
|
||||
resumeStore: input.descriptionResumeStore,
|
||||
progress: descriptionProgress,
|
||||
warnings,
|
||||
}),
|
||||
});
|
||||
descriptionsRanThisInvocation = true;
|
||||
summary.dataDictionary = input.connector.sampleColumn ? 'completed' : 'skipped';
|
||||
summary.tableDescriptions = 'completed';
|
||||
summary.columnDescriptions = 'completed';
|
||||
}
|
||||
|
||||
const embeddingProgress = progress?.startPhase(0.2);
|
||||
const embedding = providers.embedding;
|
||||
if (embedding) {
|
||||
if (embedding && runsStage('embeddings')) {
|
||||
const embeddingProgress = progress?.startPhase(0.2);
|
||||
const embeddingTexts = buildKtxColumnEmbeddingTexts(snapshot, await resolveDownstreamDescriptions());
|
||||
const embeddingsHash = computeKtxEmbeddingsStageHash({
|
||||
snapshot,
|
||||
embeddingIdentity,
|
||||
descriptionDigest: computeKtxScanDescriptionDigest(embeddingTexts.map((item) => item.text)),
|
||||
});
|
||||
embeddingUpdates = await runEnrichmentStage({
|
||||
stateStore: input.stateStore,
|
||||
runId: input.context.runId,
|
||||
|
|
@ -553,22 +783,21 @@ export async function runLocalScanEnrichment(
|
|||
syncId,
|
||||
mode: input.mode,
|
||||
stage: 'embeddings',
|
||||
inputHash,
|
||||
inputHash: embeddingsHash,
|
||||
now,
|
||||
forceRecompute: forcesStage('embeddings'),
|
||||
resumedStages: state.resumedStages,
|
||||
completedStages: state.completedStages,
|
||||
failedStages: state.failedStages,
|
||||
compute: async () => {
|
||||
const embeddings = await buildEmbeddings({
|
||||
snapshot,
|
||||
embedding,
|
||||
descriptions,
|
||||
texts: embeddingTexts,
|
||||
progress: embeddingProgress,
|
||||
});
|
||||
return embeddings.updates;
|
||||
},
|
||||
});
|
||||
schema = snapshotToKtxEnrichedSchema(snapshot, embeddingsByColumnId(embeddingUpdates));
|
||||
summary.embeddings = 'completed';
|
||||
}
|
||||
}
|
||||
|
|
@ -577,9 +806,40 @@ export async function runLocalScanEnrichment(
|
|||
let relationshipProfile: KtxRelationshipProfileArtifact | null = null;
|
||||
let resolvedRelationships: KtxResolvedRelationshipDiscoveryCandidate[] | null = null;
|
||||
let compositeRelationships: KtxCompositeRelationshipCandidate[] | null = null;
|
||||
let relationshipPartial: { reason: KtxRelationshipDetectionStopReason } | null = null;
|
||||
let relationships: KtxScanRelationshipSummary = { accepted: 0, review: 0, rejected: 0, skipped: 0 };
|
||||
if (shouldDetectRelationships) {
|
||||
|
||||
// Promote the paid descriptions + embeddings to the queryable layer at the
|
||||
// cost boundary, before the slow, kill-prone relationship stage — so an
|
||||
// interrupted relationship stage degrades to "no joins," never "no descriptions."
|
||||
if (shouldDetectRelationships && summary.tableDescriptions === 'completed' && input.onCheckpoint) {
|
||||
await input.onCheckpoint({
|
||||
snapshot,
|
||||
summary: { ...summary },
|
||||
relationships,
|
||||
state: summarizeKtxScanEnrichmentState(state),
|
||||
warnings: [...warnings],
|
||||
descriptionUpdates: descriptions,
|
||||
embeddingUpdates,
|
||||
relationshipUpdate: null,
|
||||
relationshipProfile: null,
|
||||
resolvedRelationships: null,
|
||||
compositeRelationships: null,
|
||||
relationshipPartial: null,
|
||||
});
|
||||
}
|
||||
|
||||
if (shouldDetectRelationships && runsStage('relationships')) {
|
||||
const relationshipProgress = progress?.startPhase(0.25);
|
||||
// Relationship detection (incl. llmProposals) runs against the
|
||||
// best-available descriptions + this run's embeddings, so the join-proposal
|
||||
// prompt carries descriptions on both the full-run and relationships-only
|
||||
// paths (D5). Embeddings are this run's only — they are not re-hydrated.
|
||||
const relationshipSchema = snapshotToKtxEnrichedSchema(
|
||||
snapshot,
|
||||
embeddingsByColumnId(embeddingUpdates),
|
||||
await resolveDownstreamDescriptions(),
|
||||
);
|
||||
const relationshipStage = await runEnrichmentStage({
|
||||
stateStore: input.stateStore,
|
||||
runId: input.context.runId,
|
||||
|
|
@ -587,8 +847,9 @@ export async function runLocalScanEnrichment(
|
|||
syncId,
|
||||
mode: input.mode,
|
||||
stage: 'relationships',
|
||||
inputHash,
|
||||
inputHash: relationshipsHash,
|
||||
now,
|
||||
forceRecompute: forcesStage('relationships'),
|
||||
resumedStages: state.resumedStages,
|
||||
completedStages: state.completedStages,
|
||||
failedStages: state.failedStages,
|
||||
|
|
@ -598,10 +859,12 @@ export async function runLocalScanEnrichment(
|
|||
connectionId: input.connectionId,
|
||||
dialect,
|
||||
connector: input.connector,
|
||||
schema,
|
||||
schema: relationshipSchema,
|
||||
context: input.context,
|
||||
settings: relationshipSettings,
|
||||
llmRuntime: input.providers?.llmRuntime ?? null,
|
||||
...(relationshipProgress ? { progress: relationshipProgress } : {}),
|
||||
...(input.now ? { now: () => input.now!().getTime() } : {}),
|
||||
});
|
||||
|
||||
await relationshipProgress?.update(
|
||||
|
|
@ -617,6 +880,7 @@ export async function runLocalScanEnrichment(
|
|||
statisticalValidation: detection.statisticalValidation,
|
||||
llmRelationshipValidation: detection.llmRelationshipValidation,
|
||||
warnings: detection.warnings,
|
||||
partial: detection.partial,
|
||||
};
|
||||
},
|
||||
});
|
||||
|
|
@ -629,21 +893,77 @@ export async function runLocalScanEnrichment(
|
|||
resolvedRelationships = relationshipStage.resolvedRelationships;
|
||||
compositeRelationships = relationshipStage.compositeRelationships;
|
||||
relationships = relationshipStage.relationships;
|
||||
relationshipPartial = relationshipStage.partial;
|
||||
warnings.push(...relationshipStage.warnings);
|
||||
if (relationshipPartial) {
|
||||
warnings.push({
|
||||
code: 'relationship_detection_partial',
|
||||
message:
|
||||
relationshipPartial.reason === 'aborted'
|
||||
? 'Relationship detection was cancelled before completing; the joins found so far are partial.'
|
||||
: 'Relationship detection hit its wall-clock budget (scan.relationships.detectionBudgetMs) before completing; the joins found so far are partial. Raise the budget to run a fuller pass.',
|
||||
recoverable: true,
|
||||
metadata: { reason: relationshipPartial.reason },
|
||||
});
|
||||
}
|
||||
}
|
||||
|
||||
// Derived staleness: after a selective run, surface (never silently leave) any
|
||||
// unselected stage whose stored hash no longer matches its current inputs (D4).
|
||||
// The embeddings hash includes the description digest, so a re-describe makes
|
||||
// embeddings diverge here; relationships are deliberately decoupled (D5) and so
|
||||
// never diverge from a description change.
|
||||
if (selectedStages !== undefined && input.stateStore) {
|
||||
const currentStageHash: Record<KtxScanEnrichmentStage, () => Promise<string>> = {
|
||||
descriptions: () => Promise.resolve(descriptionsHash),
|
||||
relationships: () => Promise.resolve(relationshipsHash),
|
||||
embeddings: async () => {
|
||||
const embeddingTexts = buildKtxColumnEmbeddingTexts(snapshot, await resolveDownstreamDescriptions());
|
||||
return computeKtxEmbeddingsStageHash({
|
||||
snapshot,
|
||||
embeddingIdentity,
|
||||
descriptionDigest: computeKtxScanDescriptionDigest(embeddingTexts.map((item) => item.text)),
|
||||
});
|
||||
},
|
||||
};
|
||||
for (const stage of KTX_SCAN_ENRICHMENT_STAGES) {
|
||||
if (selectedStages.includes(stage)) {
|
||||
continue;
|
||||
}
|
||||
const completed = await input.stateStore.findLatestCompletedStage({ connectionId: input.connectionId, stage });
|
||||
if (!completed) {
|
||||
continue;
|
||||
}
|
||||
if (completed.inputHash !== (await currentStageHash[stage]())) {
|
||||
warnings.push({
|
||||
code: 'enrichment_stage_stale',
|
||||
message: `The ${stage} enrichment stage is now stale: its inputs changed since it last ran. Refresh it with \`ktx ingest ${input.connectionId} --stages ${stage}\`.`,
|
||||
recoverable: true,
|
||||
metadata: { stage },
|
||||
});
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
await progress?.update(1, 'Enrichment complete');
|
||||
// The manifest merge treats ai/db descriptions as scan-managed and overwrites
|
||||
// them with whatever this run emits, so a subset run that skips descriptions
|
||||
// must still emit the prior on-disk ones — else the write deletes them (D3
|
||||
// "unselected stages are left untouched on disk"). Fresh-this-run if descriptions
|
||||
// ran, else loaded from the on-disk _schema.
|
||||
const writtenDescriptionUpdates = await resolveDownstreamDescriptions();
|
||||
return {
|
||||
snapshot,
|
||||
summary,
|
||||
relationships,
|
||||
state: summarizeKtxScanEnrichmentState(state),
|
||||
warnings,
|
||||
descriptionUpdates: descriptions,
|
||||
descriptionUpdates: writtenDescriptionUpdates,
|
||||
embeddingUpdates,
|
||||
relationshipUpdate,
|
||||
relationshipProfile,
|
||||
resolvedRelationships,
|
||||
compositeRelationships,
|
||||
relationshipPartial,
|
||||
};
|
||||
}
|
||||
|
|
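The `runsStage` / `forcesStage` closures in `runLocalScanEnrichment` give `--stages` two effects at once: a named stage is the only one that runs, and it bypasses the completed-stage cache via `forceRecompute`. A minimal sketch of that truth table follows. `planStages` and `StagePlan` are hypothetical names introduced for illustration, and the stage ordering is an assumption taken from the `stageEligible` record.

```typescript
// Stage names as they appear in the diff; the ordering is an assumption.
const STAGES = ['descriptions', 'embeddings', 'relationships'] as const;
type Stage = (typeof STAGES)[number];

interface StagePlan {
  stage: Stage;
  runs: boolean;
  forceRecompute: boolean;
}

// Hypothetical helper that mirrors the two closures from the diff.
function planStages(selected: readonly Stage[] | undefined): StagePlan[] {
  // No selection: every eligible stage runs and may resume from stored state.
  const runsStage = (stage: Stage): boolean => selected === undefined || selected.includes(stage);
  // Explicit selection: the named stages run and skip the completed-stage lookup.
  const forcesStage = (stage: Stage): boolean => selected !== undefined && selected.includes(stage);
  return STAGES.map((stage) => ({
    stage,
    runs: runsStage(stage),
    forceRecompute: forcesStage(stage),
  }));
}

const fullRun = planStages(undefined);
const subsetRun = planStages(['embeddings']);
console.log(fullRun);
console.log(subsetRun);
```

In this sketch a full run never forces a recompute, while a subset run forces exactly the stages it names. Eligibility (mode, providers, relationship settings) is a separate check in the real function and is not modelled here.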
|
|||
|
|
@ -6,25 +6,36 @@ import { getLocalStageOnlyIngestStatus, type LocalIngestRunRecord, runLocalStage
|
|||
import type { SourceAdapter } from '../../context/ingest/types.js';
|
||||
import { createLocalKtxLlmRuntimeFromConfig } from '../../context/llm/local-config.js';
|
||||
import { KtxScanEmbeddingPortAdapter } from '../../context/llm/embedding-port.js';
|
||||
import type { KtxProjectLlmConfig, KtxScanEnrichmentConfig, KtxScanRelationshipConfig } from '../project/config.js';
|
||||
import type { KtxProjectLlmConfig, KtxScanEnrichmentConfig } from '../project/config.js';
|
||||
import type { KtxLocalProject } from '../../context/project/project.js';
|
||||
import { ktxLocalStateDbPath } from '../project/local-state-db.js';
|
||||
import { redactKtxScanReport } from './credentials.js';
|
||||
import { resolveEnabledTables } from './enabled-tables.js';
|
||||
import { completedKtxScanEnrichmentStateSummary } from './enrichment-state.js';
|
||||
import {
|
||||
completedKtxScanEnrichmentStateSummary,
|
||||
type KtxScanEmbeddingIdentity,
|
||||
type KtxScanLlmIdentity,
|
||||
} from './enrichment-state.js';
|
||||
import { failedKtxScanEnrichmentSummary, ktxScanErrorMessage } from './enrichment-summary.js';
|
||||
import {
|
||||
createDeterministicLocalScanEnrichmentProviders,
|
||||
type KtxLocalScanEnrichmentProviders,
|
||||
runLocalScanEnrichment,
|
||||
} from './local-enrichment.js';
|
||||
import { writeLocalScanEnrichmentArtifacts, writeLocalScanManifestShards } from './local-enrichment-artifacts.js';
|
||||
import {
|
||||
createKtxScanDescriptionResumeStore,
|
||||
loadOnDiskDescriptionUpdates,
|
||||
writeLocalScanEnrichmentArtifacts,
|
||||
writeLocalScanEnrichmentCheckpoint,
|
||||
writeLocalScanManifestShards,
|
||||
} from './local-enrichment-artifacts.js';
|
||||
import { readLocalScanStructuralSnapshot } from './local-structural-artifacts.js';
|
||||
import { SqliteLocalScanEnrichmentStateStore } from './sqlite-local-enrichment-state-store.js';
|
||||
import type {
|
||||
KtxConnectionDriver,
|
||||
KtxProgressPort,
|
||||
KtxScanConnector,
|
||||
KtxScanEnrichmentStage,
|
||||
KtxScanEnrichmentStateSummary,
|
||||
KtxScanMode,
|
||||
KtxScanReport,
|
||||
|
|
@ -68,6 +79,8 @@ export interface RunLocalScanOptions {
|
|||
connectionId: string;
|
||||
mode?: KtxScanMode;
|
||||
detectRelationships?: boolean;
|
||||
/** Enrichment stages to (re)run; omit to run all eligible stages. */
|
||||
stages?: KtxScanEnrichmentStage[];
|
||||
dryRun?: boolean;
|
||||
trigger?: KtxScanTrigger;
|
||||
databaseIntrospectionUrl?: string;
|
||||
|
|
@ -80,6 +93,7 @@ export interface RunLocalScanOptions {
|
|||
enrichmentStateStore?: SqliteLocalScanEnrichmentStateStore | null;
|
||||
progress?: KtxProgressPort;
|
||||
embeddingProvider?: KtxEmbeddingProvider | null;
|
||||
signal?: AbortSignal;
|
||||
}
|
||||
|
||||
export interface LocalScanRunResult {
|
||||
|
|
@ -233,19 +247,18 @@ function createLocalScanEnrichmentStateStore(options: RunLocalScanOptions): Sqli
|
|||
return new SqliteLocalScanEnrichmentStateStore({ dbPath: ktxLocalStateDbPath(options.project) });
|
||||
}
|
||||
|
||||
function localScanProviderIdentity(
|
||||
config: KtxScanEnrichmentConfig,
|
||||
llmConfig: KtxProjectLlmConfig,
|
||||
relationships: KtxScanRelationshipConfig,
|
||||
): Record<string, unknown> {
|
||||
function localScanLlmIdentity(llmConfig: KtxProjectLlmConfig): KtxScanLlmIdentity {
|
||||
return {
|
||||
mode: config.mode,
|
||||
embeddingDimensions: config.embeddings?.dimensions ?? null,
|
||||
llmModel: llmConfig.models.default ?? null,
|
||||
embeddingModel: config.embeddings?.model ?? null,
|
||||
batchSize: config.embeddings?.batchSize ?? null,
|
||||
model: llmConfig.models.default ?? null,
|
||||
baseUrlConfigured: Boolean(llmConfig.provider.gateway?.base_url),
|
||||
relationships,
|
||||
};
|
||||
}
|
||||
|
||||
function localScanEmbeddingIdentity(config: KtxScanEnrichmentConfig): KtxScanEmbeddingIdentity {
|
||||
return {
|
||||
model: config.embeddings?.model ?? null,
|
||||
dimensions: config.embeddings?.dimensions ?? null,
|
||||
batchSize: config.embeddings?.batchSize ?? null,
|
||||
};
|
||||
}
|
||||
|
||||
|
|
@ -458,6 +471,13 @@ export async function runLocalScan(options: RunLocalScanOptions): Promise<LocalS
|
|||
const enrichmentStateStore = connector ? createLocalScanEnrichmentStateStore(options) : null;
|
||||
let enrichmentState: KtxScanEnrichmentStateSummary = completedKtxScanEnrichmentStateSummary();
|
||||
let enrichmentSnapshot: KtxSchemaSnapshot | null = null;
|
||||
// On a `--stages` subset run, the structural manifest write below (and the
|
||||
// later enrichment write) merge with on-disk shards, but the merge treats ai/db
|
||||
// descriptions as scan-managed and overwrites them with whatever the run emits.
|
||||
// A subset that skips `descriptions` emits none, so without this the structural
|
||||
// write would delete the prior descriptions before enrichment can preserve them.
|
||||
// Capture them up front (only for subset runs) and feed them to both writes.
|
||||
let priorDescriptionUpdates: Awaited<ReturnType<typeof loadOnDiskDescriptionUpdates>> | null = null;
|
||||
if (!reusedExistingScanArtifacts && !report.dryRun && report.artifactPaths.rawSourcesDir) {
|
||||
await options.progress?.update(0.7, 'Writing schema artifacts');
|
||||
const rawSnapshot = await readLocalScanStructuralSnapshot({
|
||||
|
|
@ -471,12 +491,20 @@ export async function runLocalScan(options: RunLocalScanOptions): Promise<LocalS
|
|||
if (rawSnapshot.warnings?.length) {
|
||||
report.warnings.push(...rawSnapshot.warnings);
|
||||
}
|
||||
if (options.stages !== undefined && connector) {
|
||||
priorDescriptionUpdates = await loadOnDiskDescriptionUpdates(
|
||||
options.project,
|
||||
options.connectionId,
|
||||
rawSnapshot,
|
||||
);
|
||||
}
|
||||
const manifestArtifacts = await writeLocalScanManifestShards({
|
||||
project: options.project,
|
||||
connectionId: options.connectionId,
|
||||
syncId: record.syncId,
|
||||
driver,
|
||||
snapshot: rawSnapshot,
|
||||
...(priorDescriptionUpdates ? { descriptionUpdates: priorDescriptionUpdates } : {}),
|
||||
dryRun: false,
|
||||
});
|
||||
report.artifactPaths.manifestShards = manifestArtifacts.manifestShards;
|
||||
|
|
@ -494,19 +522,43 @@ export async function runLocalScan(options: RunLocalScanOptions): Promise<LocalS
|
|||
connectionId: options.connectionId,
|
||||
mode,
|
||||
detectRelationships: options.detectRelationships,
|
||||
...(options.stages ? { stages: options.stages } : {}),
|
||||
connector,
|
||||
...(enrichmentSnapshot ? { snapshot: enrichmentSnapshot } : {}),
|
||||
context: { runId: record.runId, progress: options.progress?.startPhase(0.18) },
|
||||
context: {
|
||||
runId: record.runId,
|
||||
...(options.signal ? { signal: options.signal } : {}),
|
||||
...(options.progress ? { progress: options.progress.startPhase(0.18) } : {}),
|
||||
},
|
||||
providers: enrichmentProviders,
|
||||
stateStore: enrichmentStateStore,
|
||||
descriptionResumeStore: options.dryRun
|
||||
? null
|
||||
: createKtxScanDescriptionResumeStore({
|
||||
project: options.project,
|
||||
connectionId: options.connectionId,
|
||||
syncId: record.syncId,
|
||||
driver,
|
||||
}),
|
||||
syncId: record.syncId,
|
||||
providerIdentity: localScanProviderIdentity(
|
||||
options.project.config.scan.enrichment,
|
||||
options.project.config.llm,
|
||||
options.project.config.scan.relationships,
|
||||
),
|
||||
loadPriorDescriptions: (enrichedSnapshot) =>
|
||||
priorDescriptionUpdates
|
||||
? Promise.resolve(priorDescriptionUpdates)
|
||||
: loadOnDiskDescriptionUpdates(options.project, options.connectionId, enrichedSnapshot),
|
||||
llmIdentity: localScanLlmIdentity(options.project.config.llm),
|
||||
embeddingIdentity: localScanEmbeddingIdentity(options.project.config.scan.enrichment),
|
||||
relationshipSettings: options.project.config.scan.relationships,
|
||||
now: options.now,
|
||||
onCheckpoint: async (checkpoint) => {
|
||||
await writeLocalScanEnrichmentCheckpoint({
|
||||
project: options.project,
|
||||
connectionId: options.connectionId,
|
||||
syncId: record.syncId,
|
||||
driver,
|
||||
enrichment: checkpoint,
|
||||
dryRun: options.dryRun ?? false,
|
||||
});
|
||||
},
|
||||
});
|
||||
const artifacts = await writeLocalScanEnrichmentArtifacts({
|
||||
project: options.project,
|
||||
|
|
|
|||
|
|
@ -45,8 +45,14 @@ const scanWarningCodes = new Set<KtxScanWarning['code']>([
|
|||
'enrichment_failed',
|
||||
'description_fallback_used',
|
||||
'constraint_discovery_unauthorized',
|
||||
'object_introspection_failed',
|
||||
]);
|
||||
|
||||
/** @internal */
|
||||
export function isKtxScanWarningCode(code: string): code is KtxScanWarning['code'] {
|
||||
return scanWarningCodes.has(code as KtxScanWarning['code']);
|
||||
}
|
||||
|
||||
function parseWarning(rawWarning: unknown, path: string): KtxScanWarning {
|
||||
if (
|
||||
!isRecord(rawWarning) ||
|
||||
|
|
|
|||
50
packages/cli/src/context/scan/object-introspection.ts
Normal file
|
|
@ -0,0 +1,50 @@
import { isNativeProgrammingFault } from '../../errors.js';
import type { KtxScanWarning } from './types.js';

export interface IntrospectObjectContext {
  /** Bare object name (table or view). */
  object: string;
  catalog?: string | null;
  db?: string | null;
}

export type IntrospectObjectOutcome<T> = { ok: true; table: T } | { ok: false; warning: KtxScanWarning };

function objectLabel(ctx: IntrospectObjectContext): string {
  return [ctx.catalog, ctx.db, ctx.object].filter((part): part is string => Boolean(part)).join('.');
}

function objectIntrospectionWarning(ctx: IntrospectObjectContext, error: unknown): KtxScanWarning {
  const reason = error instanceof Error ? error.message : String(error);
  return {
    code: 'object_introspection_failed',
    message: reason,
    table: ctx.object,
    recoverable: true,
    metadata: {
      object: objectLabel(ctx),
      ...(ctx.db ? { db: ctx.db } : {}),
      ...(ctx.catalog ? { catalog: ctx.catalog } : {}),
    },
  };
}

/**
 * Runs a single-object metadata/profiling read and isolates its failure: a
 * broken or inaccessible object becomes a recoverable warning instead of
 * aborting the whole scan. Native programming faults (a ktx bug, not a broken
 * object) still propagate so they are not masked as object skips.
 */
export async function tryIntrospectObject<T>(
  ctx: IntrospectObjectContext,
  fn: () => T | Promise<T>,
): Promise<IntrospectObjectOutcome<T>> {
  try {
    return { ok: true, table: await fn() };
  } catch (error) {
    if (isNativeProgrammingFault(error)) {
      throw error;
    }
    return { ok: false, warning: objectIntrospectionWarning(ctx, error) };
  }
}
|
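The outcome type returned by `tryIntrospectObject` lets a connector keep scanning when one object fails. The sketch below shows one way a caller might consume it. Everything here is a simplified stand-in: `ScanWarning`, `tryIntrospect`, `readColumns`, and `scanAll` are hypothetical, and `isProgrammingFault` is only an assumed approximation of `isNativeProgrammingFault`, whose real definition is not part of this diff.

```typescript
// Simplified stand-ins for types that live elsewhere in the repository.
interface ScanWarning {
  code: 'object_introspection_failed';
  message: string;
  table: string;
  recoverable: true;
}
type Outcome<T> = { ok: true; table: T } | { ok: false; warning: ScanWarning };

// ASSUMPTION: approximates isNativeProgrammingFault; the real check is not shown in the diff.
function isProgrammingFault(error: unknown): boolean {
  return error instanceof TypeError || error instanceof ReferenceError || error instanceof RangeError;
}

// Same shape as tryIntrospectObject, reduced to a bare object name.
async function tryIntrospect<T>(object: string, fn: () => T | Promise<T>): Promise<Outcome<T>> {
  try {
    return { ok: true, table: await fn() };
  } catch (error) {
    if (isProgrammingFault(error)) {
      throw error; // a bug in the scanner must not be reported as a skipped object
    }
    const message = error instanceof Error ? error.message : String(error);
    return { ok: false, warning: { code: 'object_introspection_failed', message, table: object, recoverable: true } };
  }
}

// Hypothetical per-object reader in which one view is broken.
async function readColumns(name: string): Promise<string[]> {
  if (name === 'broken_view') {
    throw new Error('view references a dropped table');
  }
  return ['id', 'created_at'];
}

async function scanAll(names: readonly string[]) {
  const tables: Array<{ name: string; columns: string[] }> = [];
  const warnings: ScanWarning[] = [];
  for (const name of names) {
    const outcome = await tryIntrospect(name, () => readColumns(name));
    if (outcome.ok === true) {
      tables.push({ name, columns: outcome.table });
    } else {
      warnings.push(outcome.warning);
    }
  }
  return { tables, warnings };
}

const scanResult = scanAll(['orders', 'broken_view', 'customers']);
scanResult.then(({ tables, warnings }) => console.log(tables.length, warnings.length));
```

The broken view turns into one recoverable warning and the two healthy tables are still collected, which is the degradation the doc comment describes.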
|
@ -1,10 +1,11 @@
|
|||
import type { KtxSqlDialect } from '../connections/dialects.js';
|
||||
import type { KtxEnrichedColumn, KtxEnrichedSchema, KtxEnrichedTable, KtxRelationshipType } from './enrichment-types.js';
|
||||
import type { KtxRelationshipDetectionBudget } from './relationship-detection-budget.js';
|
||||
import {
|
||||
type KtxRelationshipProfileArtifact,
|
||||
type KtxRelationshipReadOnlyExecutor,
|
||||
} from './relationship-profiling.js';
|
||||
import type { KtxQueryResult, KtxScanContext, KtxTableRef } from './types.js';
|
||||
import type { KtxProgressPort, KtxQueryResult, KtxScanContext, KtxTableRef } from './types.js';
|
||||
|
||||
type KtxCompositeRelationshipStatus = 'accepted' | 'review' | 'rejected';
|
||||
|
||||
|
|
@ -66,6 +67,8 @@ export interface DiscoverKtxCompositeRelationshipsInput {
|
|||
minPrimaryKeyUniqueness?: number;
|
||||
minSourceCoverage?: number;
|
||||
maxViolationRatio?: number;
|
||||
budget?: KtxRelationshipDetectionBudget;
|
||||
progress?: KtxProgressPort;
|
||||
}
|
||||
|
||||
export interface DiscoverKtxCompositeRelationshipsResult {
|
||||
|
|
@ -536,7 +539,13 @@ export async function discoverKtxCompositeRelationships(
|
|||
const primaryKeys: KtxCompositePrimaryKeyCandidate[] = [];
|
||||
let queryCount = 0;
|
||||
|
||||
for (const table of tables) {
|
||||
for (const [index, table] of tables.entries()) {
|
||||
if (input.budget?.check()) {
|
||||
break;
|
||||
}
|
||||
await input.progress?.update((index + 1) / tables.length, `Probing composite keys ${index + 1}/${tables.length}`, {
|
||||
transient: true,
|
||||
});
|
||||
const result = await detectCompositePrimaryKeys({
|
||||
connectionId: input.connectionId,
|
||||
dialect: input.dialect,
|
||||
|
|
@ -554,6 +563,9 @@ export async function discoverKtxCompositeRelationships(
|
|||
|
||||
const relationships: KtxCompositeRelationshipCandidate[] = [];
|
||||
for (const targetKey of primaryKeys) {
|
||||
if (input.budget?.check()) {
|
||||
break;
|
||||
}
|
||||
const targetTable = tableByName.get(targetKey.table.name);
|
||||
if (!targetTable) {
|
||||
continue;
|
||||
|
|
@ -568,6 +580,9 @@ export async function discoverKtxCompositeRelationships(
|
|||
}
|
||||
|
||||
for (const sourceTable of tables) {
|
||||
if (input.budget?.check()) {
|
||||
break;
|
||||
}
|
||||
if (sourceTable.id === targetTable.id) {
|
||||
continue;
|
||||
}
|
||||
|
|
|
|||
|
|
@ -0,0 +1,93 @@
export type KtxRelationshipDetectionStopReason = 'budget' | 'aborted';

export interface KtxRelationshipDetectionBudget {
  /**
   * Returns a stop reason when the relationship stage must stop scheduling new
   * work, else null. Calling it at a unit boundary records the first observed
   * stop so the stage can be finalized as partial.
   */
  check(): KtxRelationshipDetectionStopReason | null;
  /** The first stop reason observed via check(), or null if the stage ran to completion. */
  stopReason(): KtxRelationshipDetectionStopReason | null;
}

export interface CreateKtxRelationshipDetectionBudgetInput {
  budgetMs: number;
  signal?: AbortSignal;
  now?: () => number;
}

export function createKtxRelationshipDetectionBudget(
  input: CreateKtxRelationshipDetectionBudgetInput,
): KtxRelationshipDetectionBudget {
  const now = input.now ?? (() => Date.now());
  const deadline = now() + Math.max(0, input.budgetMs);
  let tripped: KtxRelationshipDetectionStopReason | null = null;
  return {
    check() {
      if (input.signal?.aborted) {
        tripped = 'aborted';
        return 'aborted';
      }
      if (now() >= deadline) {
        tripped ??= 'budget';
        return 'budget';
      }
      return null;
    },
    stopReason() {
      return tripped;
    },
  };
}

export interface MapWithBudgetInput<TInput, TOutput> {
  inputs: readonly TInput[];
  concurrency: number;
  budget?: KtxRelationshipDetectionBudget;
  onStart?: (index: number, total: number, item: TInput) => Promise<void> | void;
  mapOne: (item: TInput, index: number) => Promise<TOutput>;
}

export interface MapWithBudgetResult<TOutput> {
  /** Output aligned with inputs; entries skipped on budget exhaustion are undefined. */
  results: Array<TOutput | undefined>;
  processedCount: number;
}

/**
 * Concurrent map that stops claiming new items once the budget trips. In-flight
 * items finish; pending items are left undefined. With no budget it is a plain
 * bounded-concurrency map.
 */
export async function mapWithBudget<TInput, TOutput>(
  input: MapWithBudgetInput<TInput, TOutput>,
): Promise<MapWithBudgetResult<TOutput>> {
  const total = input.inputs.length;
  const results: Array<TOutput | undefined> = new Array(total);
  const safeConcurrency = Math.max(1, Math.floor(input.concurrency));
  let nextIndex = 0;
  let processedCount = 0;

  async function worker(): Promise<void> {
    while (true) {
      const index = nextIndex;
      if (index >= total) {
        return;
      }
      // Check the budget only when work remains, so a deadline that elapses
      // after the last item is claimed never marks a fully-processed stage partial.
      if (input.budget?.check()) {
        return;
      }
      nextIndex += 1;
      const item = input.inputs[index] as TInput;
      await input.onStart?.(index, total, item);
      results[index] = await input.mapOne(item, index);
      processedCount += 1;
    }
  }

  await Promise.all(Array.from({ length: Math.min(safeConcurrency, total) }, () => worker()));
  return { results, processedCount };
}
|
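The injectable `now` option makes the budget's behavior easy to check without real timers. The sketch below restates the factory in condensed form so that it runs on its own, then drives it with a fake clock. `createBudget`, the table names, and the plain `{ aborted }` object standing in for an `AbortSignal` are assumptions made for illustration.

```typescript
type StopReason = 'budget' | 'aborted';

interface DetectionBudget {
  check(): StopReason | null;
  stopReason(): StopReason | null;
}

// Condensed restatement of createKtxRelationshipDetectionBudget; the signal is
// typed structurally so the sketch does not depend on DOM or Node typings.
function createBudget(input: {
  budgetMs: number;
  signal?: { readonly aborted: boolean };
  now?: () => number;
}): DetectionBudget {
  const now = input.now ?? (() => Date.now());
  const deadline = now() + Math.max(0, input.budgetMs);
  let tripped: StopReason | null = null;
  return {
    check() {
      if (input.signal?.aborted) {
        tripped = 'aborted';
        return 'aborted';
      }
      if (now() >= deadline) {
        tripped ??= 'budget';
        return 'budget';
      }
      return null;
    },
    stopReason() {
      return tripped;
    },
  };
}

// Fake clock: each probed table "costs" 30 ms against a 50 ms budget.
let clock = 1_000;
const signal = { aborted: false };
const budget = createBudget({ budgetMs: 50, signal, now: () => clock });

const probed: string[] = [];
for (const table of ['orders', 'customers', 'line_items']) {
  if (budget.check()) {
    break; // stop scheduling new work at the unit boundary
  }
  probed.push(table);
  clock += 30;
}
const reasonAfterLoop = budget.stopReason();

// An abort observed later replaces the recorded reason, because the abort
// branch assigns unconditionally while the budget branch uses ??=.
signal.aborted = true;
budget.check();
const reasonAfterAbort = budget.stopReason();

console.log(probed, reasonAfterLoop, reasonAfterAbort);
```

With these numbers the deadline falls at 1050 ms on the fake clock, so the third table is never probed and the stage would be finalized as partial with reason `budget`.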
|
@ -79,6 +79,8 @@ export interface KtxRelationshipDiagnosticsArtifact {
|
|||
generatedAt: string;
|
||||
summary: KtxRelationshipDiagnosticsSummary;
|
||||
noAcceptedReason: string | null;
|
||||
partial: boolean;
|
||||
partialReason: string | null;
|
||||
candidateCountsBySource: Record<string, number>;
|
||||
validation: KtxRelationshipDiagnosticsValidation;
|
||||
thresholds: KtxRelationshipDiagnosticsThresholds;
|
||||
|
|
@ -101,6 +103,7 @@ export interface BuildKtxRelationshipDiagnosticsInput {
|
|||
warnings?: readonly KtxScanWarning[];
|
||||
thresholds?: Partial<KtxRelationshipDiagnosticsThresholds>;
|
||||
policy?: Partial<KtxRelationshipDiagnosticsPolicy>;
|
||||
partial?: { reason: string } | null;
|
||||
generatedAt?: string;
|
||||
}
|
||||
|
||||
|
|
@ -352,6 +355,8 @@ export function buildKtxRelationshipDiagnostics(
|
|||
generatedAt: input.generatedAt ?? new Date().toISOString(),
|
||||
summary,
|
||||
noAcceptedReason: noAcceptedReason({ artifacts: input.artifacts, profile: input.profile }),
|
||||
partial: Boolean(input.partial),
|
||||
partialReason: input.partial?.reason ?? null,
|
||||
candidateCountsBySource: candidateCountsBySource(input.artifacts),
|
||||
validation: {
|
||||
available: input.profile.sqlAvailable,
|
||||
|
|
|
|||
|
|
@ -11,6 +11,11 @@ import {
|
|||
discoverKtxCompositeRelationships,
|
||||
type KtxCompositeRelationshipCandidate,
|
||||
} from './relationship-composite-candidates.js';
|
||||
import {
|
||||
createKtxRelationshipDetectionBudget,
|
||||
type KtxRelationshipDetectionBudget,
|
||||
type KtxRelationshipDetectionStopReason,
|
||||
} from './relationship-detection-budget.js';
|
||||
import { collectKtxFormalMetadataRelationships } from './relationship-formal-metadata.js';
|
||||
import {
|
||||
type KtxResolvedRelationshipDiscoveryCandidate,
|
||||
|
|
@ -25,6 +30,7 @@ import {
|
|||
} from './relationship-profiling.js';
|
||||
import { validateKtxRelationshipDiscoveryCandidates } from './relationship-validation.js';
|
||||
import type {
|
||||
KtxProgressPort,
|
||||
KtxScanConnector,
|
||||
KtxScanContext,
|
||||
KtxScanEnrichmentSummary,
|
||||
|
|
@ -40,6 +46,8 @@ export interface DiscoverKtxRelationshipsInput {
|
|||
context: KtxScanContext;
|
||||
settings: KtxScanRelationshipConfig;
|
||||
llmRuntime?: KtxLlmRuntimePort | null;
|
||||
progress?: KtxProgressPort;
|
||||
now?: () => number;
|
||||
}
|
||||
|
||||
export interface DiscoverKtxRelationshipsResult {
|
||||
|
|
@ -51,6 +59,7 @@ export interface DiscoverKtxRelationshipsResult {
|
|||
statisticalValidation: KtxScanEnrichmentSummary['statisticalValidation'];
|
||||
llmRelationshipValidation: KtxScanEnrichmentSummary['llmRelationshipValidation'];
|
||||
warnings: KtxScanWarning[];
|
||||
partial: { reason: KtxRelationshipDetectionStopReason } | null;
|
||||
}
|
||||
|
||||
function relationshipFromResolved(candidate: KtxResolvedRelationshipDiscoveryCandidate): KtxEnrichedRelationship {
|
||||
|
|
@ -128,6 +137,8 @@ async function detectCompositeRelationships(input: {
|
|||
executor: KtxRelationshipReadOnlyExecutor | null;
|
||||
context: DiscoverKtxRelationshipsInput['context'];
|
||||
warnings: KtxScanWarning[];
|
||||
budget: KtxRelationshipDetectionBudget;
|
||||
progress?: KtxProgressPort;
|
||||
}): Promise<KtxCompositeRelationshipCandidate[]> {
|
||||
if (!input.executor || !input.profile.sqlAvailable || !input.dialect) {
|
||||
return [];
|
||||
|
|
@ -141,6 +152,8 @@ async function detectCompositeRelationships(input: {
|
|||
profiles: input.profile,
|
||||
executor: input.executor,
|
||||
ctx: input.context,
|
||||
budget: input.budget,
|
||||
...(input.progress ? { progress: input.progress } : {}),
|
||||
});
|
||||
for (const warning of compositeDetection.warnings) {
|
||||
input.warnings.push({
|
||||
|
|
@ -220,6 +233,11 @@ export async function discoverKtxRelationships(
|
|||
input: DiscoverKtxRelationshipsInput,
|
||||
): Promise<DiscoverKtxRelationshipsResult> {
|
||||
const { executor, warnings } = sqlExecutor(input);
|
||||
const budget = createKtxRelationshipDetectionBudget({
|
||||
budgetMs: input.settings.detectionBudgetMs,
|
||||
...(input.context.signal ? { signal: input.context.signal } : {}),
|
||||
...(input.now ? { now: input.now } : {}),
|
||||
});
|
||||
const formalMetadata = collectKtxFormalMetadataRelationships(input.schema);
|
||||
const profileCache = createKtxRelationshipProfileCache();
|
||||
const profile = await profileKtxRelationshipSchema({
|
||||
|
|
@ -232,6 +250,8 @@ export async function discoverKtxRelationships(
|
|||
profileSampleRows: input.settings.profileSampleRows,
|
||||
profileConcurrency: input.settings.profileConcurrency,
|
||||
cache: profileCache,
|
||||
budget,
|
||||
...(input.progress ? { progress: input.progress } : {}),
|
||||
});
|
||||
const deterministicCandidates: KtxRelationshipDiscoveryCandidate[] = generateKtxRelationshipDiscoveryCandidates(
|
||||
input.schema,
|
||||
|
|
@ -240,17 +260,21 @@ export async function discoverKtxRelationships(
|
|||
profiles: profile,
|
||||
},
|
||||
);
|
||||
const llmProposalResult = input.settings.llmProposals
|
||||
? await proposeKtxRelationshipCandidatesWithLlm({
|
||||
connectionId: input.connectionId,
|
||||
schema: input.schema,
|
||||
profile,
|
||||
llmRuntime: input.llmRuntime ?? null,
|
||||
settings: {
|
||||
maxTablesPerBatch: input.settings.maxLlmTablesPerBatch,
|
||||
},
|
||||
})
|
||||
: { candidates: [], warnings: [], llmCalls: 0, summary: 'skipped' as const };
|
||||
// The LLM proposal is one more unit of relationship work, so it honors the same
|
||||
// budget/abort gate as profiling, validation, and composite probing — a stage
|
||||
// that already exhausted its budget (or was aborted) must not start a fresh call.
|
||||
const llmProposalResult =
|
||||
input.settings.llmProposals && !budget.check()
|
||||
? await proposeKtxRelationshipCandidatesWithLlm({
|
||||
connectionId: input.connectionId,
|
||||
schema: input.schema,
|
||||
profile,
|
||||
llmRuntime: input.llmRuntime ?? null,
|
||||
settings: {
|
||||
maxTablesPerBatch: input.settings.maxLlmTablesPerBatch,
|
||||
},
|
||||
})
|
||||
: { candidates: [], warnings: [], llmCalls: 0, summary: 'skipped' as const };
|
||||
const candidates = mergeKtxRelationshipDiscoveryCandidates([
|
||||
...deterministicCandidates,
|
||||
...llmProposalResult.candidates,
|
||||
|
|
@ -271,6 +295,8 @@ export async function discoverKtxRelationships(
|
|||
concurrency: input.settings.validationConcurrency,
|
||||
validationBudget: input.settings.validationBudget,
|
||||
},
|
||||
budget,
|
||||
...(input.progress ? { progress: input.progress } : {}),
|
||||
});
|
||||
const graph = resolveKtxRelationshipGraph({
|
||||
schema: input.schema,
|
||||
|
|
@ -290,6 +316,8 @@ export async function discoverKtxRelationships(
|
|||
executor,
|
||||
context: input.context,
|
||||
warnings,
|
||||
budget,
|
||||
...(input.progress ? { progress: input.progress } : {}),
|
||||
});
|
||||
const inferredAccepted = nonFormalAcceptedRelationships({
|
||||
formalIds: formalMetadata.acceptedIds,
|
||||
|
|
@ -312,6 +340,7 @@ export async function discoverKtxRelationships(
|
|||
resolvedRelationships: graph.relationships,
|
||||
});
|
||||
const compositeCounts = compositeSummary(compositeRelationships);
|
||||
const stopReason = budget.stopReason();
|
||||
|
||||
return {
|
||||
relationshipUpdate: {
|
||||
|
|
@ -329,8 +358,11 @@ export async function discoverKtxRelationships(
|
|||
profile,
|
||||
resolvedRelationships: graph.relationships,
|
||||
compositeRelationships,
|
||||
statisticalValidation: profile.sqlAvailable ? 'completed' : 'skipped',
|
||||
// A budget/abort stop means profiling did not finish, so report it as not
|
||||
// completed even though the SQL capability was available.
|
||||
statisticalValidation: profile.sqlAvailable && !stopReason ? 'completed' : 'skipped',
|
||||
llmRelationshipValidation: llmProposalResult.summary,
|
||||
warnings,
|
||||
partial: stopReason ? { reason: stopReason } : null,
|
||||
};
|
||||
}
|
||||
|
|
|
|||
|
|
@ -96,6 +96,10 @@ function rowCountForTable(profile: KtxRelationshipProfileArtifact, table: KtxEnr
|
|||
return profile.tables.find((item) => item.table.name.toLowerCase() === table.ref.name.toLowerCase())?.rowCount ?? null;
|
||||
}
|
||||
|
||||
function resolvedDescription(descriptions: Partial<Record<string, string>>): string | null {
|
||||
return descriptions.ai ?? descriptions.db ?? null;
|
||||
}
|
||||
|
||||
function buildEvidencePacket(
|
||||
schema: KtxEnrichedSchema,
|
||||
profile: KtxRelationshipProfileArtifact,
|
||||
|
|
@ -107,13 +111,17 @@ function buildEvidencePacket(
|
|||
tables: schema.tables
|
||||
.filter((table) => table.enabled)
|
||||
.slice(0, settings.maxTablesPerBatch)
|
||||
.map((table) => ({
|
||||
.map((table) => {
|
||||
const tableDescription = resolvedDescription(table.descriptions);
|
||||
return {
|
||||
name: table.ref.name,
|
||||
catalog: table.ref.catalog,
|
||||
db: table.ref.db,
|
||||
rowCount: rowCountForTable(profile, table),
|
||||
...(tableDescription ? { description: tableDescription } : {}),
|
||||
columns: table.columns.slice(0, settings.maxColumnsPerTable).map((column) => {
|
||||
const columnProfile = profileForColumn(profile, table, column);
|
||||
const columnDescription = resolvedDescription(column.descriptions);
|
||||
return {
|
||||
name: column.name,
|
||||
nativeType: column.nativeType,
|
||||
|
|
@ -121,6 +129,7 @@ function buildEvidencePacket(
|
|||
dimensionType: column.dimensionType,
|
||||
nullable: column.nullable,
|
||||
declaredPrimaryKey: column.primaryKey,
|
||||
...(columnDescription ? { description: columnDescription } : {}),
|
||||
profile: columnProfile
|
||||
? {
|
||||
rowCount: columnProfile.rowCount,
|
||||
|
|
@ -133,7 +142,8 @@ function buildEvidencePacket(
|
|||
: null,
|
||||
};
|
||||
}),
|
||||
})),
|
||||
};
|
||||
}),
|
||||
};
|
||||
}
|
||||
|
||||
|
|
|
|||
|
|
@ -1,8 +1,9 @@
|
|||
import type { KtxSqlDialect } from '../connections/dialects.js';
|
||||
import type { KtxEnrichedColumn, KtxEnrichedSchema, KtxEnrichedTable } from './enrichment-types.js';
|
||||
import { mapWithConcurrency } from './relationship-validation.js';
|
||||
import { type KtxRelationshipDetectionBudget, mapWithBudget } from './relationship-detection-budget.js';
|
||||
import type {
|
||||
KtxConnectionDriver,
|
||||
KtxProgressPort,
|
||||
KtxQueryResult,
|
||||
KtxReadOnlyQueryInput,
|
||||
KtxScanContext,
|
||||
|
|
@ -65,6 +66,8 @@ export interface ProfileKtxRelationshipSchemaInput {
|
|||
profileSampleRows?: number;
|
||||
profileConcurrency?: number;
|
||||
cache?: KtxRelationshipProfileCache;
|
||||
budget?: KtxRelationshipDetectionBudget;
|
||||
progress?: KtxProgressPort;
|
||||
}
|
||||
|
||||
export function createKtxRelationshipProfileCache(): KtxRelationshipProfileCache {
|
||||
|
|
@ -341,10 +344,14 @@ export async function profileKtxRelationshipSchema(
|
|||
const dialect = input.dialect;
|
||||
|
||||
const enabledTables = input.schema.tables.filter((candidate) => candidate.enabled);
|
||||
const tableResults = await mapWithConcurrency<KtxEnrichedTable, TableProfileResult>(
|
||||
enabledTables,
|
||||
input.profileConcurrency ?? 4,
|
||||
async (table) => {
|
||||
const { results: tableResults } = await mapWithBudget<KtxEnrichedTable, TableProfileResult>({
|
||||
inputs: enabledTables,
|
||||
concurrency: input.profileConcurrency ?? 4,
|
||||
budget: input.budget,
|
||||
onStart: async (index, total) => {
|
||||
await input.progress?.update((index + 1) / total, `Profiling table ${index + 1}/${total}`, { transient: true });
|
||||
},
|
||||
mapOne: async (table) => {
|
||||
const sampleValuesPerColumn = input.sampleValuesPerColumn ?? 5;
|
||||
const profileSampleRows = input.profileSampleRows ?? 10000;
|
||||
const cacheKey = tableProfileCacheKey({
|
||||
|
|
@ -387,9 +394,12 @@ export async function profileKtxRelationshipSchema(
|
|||
return { cached: cachedFailure, queryCount: 0 };
|
||||
}
|
||||
},
|
||||
);
|
||||
});
|
||||
|
||||
for (const result of tableResults) {
|
||||
if (!result) {
|
||||
continue;
|
||||
}
|
||||
if ('tableProfile' in result) {
|
||||
queryTotal += result.tableProfile.queryCount;
|
||||
tables.push(result.tableProfile.table);
|
||||
|
|
|
|||
|
|
@ -1,12 +1,14 @@
|
|||
import { KtxQueryError } from '../../errors.js';
|
||||
import type { KtxSqlDialect } from '../connections/dialects.js';
|
||||
import type { KtxRelationshipEndpoint } from './enrichment-types.js';
|
||||
import { applyKtxRelationshipValidationBudget, type KtxRelationshipValidationBudget } from './relationship-budget.js';
|
||||
import type { KtxRelationshipDiscoveryCandidate } from './relationship-candidates.js';
|
||||
import { type KtxRelationshipDetectionBudget, mapWithBudget } from './relationship-detection-budget.js';
|
||||
import {
|
||||
type KtxRelationshipProfileArtifact,
|
||||
type KtxRelationshipReadOnlyExecutor,
|
||||
} from './relationship-profiling.js';
|
||||
import type { KtxQueryResult, KtxScanContext, KtxTableRef } from './types.js';
|
||||
import type { KtxProgressPort, KtxQueryResult, KtxScanContext, KtxTableRef } from './types.js';
|
||||
|
||||
type KtxValidatedRelationshipStatus = 'accepted' | 'review' | 'rejected';
|
||||
|
||||
|
|
@ -51,6 +53,8 @@ export interface ValidateKtxRelationshipDiscoveryCandidatesInput {
|
|||
ctx: KtxScanContext;
|
||||
tableCount?: number;
|
||||
settings?: Partial<KtxRelationshipValidationSettings>;
|
||||
budget?: KtxRelationshipDetectionBudget;
|
||||
progress?: KtxProgressPort;
|
||||
}
|
||||
|
||||
const DEFAULT_SETTINGS: KtxRelationshipValidationSettings = {
|
||||
|
|
@ -182,31 +186,10 @@ function statusFor(input: {
|
|||
return 'rejected';
|
||||
}
|
||||
|
||||
export async function mapWithConcurrency<TInput, TOutput>(
|
||||
inputs: readonly TInput[],
|
||||
concurrency: number,
|
||||
mapOne: (input: TInput) => Promise<TOutput>,
|
||||
): Promise<TOutput[]> {
|
||||
const safeConcurrency = Math.max(1, Math.floor(concurrency));
|
||||
const outputs: TOutput[] = new Array(inputs.length);
|
||||
let nextIndex = 0;
|
||||
|
||||
async function worker(): Promise<void> {
|
||||
while (nextIndex < inputs.length) {
|
||||
const index = nextIndex;
|
||||
nextIndex += 1;
|
||||
outputs[index] = await mapOne(inputs[index] as TInput);
|
||||
}
|
||||
}
|
||||
|
||||
await Promise.all(Array.from({ length: Math.min(safeConcurrency, inputs.length) }, () => worker()));
|
||||
return outputs;
|
||||
}
|
||||
|
||||
function reviewWithoutValidation(
|
||||
candidate: KtxRelationshipDiscoveryCandidate,
|
||||
profiles: KtxRelationshipProfileArtifact,
|
||||
reason: 'validation_unavailable' | 'profile_unavailable' | 'validation_unattempted',
|
||||
reason: 'validation_unavailable' | 'profile_unavailable' | 'validation_unattempted' | 'validation_query_failed',
|
||||
): KtxValidatedRelationshipDiscoveryCandidate {
|
||||
const sourceColumn = singleRelationshipColumn(candidate.from);
|
||||
const targetColumn = singleRelationshipColumn(candidate.to);
|
||||
|
|
@ -257,21 +240,35 @@ export async function validateKtxRelationshipDiscoveryCandidates(
|
|||
return reviewWithoutValidation(candidate, input.profiles, 'profile_unavailable');
|
||||
}
|
||||
|
||||
const result = await executor.executeReadOnly(
|
||||
{
|
||||
connectionId: input.connectionId,
|
||||
sql: buildCoverageSql({
|
||||
dialect,
|
||||
childTable: candidate.from.table,
|
||||
childColumn: sourceColumn,
|
||||
parentTable: candidate.to.table,
|
||||
parentColumn: targetColumn,
|
||||
maxDistinctSourceValues: settings.maxDistinctSourceValues,
|
||||
}),
|
||||
maxRows: 1,
|
||||
},
|
||||
input.ctx,
|
||||
);
|
||||
let result: KtxQueryResult;
|
||||
try {
|
||||
result = await executor.executeReadOnly(
|
||||
{
|
||||
connectionId: input.connectionId,
|
||||
sql: buildCoverageSql({
|
||||
dialect,
|
||||
childTable: candidate.from.table,
|
||||
childColumn: sourceColumn,
|
||||
parentTable: candidate.to.table,
|
||||
parentColumn: targetColumn,
|
||||
maxDistinctSourceValues: settings.maxDistinctSourceValues,
|
||||
}),
|
||||
maxRows: 1,
|
||||
},
|
||||
input.ctx,
|
||||
);
|
||||
} catch (error) {
|
||||
// A bounded-query timeout (or other query rejection) on this one coverage
|
||||
// probe is best-effort: skip the candidate to review rather than aborting
|
||||
// the whole validation pass.
|
||||
if (error instanceof KtxQueryError) {
|
||||
input.ctx.logger?.warn(
|
||||
`relationship validation query skipped for ${candidate.from.table.name}.${sourceColumn} -> ${candidate.to.table.name}.${targetColumn}: ${error.message}`,
|
||||
);
|
||||
return reviewWithoutValidation(candidate, input.profiles, 'validation_query_failed');
|
||||
}
|
||||
throw error;
|
||||
}
|
||||
const childDistinct = numberAt(result, 'child_distinct');
|
||||
const parentDistinct = numberAt(result, 'parent_distinct');
|
||||
const overlap = numberAt(result, 'overlap');
|
||||
|
|
@ -330,18 +327,29 @@ export async function validateKtxRelationshipDiscoveryCandidates(
|
|||
budget: settings.validationBudget,
|
||||
score: (candidate) => candidate.confidence,
|
||||
});
|
||||
const validated = await mapWithConcurrency(
|
||||
budgeted.toValidate.map((entry) => entry.candidate),
|
||||
settings.concurrency,
|
||||
validateCandidate,
|
||||
);
|
||||
const { results: validated } = await mapWithBudget({
|
||||
inputs: budgeted.toValidate,
|
||||
concurrency: settings.concurrency,
|
||||
budget: input.budget,
|
||||
onStart: async (index, total) => {
|
||||
await input.progress?.update((index + 1) / total, `Validating candidate ${index + 1}/${total}`, {
|
||||
transient: true,
|
||||
});
|
||||
},
|
||||
mapOne: (entry) => validateCandidate(entry.candidate),
|
||||
});
|
||||
const byOriginalIndex = new Map<number, KtxValidatedRelationshipDiscoveryCandidate>();
|
||||
for (let index = 0; index < budgeted.toValidate.length; index += 1) {
|
||||
const originalIndex = budgeted.toValidate[index]?.originalIndex;
|
||||
const candidate = validated[index];
|
||||
if (originalIndex !== undefined && candidate) {
|
||||
byOriginalIndex.set(originalIndex, candidate);
|
||||
const entry = budgeted.toValidate[index];
|
||||
if (!entry) {
|
||||
continue;
|
||||
}
|
||||
// A candidate left unvalidated by the wall-clock budget degrades to the
|
||||
// same review status as one deferred by the validation count budget.
|
||||
byOriginalIndex.set(
|
||||
entry.originalIndex,
|
||||
validated[index] ?? reviewWithoutValidation(entry.candidate, input.profiles, 'validation_unattempted'),
|
||||
);
|
||||
}
|
||||
for (const entry of budgeted.deferred) {
|
||||
byOriginalIndex.set(
|
||||
|
|
|
|||
|
|
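Reviewer note: the hunks above replace `mapWithConcurrency` with `mapWithBudget`, whose implementation in `relationship-detection-budget.js` is not shown in this part of the diff. The following is a minimal sketch of what a budget-aware concurrent map could look like, inferred only from the call sites (`inputs`, `concurrency`, `budget`, `onStart`, `mapOne`, and a `results` array whose unprocessed slots are falsy). The names `DetectionBudget`, `mapWithBudgetSketch`, `runBudgetDemo`, and the `'budget_exhausted'` string are assumptions made for illustration, not the project's actual API.

```typescript
// Hypothetical budget shape: check() returns a stop reason once the budget is
// spent or the scan was aborted, and null while work may continue.
interface DetectionBudget {
  check(): string | null;
}

interface MapWithBudgetInput<TInput, TOutput> {
  inputs: readonly TInput[];
  concurrency: number;
  budget?: DetectionBudget;
  onStart?: (index: number, total: number) => Promise<void>;
  mapOne: (input: TInput) => Promise<TOutput>;
}

async function mapWithBudgetSketch<TInput, TOutput>(
  options: MapWithBudgetInput<TInput, TOutput>,
): Promise<{ results: Array<TOutput | undefined> }> {
  const { inputs, budget, onStart, mapOne } = options;
  const workerCount = Math.min(Math.max(1, Math.floor(options.concurrency)), inputs.length);
  // Slots stay undefined for inputs that were never started.
  const results: Array<TOutput | undefined> = new Array(inputs.length).fill(undefined);
  let nextIndex = 0;

  async function worker(): Promise<void> {
    while (nextIndex < inputs.length) {
      // Gate every new unit of work on the shared budget; in-flight work finishes.
      if (budget?.check()) {
        return;
      }
      const index = nextIndex;
      nextIndex += 1;
      await onStart?.(index, inputs.length);
      results[index] = await mapOne(inputs[index] as TInput);
    }
  }

  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return { results };
}

// Usage: a fake budget that stops after two items have been processed.
async function runBudgetDemo(): Promise<Array<number | undefined>> {
  let processed = 0;
  const budget: DetectionBudget = {
    check: () => (processed >= 2 ? 'budget_exhausted' : null),
  };
  const { results } = await mapWithBudgetSketch({
    inputs: [1, 2, 3, 4],
    concurrency: 1,
    budget,
    mapOne: async (value: number) => {
      processed += 1;
      return value * 10;
    },
  });
  return results;
}
```

Leaving unprocessed slots empty, rather than throwing, is what lets the validation caller above degrade those candidates to `validation_unattempted` and lets the profiling loop skip falsy results.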
@ -61,6 +61,9 @@ function isSafeRunId(runId: string): boolean {
|
|||
return /^[a-zA-Z0-9][a-zA-Z0-9_.-]*$/.test(runId);
|
||||
}
|
||||
|
||||
const STAGES_TABLE = 'local_scan_enrichment_stages';
|
||||
const STAGES_PRIMARY_KEY = ['connection_id', 'stage', 'input_hash'] as const;
|
||||
|
||||
export class SqliteLocalScanEnrichmentStateStore implements KtxScanEnrichmentStateStore {
|
||||
private readonly db: Database.Database;
|
||||
|
||||
|
|
@ -68,6 +71,10 @@ export class SqliteLocalScanEnrichmentStateStore implements KtxScanEnrichmentSta
|
|||
mkdirSync(dirname(options.dbPath), { recursive: true });
|
||||
this.db = new Database(options.dbPath);
|
||||
this.db.pragma('journal_mode = WAL');
|
||||
// Disposable local resume cache: if a prior ktx wrote the table with a
|
||||
// different primary key, drop it rather than migrate. Losing it only means
|
||||
// one ingest cannot resume; it never corrupts a queryable artifact.
|
||||
this.dropStagesTableIfPrimaryKeyDiffers();
|
||||
this.db.exec(`
|
||||
CREATE TABLE IF NOT EXISTS local_scan_enrichment_stages (
|
||||
run_id TEXT NOT NULL,
|
||||
|
|
@ -80,32 +87,53 @@ export class SqliteLocalScanEnrichmentStateStore implements KtxScanEnrichmentSta
|
|||
output_json TEXT,
|
||||
error_message TEXT,
|
||||
updated_at TEXT NOT NULL,
|
||||
PRIMARY KEY (run_id, stage)
|
||||
PRIMARY KEY (connection_id, stage, input_hash)
|
||||
);
|
||||
|
||||
CREATE INDEX IF NOT EXISTS local_scan_enrichment_stages_content_idx
|
||||
ON local_scan_enrichment_stages (connection_id, stage, input_hash, updated_at);
|
||||
CREATE INDEX IF NOT EXISTS local_scan_enrichment_stages_run_idx
|
||||
ON local_scan_enrichment_stages (run_id, updated_at, stage);
|
||||
`);
|
||||
}
|
||||
|
||||
private dropStagesTableIfPrimaryKeyDiffers(): void {
|
||||
const columns = this.db.prepare(`PRAGMA table_info(${STAGES_TABLE})`).all() as Array<{
|
||||
name: string;
|
||||
pk: number;
|
||||
}>;
|
||||
if (columns.length === 0) {
|
||||
return;
|
||||
}
|
||||
const primaryKey = columns
|
||||
.filter((column) => column.pk > 0)
|
||||
.sort((left, right) => left.pk - right.pk)
|
||||
.map((column) => column.name);
|
||||
const matches =
|
||||
primaryKey.length === STAGES_PRIMARY_KEY.length &&
|
||||
primaryKey.every((name, index) => name === STAGES_PRIMARY_KEY[index]);
|
||||
if (!matches) {
|
||||
this.db.exec(`DROP TABLE ${STAGES_TABLE}`);
|
||||
}
|
||||
}
|
||||
|
||||
async findCompletedStage<TOutput = unknown>(
|
||||
input: KtxScanEnrichmentStageLookup,
|
||||
): Promise<KtxScanEnrichmentCompletedStage<TOutput> | null> {
|
||||
if (!isSafeRunId(input.runId)) {
|
||||
return null;
|
||||
}
|
||||
const row = this.db
|
||||
.prepare(
|
||||
`
|
||||
SELECT *
|
||||
FROM local_scan_enrichment_stages
|
||||
WHERE run_id = ?
|
||||
WHERE connection_id = ?
|
||||
AND stage = ?
|
||||
AND input_hash = ?
|
||||
AND status = 'completed'
|
||||
ORDER BY updated_at DESC
|
||||
LIMIT 1
|
||||
`,
|
||||
)
|
||||
.get(input.runId, input.stage, input.inputHash) as StageRow | undefined;
|
||||
.get(input.connectionId, input.stage, input.inputHash) as StageRow | undefined;
|
||||
|
||||
if (!row) {
|
||||
return null;
|
||||
|
|
@ -114,6 +142,31 @@ export class SqliteLocalScanEnrichmentStateStore implements KtxScanEnrichmentSta
|
|||
return parsed.status === 'completed' ? parsed : null;
|
||||
}
|
||||
|
||||
async findLatestCompletedStage(input: {
|
||||
connectionId: string;
|
||||
stage: KtxScanEnrichmentStage;
|
||||
}): Promise<KtxScanEnrichmentCompletedStage | null> {
|
||||
const row = this.db
|
||||
.prepare(
|
||||
`
|
||||
SELECT *
|
||||
FROM local_scan_enrichment_stages
|
||||
WHERE connection_id = ?
|
||||
AND stage = ?
|
||||
AND status = 'completed'
|
||||
ORDER BY updated_at DESC
|
||||
LIMIT 1
|
||||
`,
|
||||
)
|
||||
.get(input.connectionId, input.stage) as StageRow | undefined;
|
||||
|
||||
if (!row) {
|
||||
return null;
|
||||
}
|
||||
const parsed = parseStageRow(row);
|
||||
return parsed.status === 'completed' ? parsed : null;
|
||||
}
|
||||
|
||||
async saveCompletedStage<TOutput = unknown>(
|
||||
input: Omit<KtxScanEnrichmentCompletedStage<TOutput>, 'status' | 'errorMessage'>,
|
||||
): Promise<void> {
|
||||
|
|
@ -144,9 +197,8 @@ export class SqliteLocalScanEnrichmentStateStore implements KtxScanEnrichmentSta
|
|||
NULL,
|
||||
@updatedAt
|
||||
)
|
||||
ON CONFLICT(run_id, stage) DO UPDATE SET
|
||||
input_hash = excluded.input_hash,
|
||||
connection_id = excluded.connection_id,
|
||||
ON CONFLICT(connection_id, stage, input_hash) DO UPDATE SET
|
||||
run_id = excluded.run_id,
|
||||
sync_id = excluded.sync_id,
|
||||
mode = excluded.mode,
|
||||
status = excluded.status,
|
||||
|
|
@ -195,9 +247,8 @@ export class SqliteLocalScanEnrichmentStateStore implements KtxScanEnrichmentSta
|
|||
@errorMessage,
|
||||
@updatedAt
|
||||
)
|
||||
ON CONFLICT(run_id, stage) DO UPDATE SET
|
||||
input_hash = excluded.input_hash,
|
||||
connection_id = excluded.connection_id,
|
||||
ON CONFLICT(connection_id, stage, input_hash) DO UPDATE SET
|
||||
run_id = excluded.run_id,
|
||||
sync_id = excluded.sync_id,
|
||||
mode = excluded.mode,
|
||||
status = excluded.status,
|
||||
|
|
|
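Reviewer note: `dropStagesTableIfPrimaryKeyDiffers` relies on the `pk` column of `PRAGMA table_info`, which is `0` for columns outside the primary key and otherwise the 1-based position of the column within the key. The comparison can be sketched as a pure function with no database attached; `TableInfoRow`, `primaryKeyColumns`, and `primaryKeyMatches` are hypothetical names used only for this illustration.

```typescript
// Subset of a PRAGMA table_info row that the check needs.
interface TableInfoRow {
  name: string;
  pk: number; // 0 = not part of the primary key, otherwise 1-based key position
}

// Recover the ordered primary-key column list from table_info rows.
function primaryKeyColumns(rows: readonly TableInfoRow[]): string[] {
  return rows
    .filter((row) => row.pk > 0)
    .sort((left, right) => left.pk - right.pk)
    .map((row) => row.name);
}

// True when the table's key is exactly the expected columns in the expected order.
function primaryKeyMatches(rows: readonly TableInfoRow[], expected: readonly string[]): boolean {
  const actual = primaryKeyColumns(rows);
  return actual.length === expected.length && actual.every((name, index) => name === expected[index]);
}

const expectedKey = ['connection_id', 'stage', 'input_hash'] as const;

// Table written by an older build: keyed by (run_id, stage).
const oldLayout: TableInfoRow[] = [
  { name: 'run_id', pk: 1 },
  { name: 'stage', pk: 2 },
  { name: 'input_hash', pk: 0 },
  { name: 'connection_id', pk: 0 },
];

// Table with the new key; row order differs from key order on purpose.
const newLayout: TableInfoRow[] = [
  { name: 'run_id', pk: 0 },
  { name: 'input_hash', pk: 3 },
  { name: 'connection_id', pk: 1 },
  { name: 'stage', pk: 2 },
];

const oldLayoutMatches = primaryKeyMatches(oldLayout, expectedKey);
const newLayoutMatches = primaryKeyMatches(newLayout, expectedKey);
```

Sorting by `pk` before comparing matters because `table_info` lists columns in declaration order, which need not be key order.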
|||
|
|
@ -385,12 +385,17 @@ type KtxScanWarningCode =
|
|||
| 'embedding_unavailable'
|
||||
| 'scan_enrichment_backend_not_configured'
|
||||
| 'relationship_validation_failed'
|
||||
| 'relationship_detection_partial'
|
||||
| 'enrichment_stage_skipped'
|
||||
| 'enrichment_stage_stale'
|
||||
| 'relationship_llm_invalid_reference'
|
||||
| 'relationship_llm_proposal_failed'
|
||||
| 'credential_redacted'
|
||||
| 'enrichment_failed'
|
||||
| 'enrichment_timeout'
|
||||
| 'description_fallback_used'
|
||||
| 'constraint_discovery_unauthorized';
|
||||
| 'constraint_discovery_unauthorized'
|
||||
| 'object_introspection_failed';
|
||||
|
||||
export interface KtxScanWarning {
|
||||
code: KtxScanWarningCode;
|
||||
|
|
|
|||
|
|
@ -93,7 +93,7 @@ async function loadCandidates(
|
|||
listed.files
|
||||
.map((path) => path.split('/')[1])
|
||||
.filter((connectionId): connectionId is string =>
|
||||
typeof connectionId === 'string' && /^[a-zA-Z0-9][a-zA-Z0-9_-]*$/.test(connectionId),
|
||||
typeof connectionId === 'string' && /^[a-zA-Z0-9_][a-zA-Z0-9_-]*$/.test(connectionId),
|
||||
),
|
||||
),
|
||||
].sort();
|
||||
|
|
|
|||
|
|
@ -20,7 +20,7 @@ interface WriteSourceOptions {
|
|||
}
|
||||
|
||||
const SL_DIR_PREFIX = 'semantic-layer';
|
||||
const CONNECTION_ID_PATTERN = /^[a-zA-Z0-9][a-zA-Z0-9_-]*$/;
|
||||
const CONNECTION_ID_PATTERN = /^[a-zA-Z0-9_][a-zA-Z0-9_-]*$/;
|
||||
|
||||
export interface LoadAllSourcesResult {
|
||||
sources: SemanticLayerSource[];
|
||||
|
|
|
|||
|
|
@ -39,7 +39,7 @@ export function assertSafeConnectionId(connectionId: string): string {
|
|||
}
|
||||
|
||||
export function isSafeConnectionId(connectionId: string | undefined): connectionId is string {
|
||||
return typeof connectionId === 'string' && /^[a-zA-Z0-9][a-zA-Z0-9_-]*$/.test(connectionId);
|
||||
return typeof connectionId === 'string' && /^[a-zA-Z0-9_][a-zA-Z0-9_-]*$/.test(connectionId);
|
||||
}
|
||||
|
||||
export function sourceNameFromPath(path: string): string {
|
||||
|
|
|
|||
|
|
@ -3,4 +3,4 @@ import { z } from 'zod';
|
|||
export const slToolConnectionIdSchema = z
|
||||
.string()
|
||||
.min(1)
|
||||
.regex(/^[a-zA-Z0-9][a-zA-Z0-9_-]*$/, 'Connection id must be alphanumeric and may contain _ or -');
|
||||
.regex(/^[a-zA-Z0-9_][a-zA-Z0-9_-]*$/, 'Connection id must be alphanumeric and may contain _ or -');
|
||||
|
|
|
|||
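Reviewer note: the four hunks above make the same change to the connection-id pattern, adding `_` to the set of allowed first characters. A quick sketch of the effect, using the new pattern exactly as it appears in the diff; the helper name `isSafeConnectionIdSketch` and the sample ids are made up for illustration.

```typescript
// New pattern from the diff: first character may now be a letter, digit, or underscore.
const CONNECTION_ID_PATTERN = /^[a-zA-Z0-9_][a-zA-Z0-9_-]*$/;

function isSafeConnectionIdSketch(value: unknown): value is string {
  return typeof value === 'string' && CONNECTION_ID_PATTERN.test(value);
}

// A leading underscore is accepted under the new pattern.
const accepted = ['warehouse', '_staging', 'pg-prod_2'].map(isSafeConnectionIdSketch);

// A leading dash, the empty string, and path-like values are still rejected.
const rejected = ['-prod', '', 'a/b', '..'].map(isSafeConnectionIdSketch);
```

The Zod message in the last hunk still reads "must be alphanumeric and may contain _ or -", which stays accurate for the id as a whole but does not mention that a leading `_` is now allowed.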
49
packages/cli/src/context/sql-analysis/dialect-notes.ts
Normal file
|
|
@ -0,0 +1,49 @@
|
|||
import { readFileSync } from 'node:fs';
|
||||
import { fileURLToPath } from 'node:url';
|
||||
import type { SqlAnalysisDialect } from './ports.js';
|
||||
|
||||
// Per-engine SQL syntax notes live as markdown files under ./dialects (one per
|
||||
// dialect), served by the sql_dialect_notes MCP tool. They are package-internal:
|
||||
// copy-runtime-assets.mjs ships them to dist, and they are never installed onto an
|
||||
// agent target. The set covers every dialect reachable from a configured warehouse
|
||||
// driver; duckdb/databricks are intentionally absent because no connector produces
|
||||
// them.
|
||||
|
||||
/** @internal Dialects with an authored ./dialects/<dialect>.md file. */
|
||||
export const DIALECTS_WITH_NOTES = [
|
||||
'postgres',
|
||||
'mysql',
|
||||
'snowflake',
|
||||
'bigquery',
|
||||
'sqlite',
|
||||
'clickhouse',
|
||||
'tsql',
|
||||
] as const;
|
||||
|
||||
type DialectWithNotes = (typeof DIALECTS_WITH_NOTES)[number];
|
||||
|
||||
const notesCache = new Map<DialectWithNotes, string>();
|
||||
|
||||
function readDialectNotes(dialect: DialectWithNotes): string {
|
||||
const cached = notesCache.get(dialect);
|
||||
if (cached !== undefined) {
|
||||
return cached;
|
||||
}
|
||||
const path = fileURLToPath(new URL(`./dialects/${dialect}.md`, import.meta.url));
|
||||
const content = readFileSync(path, 'utf-8').trimEnd();
|
||||
notesCache.set(dialect, content);
|
||||
return content;
|
||||
}
|
||||
|
||||
function hasNotes(dialect: SqlAnalysisDialect): dialect is DialectWithNotes {
|
||||
return (DIALECTS_WITH_NOTES as readonly string[]).includes(dialect);
|
||||
}
|
||||
|
||||
/**
|
||||
* SQL syntax notes for a resolved dialect. Falls back to `postgres` — the
|
||||
* resolver's own default for unrecognized drivers — so any SQL connection yields
|
||||
* usable guidance rather than an empty string.
|
||||
*/
|
||||
export function sqlDialectNotes(dialect: SqlAnalysisDialect): string {
|
||||
return readDialectNotes(hasNotes(dialect) ? dialect : 'postgres');
|
||||
}
|
||||
13
packages/cli/src/context/sql-analysis/dialects/bigquery.md
Normal file
|
|
@ -0,0 +1,13 @@
|
|||
**bigquery** SQL conventions:
|
||||
- **FQTN:** backtick-quoted `` `project.dataset.table` `` (e.g. `` `my-proj.analytics.orders` ``); backticks are required when a name contains a dash.
|
||||
- **Identifiers:** backtick to quote; column and field names are case-insensitive, dataset and table names are case-sensitive.
|
||||
- **Date/time:** `DATE_TRUNC(d, MONTH)`, `EXTRACT(YEAR FROM ts)`, `PARSE_DATE('%Y-%m-%d', s)`, `FORMAT_DATE('%Y-%m', d)`, `CURRENT_DATE()`.
|
||||
- **Series:** build a spine with `UNNEST(GENERATE_DATE_ARRAY('2023-01-01', '2023-12-01', INTERVAL 1 MONTH))` for dates (or `GENERATE_ARRAY(1, n)` for integers), then `LEFT JOIN` the aggregated facts onto it so empty periods still appear.
|
||||
- **Rolling window over time:** `RANGE` frames are numeric, so range over an integer day key — `AVG(amount) OVER (ORDER BY UNIX_DATE(day) RANGE BETWEEN 29 PRECEDING AND CURRENT ROW)` is a trailing 30-day average that tolerates gaps; or build a spine (see **Series**) and use a `ROWS` frame.
|
||||
- **Safe cast:** `SAFE_CAST(x AS FLOAT64)` (or `SAFE_CAST(x AS NUMERIC)`) returns `NULL` instead of erroring on a value that does not parse, so counting residual `NULL`s among non-sentinel rows catches an encoding the sample missed.
|
||||
- **Safe divide:** `SAFE_DIVIDE(num, den)` returns `NULL` instead of erroring when the denominator is `0`, so a rate/ratio/share is one expression with no `CASE den = 0` guard; multiply by `100` for a percentage. Prefer it over `num / den` for any computed measure whose denominator can be zero.
|
||||
- **Top-N / windows:** `QUALIFY` filters on a window result, e.g. `QUALIFY ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...) = 1`.
|
||||
- **JSON:** `JSON_VALUE(col, '$.k')` returns a scalar STRING, `JSON_QUERY(col, '$.k')` returns a subtree.
|
||||
- **Nested & repeated data (ARRAY / STRUCT):** the defining BigQuery shape (e.g. GA360 `ga_sessions.hits`, GA4 `event_params`/`user_properties`). Flatten a repeated column by cross-joining `UNNEST` correlated to its row — `FROM t, UNNEST(t.hits) AS h, UNNEST(h.product) AS p` — and read STRUCT fields with dot notation (`h.page.pagePath`, `p.productRevenue`). Pull one value out of a key-value parameter array with a scalar subquery: `(SELECT ep.value.int_value FROM UNNEST(event_params) AS ep WHERE ep.key = 'page_view')`. An `UNNEST` multiplies the parent row by the array's length, so a `COUNT(*)`/`SUM` after it double-counts the parent — count the parent key with `COUNT(DISTINCT visitId)` (or aggregate *inside* the unnest); use `LEFT JOIN UNNEST(arr)` to keep rows whose array is empty.
|
||||
- **Geospatial (GEOGRAPHY):** build a point with `ST_GEOGPOINT(longitude, latitude)` — **longitude first** — or parse text with `ST_GEOGFROMTEXT(wkt)` / `ST_GEOGFROMGEOJSON(s)`. Predicates: containment `ST_CONTAINS(area, pt)` / `ST_WITHIN(pt, area)` (`ST_WITHIN(a,b)=ST_CONTAINS(b,a)`); proximity `ST_DWITHIN(g1, g2, meters)` (geodesic); distance `ST_DISTANCE(g1, g2)` (meters); overlap `ST_INTERSECTS`. For areal allocation use `ST_AREA(g)` (m²) and `ST_AREA(ST_INTERSECTION(a, b))` for the overlapping area. Prefer these predicates over hand-rolled lat/lon `BETWEEN` boxes.
|
||||
- **Sharded tables:** query a wildcard table `` `dataset.events_*` `` and filter the shard with the `_TABLE_SUFFIX` pseudo-column, e.g. `WHERE _TABLE_SUFFIX BETWEEN '20240101' AND '20240131'`. The wildcard spans only the shards that exist — before a measure that pins specific dates/periods, confirm the matching shards are actually present (an absent endpoint silently yields no rows, not an error).
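Reviewer note: the **Rolling window over time** bullet above depends on the difference between a `RANGE` frame (bounded by key value) and a `ROWS` frame (bounded by row count). The sketch below emulates `AVG(amount) OVER (ORDER BY day_key RANGE BETWEEN 29 PRECEDING AND CURRENT ROW)` in TypeScript to show why the `RANGE` form tolerates gaps. It is an illustration of the frame semantics only, not BigQuery code; `DailyAmount` and `trailingRangeAverage` are hypothetical names, and `dayKey` stands in for the integer that `UNIX_DATE(day)` would produce.

```typescript
interface DailyAmount {
  dayKey: number; // integer day number, standing in for UNIX_DATE(day)
  amount: number;
}

// For each row, average every row whose key lies in [dayKey - preceding, dayKey].
// Rows sharing the current key (peers) are inside the frame, as with SQL RANGE.
function trailingRangeAverage(rows: readonly DailyAmount[], preceding: number): number[] {
  return rows.map((current) => {
    const inFrame = rows.filter(
      (row) => row.dayKey >= current.dayKey - preceding && row.dayKey <= current.dayKey,
    );
    const total = inFrame.reduce((sum, row) => sum + row.amount, 0);
    return total / inFrame.length;
  });
}

// Two adjacent days, then a gap of more than 30 days.
const sample: DailyAmount[] = [
  { dayKey: 0, amount: 10 },
  { dayKey: 1, amount: 20 },
  { dayKey: 40, amount: 60 },
];

const averages = trailingRangeAverage(sample, 29);
```

For the last row the frame holds only that row, because the two earlier days fall outside the 30-day span. A `ROWS BETWEEN 29 PRECEDING AND CURRENT ROW` frame over the same three rows would instead average all of them, which is why the note recommends a spine before using `ROWS`.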
|
||||
|
|
@ -0,0 +1,9 @@
|
|||
**clickhouse** SQL conventions:
|
||||
- **FQTN:** `database.table` (e.g. `analytics.orders`).
|
||||
- **Identifiers:** quote with backticks (`` `Order` ``) or double quotes; identifiers are case-sensitive.
|
||||
- **Date/time:** native `Date`/`DateTime` types. Bucket with `toStartOfMonth(ts)`, `toStartOfDay(ts)`, `toYYYYMM(ts)`; parse with `toDate(s)` / `parseDateTimeBestEffort(s)`; format with `formatDateTime(ts, '%Y-%m')`.
|
||||
- **Series:** `numbers(n)` / `range(n)` generate an integer sequence; offset a start date with `addMonths(toDate('2023-01-01'), number)` (or `arrayJoin`) to form a spine, then `LEFT JOIN` the aggregated facts onto it so empty periods still appear.
|
||||
- **Rolling window over time:** a numeric range frame over a `Date` column counts in days and tolerates gaps — `AVG(amount) OVER (ORDER BY day RANGE BETWEEN 29 PRECEDING AND CURRENT ROW)` is a trailing 30-day average (use seconds for a `DateTime` key; the `INTERVAL` form is unsupported); or build a spine (see **Series**) and use a `ROWS` frame.
|
||||
- **Safe cast:** `toFloat64OrNull(x)` / `toDecimal64OrNull(x, s)` returns `NULL` on a value that does not parse (the `...OrZero` variants return `0` instead), so counting residual `NULL`s among non-sentinel rows catches an encoding the sample missed.
|
||||
- **Top-N / windows:** use the `LIMIT n BY key` clause for n rows per key, or rank in a CTE with `ROW_NUMBER() OVER (...)` and filter outside it.
|
||||
- **JSON:** extract from a String column with `JSONExtractString(col, 'k')`, `JSONExtractInt(col, 'k')`, etc.; a native `JSON`-typed column is traversed by dot path `col.k`.
|
||||
9
packages/cli/src/context/sql-analysis/dialects/mysql.md
Normal file
|
|
@ -0,0 +1,9 @@
|
|||
**mysql** SQL conventions:
|
||||
- **FQTN:** `database.table` (MySQL has no separate schema layer — a schema is a database).
|
||||
- **Identifiers:** quote with backticks (`` `order` ``); table-name case-sensitivity follows the server filesystem, while column names are case-insensitive.
|
||||
- **Date/time:** `DATE_FORMAT(ts, '%Y-%m')`, `STR_TO_DATE(s, fmt)`, `YEAR(ts)`/`MONTH(ts)`, `CURDATE()`, `NOW()`.
|
||||
- **Series:** no series function — build a spine with a recursive CTE, e.g. `WITH RECURSIVE months(d) AS (SELECT '2023-01-01' UNION ALL SELECT DATE_ADD(d, INTERVAL 1 MONTH) FROM months WHERE d < '2023-12-01')`, then `LEFT JOIN` the aggregated facts onto it so empty periods still appear.
|
||||
- **Rolling window over time:** a native interval range frame over a temporal order key tolerates gaps — `AVG(amount) OVER (ORDER BY day RANGE BETWEEN INTERVAL 29 DAY PRECEDING AND CURRENT ROW)` is a trailing 30-day average without a spine; guard minimum periods with `COUNT(*) OVER (<same frame>)`.
|
||||
- **Safe cast:** MySQL has no `TRY_CAST`, and `CAST('abc' AS DECIMAL)` returns `0` with a warning rather than erroring. Guard with a pattern test first: `CASE WHEN x REGEXP '^-?[0-9]+([.][0-9]+)?$' THEN CAST(x AS DECIMAL(18,4)) END` makes a value that does not parse `NULL`, so a residual-`NULL` count catches an encoding the sample missed (`REGEXP_REPLACE` can strip symbols). A looser class such as `[0-9.]+` also admits `1.2.3`, which the cast truncates instead of rejecting.
|
||||
- **Top-N / windows:** rank in a CTE with `ROW_NUMBER() OVER (...)` and filter outside it; use `ORDER BY ... LIMIT n` for a global top-N.
|
||||
- **JSON:** `JSON_EXTRACT(col, '$.k')`, or the `col->'$.k'` / `col->>'$.k'` shortcuts (`->>` unquotes to text).
|
||||
10
packages/cli/src/context/sql-analysis/dialects/postgres.md
Normal file
|
|
@ -0,0 +1,10 @@
|
|||
**postgres** SQL conventions:
|
||||
- **FQTN:** `schema.table` (e.g. `public.orders`); one query targets a single database, so qualify by schema, not by database.
|
||||
- **Identifiers:** unquoted names fold to lower-case; double-quote (`"Name"`) only to keep case or use a reserved word.
|
||||
- **Date/time:** `date_trunc('month', ts)`, `EXTRACT(YEAR FROM ts)`, `to_char(ts, 'YYYY-MM')`, `CURRENT_DATE`; cast text to a date with `col::date`.
|
||||
- **Series:** build a date/number spine with `generate_series('2023-01-01'::date, '2023-12-01'::date, interval '1 month')` (or `generate_series(1, n)` for integers), then `LEFT JOIN` the aggregated facts onto it so empty periods still appear.
|
||||
- **Rolling window over time:** a native calendar-range frame spans real dates and tolerates gaps — `AVG(amount) OVER (ORDER BY day RANGE BETWEEN INTERVAL '29 days' PRECEDING AND CURRENT ROW)` is a trailing 30-day average without a spine; guard minimum periods with `COUNT(*) OVER (<same frame>)`.
|
||||
- **Integer division:** `/` between two integers truncates (`5 / 2` → `2`), so a rate or `SUM(a) / COUNT(*)` silently floors to an integer; cast one operand first — `a::numeric / b` or `a * 1.0 / b` — and round only in the final projection.
|
||||
- **Safe cast:** postgres has no `TRY_CAST`; guard a text-encoded number with a pattern test before casting: `CASE WHEN x ~ '^-?[0-9]+([.][0-9]+)?$' THEN x::numeric END` yields `NULL` for a value that does not parse, so counting residual `NULL`s among non-sentinel rows catches an encoding the sample missed (`regexp_replace` can strip symbols, but chained `REPLACE` is the portable default). A looser class such as `[0-9.]+` also admits `1.2.3`, which makes the cast raise an error.
|
||||
- **Top-N / windows:** rank in a CTE with `ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...)` and filter in the outer query, or use `DISTINCT ON (key) ... ORDER BY key, ...` for one row per key.
|
||||
- **JSON:** `col->'k'` returns json, `col->>'k'` returns text, deep path `col#>>'{a,b}'`; prefer `jsonb` operators on `jsonb` columns.
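
The pattern guard in the **Safe cast** bullet can be illustrated outside the database. The following is a minimal TypeScript sketch, not part of the change itself: `guardedCast` and `residualNulls` are hypothetical helpers, `Number()` stands in for the SQL cast, and the two regular expressions are assumptions chosen to contrast a loose character class with a stricter "digits plus one optional fraction" shape.

```typescript
// Loose guard: any mix of digits and dots, so "1.2.3" slips through.
const LOOSE_NUMERIC = /^-?[0-9.]+$/;
// Stricter guard (an assumption for illustration): digits with at most one fraction.
const STRICT_NUMERIC = /^-?[0-9]+([.][0-9]+)?$/;

// Mirrors `CASE WHEN x ~ pattern THEN CAST(x) END`: a non-matching value becomes null.
function guardedCast(value: string, pattern: RegExp): number | null {
  return pattern.test(value) ? Number(value) : null;
}

// Counts values the guard rejected, i.e. the residual-NULL coverage check.
function residualNulls(values: string[], pattern: RegExp): number {
  return values.filter((v) => guardedCast(v, pattern) === null).length;
}

const sample = ['12.50', '-3', '1.2.3', '$40', 'n/a'];
const looseRejected = residualNulls(sample, LOOSE_NUMERIC);
const strictRejected = residualNulls(sample, STRICT_NUMERIC);
```

With this sample the loose guard rejects only `$40` and `n/a`, while the strict guard also rejects `1.2.3`, which is the kind of encoding a residual-`NULL` count is meant to surface.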
|
||||
10
packages/cli/src/context/sql-analysis/dialects/snowflake.md
Normal file
|
|
@ -0,0 +1,10 @@
|
|||
**snowflake** SQL conventions:
|
||||
- **FQTN:** three-part `DATABASE.SCHEMA.TABLE` (e.g. `analytics.public.orders`).
|
||||
- **Identifiers:** unquoted names fold to UPPER-case; double-quote for a case-sensitive or reserved name — `orders` resolves to `"ORDERS"`, which is a different object from `"orders"`.
|
||||
- **Date/time:** `DATE_TRUNC('month', ts)`, `TO_DATE(s[, fmt])`, `DATEADD(day, -7, CURRENT_DATE)`, `CURRENT_DATE`.
|
||||
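- **Series:** generate rows with `TABLE(GENERATOR(ROWCOUNT => n))` and offset a start date via `DATEADD('month', SEQ4(), '2023-01-01')` (or a recursive CTE) to form a spine, then `LEFT JOIN` the aggregated facts onto it so empty periods still appear. `SEQ4()` is not guaranteed to be gap-free; when every period must be present, offset by `ROW_NUMBER() OVER (ORDER BY SEQ4()) - 1` instead.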
|
||||
- **Rolling window over time:** a native interval range frame over a date/timestamp order key tolerates gaps — `AVG(amount) OVER (ORDER BY day RANGE BETWEEN INTERVAL '29 days' PRECEDING AND CURRENT ROW)` is a trailing 30-day average without a spine; guard minimum periods with `COUNT(*) OVER (<same frame>)`.
|
||||
- **Safe cast:** `TRY_TO_NUMBER(x)` / `TRY_TO_DECIMAL(x, p, s)` (or `TRY_CAST(x AS NUMBER)`) returns `NULL` instead of erroring on a value that does not parse, so a residual-`NULL` count among non-sentinel rows catches an encoding the sample missed.
|
||||
- **Top-N / windows:** `QUALIFY` filters on a window result without a subquery, e.g. `QUALIFY ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...) = 1`.
|
||||
- **Semi-structured (VARIANT):** traverse with a colon path and cast with `::`, e.g. `src:vehicle[0].make::string`, `payload:events.date::date`; expand arrays with `LATERAL FLATTEN`.
|
||||
- **Geospatial (GEOGRAPHY):** build a point with `ST_MAKEPOINT(longitude, latitude)` — **longitude first** — or `TO_GEOGRAPHY(wkt_or_geojson)`; an area polygon from a closed ring of corner points with `ST_MAKEPOLYGON(ST_MAKELINE(ARRAY_CONSTRUCT(p1, p2, …, p1)))` (repeat the first point last to close). Predicates: proximity `ST_DWITHIN(g1, g2, meters)` (geodesic) and distance `ST_DISTANCE(g1, g2)` (meters); containment `ST_CONTAINS(area, pt)` / `ST_WITHIN(pt, area)` where `ST_WITHIN(a,b)=ST_CONTAINS(b,a)`; overlap `ST_INTERSECTS`. Prefer these predicates over hand-rolled lat/lon `BETWEEN` boxes.
|
||||
11
packages/cli/src/context/sql-analysis/dialects/sqlite.md
Normal file
|
|
@ -0,0 +1,11 @@
|
|||
**sqlite** SQL conventions:
|
||||
- **FQTN:** usually the bare `table`; `main.table` to be explicit, `attached.table` for an attached database.
|
||||
- **Identifiers:** case-insensitive; double-quote (`"Name"`) to preserve a name with spaces or a keyword.
|
||||
- **Date/time:** there is no native date type — values are TEXT, INTEGER, or REAL. Format and bucket with `strftime('%Y-%m', col)`, `date(col)`, `datetime(col)`, and take day differences with `julianday(a) - julianday(b)`. Confirm the stored encoding (ISO text vs Unix epoch) before comparing.
|
||||
- **Series:** no series function — build a spine with a recursive CTE, e.g. `WITH RECURSIVE months(d) AS (SELECT '2023-01-01' UNION ALL SELECT date(d, '+1 month') FROM months WHERE d < '2023-12-01')`, then `LEFT JOIN` the aggregated facts onto it so empty periods still appear.
|
||||
- **Rolling window over time:** there is no date-interval range frame (a `RANGE` offset needs a single numeric order key, and dates are TEXT), so build a gap-free date spine (see **Series**) and use a row frame — `AVG(amount) OVER (ORDER BY day ROWS BETWEEN 29 PRECEDING AND CURRENT ROW)` then equals a trailing 30-day average; guard minimum periods with `COUNT(*) OVER (<same frame>)`.
|
||||
- **Integer division:** `/` between two integers truncates (`5 / 2` → `2`), so a rate or `SUM(a) / COUNT(*)` silently floors to an integer; force real division with `a * 1.0 / b` (or `CAST(a AS REAL) / b`) and round only in the final projection.
|
||||
- **Safe cast:** sqlite has no failure-signaling cast — `CAST('abc' AS REAL)` returns `0.0` and `CAST('12abc' AS REAL)` returns `12.0` (no error, no `NULL`), so an `IS NULL` coverage check silently passes. Detect a value that did not parse with a pattern guard before casting, e.g. `CASE WHEN cleaned NOT GLOB '*[^0-9.]*' THEN CAST(cleaned AS REAL) END` (strip any leading sign first), then count the residual `NULL`s.
|
||||
- **Rounding (exact half-up at `.5` boundaries):** `ROUND(x, n)` rounds half-away-from-zero, but binary floating-point often stores a decimal half-way value just *below* the true value, so the round goes the wrong way: `ROUND(6.475, 2)` returns `6.47`, not `6.48`. When a rounded measure must match exact half-up (a displayed average, rate, or price), nudge by a tiny epsilon below display precision before rounding: `ROUND(x + 1e-9, n)` lifts `6.4749999…` back to `6.475` so it rounds to `6.48` (it leaves non-boundary values unchanged). Round once, at full precision, in the final projection, never in intermediate CTEs.
|
||||
- **Top-N / windows:** rank in a CTE with `ROW_NUMBER() OVER (...)` and filter in the outer query; use `ORDER BY ... LIMIT n` for a global top-N.
|
||||
- **JSON:** `json_extract(col, '$.k')`, or the `col->'$.k'` / `col->>'$.k'` operators (`->>` returns text).
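
The **Rounding** bullet rests on a floating-point fact that is easy to reproduce. The sketch below is an illustration in TypeScript, not SQLite: it assumes `Number.prototype.toFixed` as a stand-in for `ROUND(x, n)` (both see the same IEEE-754 double), and `roundNaive` / `roundNudged` are hypothetical names.

```typescript
// 6.475 is stored as 6.47499999999999964..., so rounding to 2 places goes down.
function roundNaive(x: number, digits: number): string {
  return x.toFixed(digits);
}

// Nudge by an epsilon far below display precision before rounding.
function roundNudged(x: number, digits: number, eps = 1e-9): string {
  return (x + eps).toFixed(digits);
}

const naive = roundNaive(6.475, 2); // "6.47"
const nudged = roundNudged(6.475, 2); // "6.48"
const untouched = roundNudged(6.474, 2); // "6.47": not on a boundary, so unchanged
```

The epsilon only matters for values that sit on a `.5` boundary at the display precision, which is why applying it once in the final projection is enough.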
|
||||
10
packages/cli/src/context/sql-analysis/dialects/tsql.md
Normal file
|
|
@ -0,0 +1,10 @@
|
|||
**tsql** (SQL Server) SQL conventions:
|
||||
- **FQTN:** `schema.table` (e.g. `dbo.orders`), or `database.schema.table` across databases.
|
||||
- **Identifiers:** quote with square brackets (`[Order]`), or double quotes when `QUOTED_IDENTIFIER` is on; case-sensitivity is set by the database collation (commonly case-insensitive).
|
||||
- **Date/time:** `DATEPART(year, ts)`, `DATEADD(day, -7, ts)`, `DATEDIFF(day, a, b)`, `CONVERT(date, ts)`, `FORMAT(ts, 'yyyy-MM')`, `GETDATE()`.
|
||||
- **Series:** no series function before SQL Server 2022 (which adds `GENERATE_SERIES`); build a spine with a recursive CTE, e.g. `WITH months AS (SELECT CAST('2023-01-01' AS date) AS d UNION ALL SELECT DATEADD(month, 1, d) FROM months WHERE d < '2023-12-01')` (recursion stops at 100 levels by default; raise the limit with `OPTION (MAXRECURSION n)`, where `0` removes it), or a numbers/tally table, then `LEFT JOIN` the aggregated facts onto it so empty periods still appear.
|
||||
- **Rolling window over time:** `RANGE` supports only `UNBOUNDED`/`CURRENT ROW` (no offset frame), so build a gap-free date spine (see **Series**) and use a row frame — `AVG(amount) OVER (ORDER BY day ROWS BETWEEN 29 PRECEDING AND CURRENT ROW)` — or a date-keyed self-join on `f.day BETWEEN DATEADD(day, -29, d.day) AND d.day`; guard minimum periods with `COUNT(*) OVER (<same frame>)`.
|
||||
- **Integer division:** `/` between two `int`s truncates (`5 / 2` → `2`), so a rate or `SUM(a) / COUNT(*)` silently floors to an integer; cast one operand first — `CAST(a AS decimal(18,4)) / b` or `a * 1.0 / b` — and round only in the final projection.
|
||||
- **Safe cast:** `TRY_CAST(x AS DECIMAL(18,4))` (or `TRY_CONVERT(decimal(18,4), x)`) returns `NULL` instead of erroring on a value that does not parse, so a residual-`NULL` count among non-sentinel rows catches an encoding the sample missed.
|
||||
- **Top-N / windows:** `SELECT TOP (n) ... ORDER BY ...` for a global top-N; for per-group, rank in a CTE with `ROW_NUMBER() OVER (...)` and filter in the outer query.
|
||||
- **JSON:** `JSON_VALUE(col, '$.k')` returns a scalar, `JSON_QUERY(col, '$.k')` returns an object/array, and `OPENJSON(col)` shreds JSON into rows.
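
Several dialect files above lean on the same reasoning: a `ROWS` frame only equals a trailing N-day average when the rows come from a gap-free spine, whereas a date `RANGE` frame tolerates gaps. The following TypeScript sketch works that through on three facts with a two-day gap. All names (`buildSpine`, `joinOntoSpine`, `trailingRowsAvg`, `trailingRangeAvg`) are hypothetical, and the sketch assumes one pre-aggregated fact per day, sorted by day.

```typescript
interface DayFact {
  day: string; // ISO date, 'YYYY-MM-DD'
  amount: number;
}

const DAY_MS = 86400000;
const toEpochDay = (iso: string): number => Date.parse(`${iso}T00:00:00Z`) / DAY_MS;
const toIso = (epochDay: number): string => new Date(epochDay * DAY_MS).toISOString().slice(0, 10);

// Like SQL AVG: an empty frame yields null.
const mean = (xs: number[]): number | null =>
  xs.length === 0 ? null : xs.reduce((sum, x) => sum + x, 0) / xs.length;

/** Gap-free daily spine from start to end, inclusive. */
function buildSpine(start: string, end: string): string[] {
  const days: string[] = [];
  for (let d = toEpochDay(start); d <= toEpochDay(end); d += 1) days.push(toIso(d));
  return days;
}

/** LEFT JOIN: every spine day appears; days without a fact carry null. */
function joinOntoSpine(spine: string[], facts: DayFact[]): (number | null)[] {
  const byDay = new Map(facts.map((f) => [f.day, f.amount]));
  return spine.map((day) => byDay.get(day) ?? null);
}

/** ROWS BETWEEN (n - 1) PRECEDING AND CURRENT ROW; nulls are skipped, as AVG does. */
function trailingRowsAvg(amounts: (number | null)[], n: number): (number | null)[] {
  return amounts.map((_, i) =>
    mean(amounts.slice(Math.max(0, i - (n - 1)), i + 1).filter((a): a is number => a !== null)),
  );
}

/** RANGE frame over the date key: facts whose day lies in [day - (n - 1), day]. */
function trailingRangeAvg(facts: DayFact[], n: number): (number | null)[] {
  return facts.map((cur) => {
    const hi = toEpochDay(cur.day);
    const inFrame = facts.filter((f) => toEpochDay(f.day) >= hi - (n - 1) && toEpochDay(f.day) <= hi);
    return mean(inFrame.map((f) => f.amount));
  });
}

const facts: DayFact[] = [
  { day: '2024-01-01', amount: 10 },
  { day: '2024-01-02', amount: 20 },
  { day: '2024-01-05', amount: 30 }, // two-day gap before this row
];

const rowsOnRawFacts = trailingRowsAvg(facts.map((f) => f.amount), 3); // last = 20 (reaches back past the gap)
const rangeOnFacts = trailingRangeAvg(facts, 3); // last = 30
const spine = buildSpine('2024-01-01', '2024-01-05');
const rowsOnSpine = trailingRowsAvg(joinOntoSpine(spine, facts), 3); // last = 30
```

On the raw facts the `ROWS` frame for 2024-01-05 averages three rows spanning five days, while the same frame over the spine matches the date-range result.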
|
||||
|
|
@ -1,4 +1,4 @@
|
|||
const FLAT_WIKI_KEY_PATTERN = /^[a-zA-Z0-9][a-zA-Z0-9_-]*$/;
|
||||
const FLAT_WIKI_KEY_PATTERN = /^[a-zA-Z0-9_][a-zA-Z0-9_-]*$/;
|
||||
|
||||
export function suggestFlatWikiKey(key: string): string {
|
||||
const suggested = key
|
||||
|
|
|
|||
|
|
@ -1,7 +1,7 @@
|
|||
import { createHash } from 'node:crypto';
|
||||
import YAML from 'yaml';
|
||||
import type { KtxEmbeddingPort } from '../../context/core/embedding.js';
|
||||
import type { KtxFileStorePort } from '../../context/core/file-store.js';
|
||||
import type { KtxFileStorePort, KtxFileWriteResult } from '../../context/core/file-store.js';
|
||||
import type { KtxLogger } from '../../context/core/config.js';
|
||||
import { noopLogger } from '../../context/core/config.js';
|
||||
import type { ReindexWorkResult } from '../index-sync/types.js';
|
||||
|
|
@ -232,11 +232,21 @@ export class KnowledgeWikiService {
|
|||
author: string,
|
||||
authorEmail: string,
|
||||
commitMessage?: string,
|
||||
): Promise<void> {
|
||||
await this.writePage(scope, scopeId, pageKey, frontmatter, content, author, authorEmail, commitMessage);
|
||||
): Promise<KtxFileWriteResult> {
|
||||
const writeResult = await this.writePage(
|
||||
scope,
|
||||
scopeId,
|
||||
pageKey,
|
||||
frontmatter,
|
||||
content,
|
||||
author,
|
||||
authorEmail,
|
||||
commitMessage,
|
||||
);
|
||||
const serialized = this.serializePage(frontmatter, content);
|
||||
const contentHash = createHash('sha256').update(serialized).digest('hex');
|
||||
await this.syncSinglePage(scope, scopeId, pageKey, frontmatter, content, contentHash);
|
||||
return writeResult;
|
||||
}
|
||||
|
||||
// ── Index sync (files → DB) ───────────────────────────────────
|
||||
|
|
|
|||
|
|
@ -21,6 +21,7 @@ export interface LocalKnowledgePage {
|
|||
tags: string[];
|
||||
refs: string[];
|
||||
slRefs: string[];
|
||||
connections: string[];
|
||||
}
|
||||
|
||||
export interface LocalKnowledgeSummary {
|
||||
|
|
@ -52,6 +53,7 @@ export interface WriteLocalKnowledgePageInput {
|
|||
representativeSql?: string;
|
||||
usage?: HistoricSqlWikiUsageFrontmatter;
|
||||
fingerprints?: string[];
|
||||
connections?: string[];
|
||||
}
|
||||
|
||||
const LOCAL_AUTHOR = 'ktx';
|
||||
|
|
@ -75,6 +77,19 @@ function stringArray(value: unknown): string[] {
|
|||
return Array.isArray(value) ? value.filter((item): item is string => typeof item === 'string') : [];
|
||||
}
|
||||
|
||||
/** Coerce a YAML scalar or list into a string list — `connections` accepts a single id or a list. */
|
||||
function stringList(value: unknown): string[] {
|
||||
if (typeof value === 'string') {
|
||||
return value.trim().length > 0 ? [value] : [];
|
||||
}
|
||||
return stringArray(value);
|
||||
}
|
||||
|
||||
/** A page applies to `connectionId` when it is unscoped (empty) or lists that id. */
|
||||
function pageMatchesConnection(connections: string[], connectionId: string | undefined): boolean {
|
||||
return connectionId === undefined || connections.length === 0 || connections.includes(connectionId);
|
||||
}
|
||||
|
||||
function knowledgePath(scope: LocalKnowledgeScope, userId: string | undefined, key: string): string {
|
||||
const safeKey = assertFlatWikiKey(key);
|
||||
if (scope === 'GLOBAL') {
|
||||
|
|
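
The scoping rule added in the hunk above (unscoped pages apply everywhere; scoped pages apply only to listed ids) can be exercised in isolation. This is a self-contained sketch that re-states `stringList` and `pageMatchesConnection` as they appear in the diff; the connection ids `warehouse` and `crm` and the `SketchPage` shape are made up for illustration.

```typescript
// Coerce a YAML scalar or list into a string list (as in the diff above).
function stringList(value: unknown): string[] {
  if (typeof value === 'string') {
    return value.trim().length > 0 ? [value] : [];
  }
  return Array.isArray(value) ? value.filter((item): item is string => typeof item === 'string') : [];
}

// A page applies when no connection is requested, when it is unscoped, or when it lists the id.
function pageMatchesConnection(connections: string[], connectionId: string | undefined): boolean {
  return connectionId === undefined || connections.length === 0 || connections.includes(connectionId);
}

// Hypothetical frontmatter values for three pages.
interface SketchPage {
  key: string;
  connections: string[];
}

const pages: SketchPage[] = [
  { key: 'glossary', connections: stringList(undefined) }, // unscoped
  { key: 'orders_warehouse', connections: stringList('warehouse') }, // single scalar
  { key: 'leads_crm', connections: stringList(['crm']) }, // list form
];

const keysFor = (connectionId: string | undefined): string[] =>
  pages.filter((page) => pageMatchesConnection(page.connections, connectionId)).map((page) => page.key);

const forWarehouse = keysFor('warehouse'); // unscoped ∪ matching
const forAll = keysFor(undefined); // no filter requested
```

Requesting `warehouse` returns the unscoped page plus the matching one, and omitting the id returns every page, which matches the "unscoped ∪ matching" behaviour described in the commit message.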
@ -104,6 +119,7 @@ function parseKnowledgePage(key: string, path: string, scope: LocalKnowledgeScop
|
|||
tags: [],
|
||||
refs: [],
|
||||
slRefs: [],
|
||||
connections: [],
|
||||
};
|
||||
}
|
||||
|
||||
|
|
@ -117,6 +133,7 @@ function parseKnowledgePage(key: string, path: string, scope: LocalKnowledgeScop
|
|||
tags: stringArray(frontmatter.tags),
|
||||
refs: stringArray(frontmatter.refs),
|
||||
slRefs: stringArray(frontmatter.sl_refs),
|
||||
connections: stringList(frontmatter.connections),
|
||||
};
|
||||
}
|
||||
|
||||
|
|
@ -133,6 +150,7 @@ function serializeKnowledgePage(input: WriteLocalKnowledgePageInput): string {
|
|||
...(input.representativeSql === undefined ? {} : { representative_sql: input.representativeSql }),
|
||||
...(input.usage === undefined ? {} : { usage: input.usage }),
|
||||
...(input.fingerprints === undefined ? {} : { fingerprints: input.fingerprints }),
|
||||
...(input.connections === undefined ? {} : { connections: input.connections }),
|
||||
};
|
||||
return `---\n${YAML.stringify(frontmatter, { indent: 2, lineWidth: 0 }).trimEnd()}\n---\n\n${input.content.trim()}\n`;
|
||||
}
|
||||
|
|
@ -180,7 +198,7 @@ export async function readLocalKnowledgePage(
|
|||
|
||||
export async function listLocalKnowledgePages(
|
||||
project: KtxLocalProject,
|
||||
input: { userId?: string } = {},
|
||||
input: { userId?: string; connectionId?: string } = {},
|
||||
): Promise<LocalKnowledgeSummary[]> {
|
||||
const userId = input.userId ?? 'local';
|
||||
const pages: LocalKnowledgeSummary[] = [];
|
||||
|
|
@ -193,7 +211,7 @@ export async function listLocalKnowledgePages(
|
|||
continue;
|
||||
}
|
||||
const page = await readPageAtPath(project, key, path, scope);
|
||||
if (page) {
|
||||
if (page && pageMatchesConnection(page.connections, input.connectionId)) {
|
||||
pages.push({ key, path, scope, summary: page.summary });
|
||||
}
|
||||
}
|
||||
|
|
@ -227,6 +245,26 @@ export async function listLocalKnowledgePageKeys(
|
|||
return [...keys].sort();
|
||||
}
|
||||
|
||||
/**
|
||||
* Connection ids referenced by any stored page's `connections` frontmatter,
|
||||
* sorted and deduped. Derived from files; an id here that is not configured in
|
||||
* `ktx.yaml` is a warn-only condition (config and content evolve independently)
|
||||
* and never blocks loading, searching, or reading.
|
||||
*/
|
||||
export async function listReferencedConnectionIds(
|
||||
project: KtxLocalProject,
|
||||
input: { userId?: string } = {},
|
||||
): Promise<string[]> {
|
||||
const pages = await loadAllKnowledgePages(project, { userId: input.userId });
|
||||
const ids = new Set<string>();
|
||||
for (const page of pages) {
|
||||
for (const id of page.connections) {
|
||||
ids.add(id);
|
||||
}
|
||||
}
|
||||
return [...ids].sort();
|
||||
}
|
||||
|
||||
function scorePage(page: LocalKnowledgePage, terms: string[]): number {
|
||||
const haystack = buildKnowledgeSearchText(page.key, page.summary, page.content, page.tags).toLowerCase();
|
||||
return terms.some((term) => haystack.includes(term)) ? 3 : 0;
|
||||
|
|
@ -266,9 +304,12 @@ function tokenLaneCandidates(pages: LocalKnowledgePage[], terms: string[]) {
|
|||
|
||||
async function loadAllKnowledgePages(
|
||||
project: KtxLocalProject,
|
||||
input: { userId?: string } = {},
|
||||
input: { userId?: string; connectionId?: string } = {},
|
||||
): Promise<LocalKnowledgePage[]> {
|
||||
const summaries = await listLocalKnowledgePages(project, { userId: input.userId });
|
||||
const summaries = await listLocalKnowledgePages(project, {
|
||||
userId: input.userId,
|
||||
connectionId: input.connectionId,
|
||||
});
|
||||
const pages: LocalKnowledgePage[] = [];
|
||||
for (const summary of summaries) {
|
||||
const page = await readPageAtPath(project, summary.key, summary.path, summary.scope);
|
||||
|
|
@ -281,10 +322,27 @@ async function loadAllKnowledgePages(
|
|||
|
||||
async function searchLocalKnowledgePagesWithSqlite(
|
||||
project: KtxLocalProject,
|
||||
input: { query: string; userId?: string; embeddingService?: KtxEmbeddingPort | null; limit?: number },
|
||||
input: {
|
||||
query: string;
|
||||
userId?: string;
|
||||
connectionId?: string;
|
||||
embeddingService?: KtxEmbeddingPort | null;
|
||||
limit?: number;
|
||||
},
|
||||
): Promise<LocalKnowledgeSearchResult[]> {
|
||||
// The sqlite index is shared across connections and `index.sync` deletes any
|
||||
// page not in its input, so sync the FULL corpus and apply the connection
|
||||
// filter only to the candidate/result set (`allowedPaths`), never to sync.
|
||||
const pages = await loadAllKnowledgePages(project, { userId: input.userId });
|
||||
const byPath = new Map(pages.map((page) => [page.path, page]));
|
||||
const allowedPaths = new Set(
|
||||
pages.filter((page) => pageMatchesConnection(page.connections, input.connectionId)).map((page) => page.path),
|
||||
);
|
||||
const allowedPages = pages.filter((page) => allowedPaths.has(page.path));
|
||||
// Scope the lexical/semantic lanes inside the query so their LIMIT applies to
|
||||
// in-scope rows; only narrow when a connection is requested (otherwise every
|
||||
// path is allowed and the filter is a no-op).
|
||||
const scopedPaths = input.connectionId === undefined ? undefined : [...allowedPaths];
|
||||
const byPath = new Map(allowedPages.map((page) => [page.path, page]));
|
||||
const embeddingService = input.embeddingService ?? null;
|
||||
const index = new SqliteKnowledgeIndex({ dbPath: sqliteKnowledgeDbPath(project) });
|
||||
const existingPages = index.getExistingPages();
|
||||
|
|
@ -309,7 +367,7 @@ async function searchLocalKnowledgePagesWithSqlite(
|
|||
|
||||
index.sync(indexPages);
|
||||
|
||||
const finalLimit = input.limit ?? Math.max(1, indexPages.length);
|
||||
const finalLimit = input.limit ?? Math.max(1, allowedPages.length);
|
||||
const core = new HybridSearchCore();
|
||||
const generators: SearchCandidateGenerator[] = [
|
||||
{
|
||||
|
|
@ -318,6 +376,7 @@ async function searchLocalKnowledgePagesWithSqlite(
|
|||
const rows = index.searchLexicalCandidates({
|
||||
queryText: args.queryText,
|
||||
limit: args.laneCandidatePoolLimit,
|
||||
allowedPaths: scopedPaths,
|
||||
});
|
||||
return {
|
||||
candidates: rows.map((row) => ({ id: row.id, rank: row.rank, rawScore: row.rawScore })),
|
||||
|
|
@ -327,7 +386,10 @@ async function searchLocalKnowledgePagesWithSqlite(
|
|||
{
|
||||
lane: 'token',
|
||||
async generate(args) {
|
||||
const rows = tokenLaneCandidates(pages, args.normalizedQuery.terms).slice(0, args.laneCandidatePoolLimit);
|
||||
const rows = tokenLaneCandidates(allowedPages, args.normalizedQuery.terms).slice(
|
||||
0,
|
||||
args.laneCandidatePoolLimit,
|
||||
);
|
||||
return {
|
||||
candidates: rows.map((row, index) => ({
|
||||
id: row.page.path,
|
||||
|
|
@ -349,6 +411,7 @@ async function searchLocalKnowledgePagesWithSqlite(
|
|||
const rows = index.searchSemanticCandidates({
|
||||
queryEmbedding,
|
||||
limit: args.laneCandidatePoolLimit,
|
||||
allowedPaths: scopedPaths,
|
||||
});
|
||||
return {
|
||||
candidates: rows
|
||||
|
|
@ -387,14 +450,14 @@ async function searchLocalKnowledgePagesWithSqlite(
|
|||
|
||||
async function searchLocalKnowledgePagesWithScan(
|
||||
project: KtxLocalProject,
|
||||
input: { query: string; userId?: string; limit?: number },
|
||||
input: { query: string; userId?: string; connectionId?: string; limit?: number },
|
||||
): Promise<LocalKnowledgeSearchResult[]> {
|
||||
const terms = input.query
|
||||
.toLowerCase()
|
||||
.split(/\s+/)
|
||||
.map((term) => term.trim())
|
||||
.filter(Boolean);
|
||||
const pages = await loadAllKnowledgePages(project, { userId: input.userId });
|
||||
const pages = await loadAllKnowledgePages(project, { userId: input.userId, connectionId: input.connectionId });
|
||||
const results: LocalKnowledgeSearchResult[] = [];
|
||||
for (const page of pages) {
|
||||
const score = scorePage(page, terms);
|
||||
|
|
@ -416,7 +479,13 @@ async function searchLocalKnowledgePagesWithScan(
|
|||
|
||||
export async function searchLocalKnowledgePages(
|
||||
project: KtxLocalProject,
|
||||
input: { query: string; userId?: string; embeddingService?: KtxEmbeddingPort | null; limit?: number },
|
||||
input: {
|
||||
query: string;
|
||||
userId?: string;
|
||||
connectionId?: string;
|
||||
embeddingService?: KtxEmbeddingPort | null;
|
||||
limit?: number;
|
||||
},
|
||||
): Promise<LocalKnowledgeSearchResult[]> {
|
||||
if (project.config.storage.search === 'sqlite-fts5') {
|
||||
return searchLocalKnowledgePagesWithSqlite(project, input);
|
||||
|
|
|
|||
|
|
@ -85,6 +85,22 @@ function parseEmbedding(raw: string | null): number[] | null {
|
|||
}
|
||||
}
|
||||
|
||||
/** A provided-but-empty allowlist means "no page is in scope", distinct from an absent (unfiltered) one. */
|
||||
function isEmptyAllowlist(allowedPaths: readonly string[] | undefined): boolean {
|
||||
return allowedPaths !== undefined && allowedPaths.length === 0;
|
||||
}
|
||||
|
||||
/** Build a `<keyword> path IN (?, …)` fragment so the scope filter applies inside the query, before any LIMIT. */
|
||||
function pathInClause(
|
||||
keyword: 'AND' | 'WHERE',
|
||||
allowedPaths: readonly string[] | undefined,
|
||||
): { sql: string; params: string[] } {
|
||||
if (allowedPaths === undefined || allowedPaths.length === 0) {
|
||||
return { sql: '', params: [] };
|
||||
}
|
||||
return { sql: ` ${keyword} path IN (${allowedPaths.map(() => '?').join(', ')})`, params: [...allowedPaths] };
|
||||
}
|
||||
|
||||
function normalizeFtsQuery(query: string): string {
|
||||
const terms = query
|
||||
.toLowerCase()
|
||||
|
|
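
To see how the two helpers above keep placeholders and parameters aligned, here is a standalone sketch with no database involved. `isEmptyAllowlist` and `pathInClause` are re-stated from the diff; `buildLexicalQuery` is a hypothetical wrapper that only assembles the SQL text and parameter list the way the lexical lane appears to, and its simplified `SELECT` is an assumption.

```typescript
function isEmptyAllowlist(allowedPaths: readonly string[] | undefined): boolean {
  return allowedPaths !== undefined && allowedPaths.length === 0;
}

function pathInClause(
  keyword: 'AND' | 'WHERE',
  allowedPaths: readonly string[] | undefined,
): { sql: string; params: string[] } {
  if (allowedPaths === undefined || allowedPaths.length === 0) {
    return { sql: '', params: [] };
  }
  return { sql: ` ${keyword} path IN (${allowedPaths.map(() => '?').join(', ')})`, params: [...allowedPaths] };
}

// Hypothetical: returns null when the allowlist is provided but empty (nothing in scope).
function buildLexicalQuery(
  ftsQuery: string,
  limit: number,
  allowedPaths?: readonly string[],
): { sql: string; params: (string | number)[] } | null {
  if (isEmptyAllowlist(allowedPaths)) {
    return null;
  }
  const filter = pathInClause('AND', allowedPaths);
  return {
    sql: `SELECT path FROM knowledge_pages_fts WHERE knowledge_pages_fts MATCH ?${filter.sql} ORDER BY path ASC LIMIT ?`,
    // Order matters: MATCH term, then one value per IN placeholder, then LIMIT.
    params: [ftsQuery, ...filter.params, Math.max(1, limit)],
  };
}

const scoped = buildLexicalQuery('orders', 5, ['a.md', 'b.md']);
const unscoped = buildLexicalQuery('orders', 5);
const nothingInScope = buildLexicalQuery('orders', 5, []);
```

Because the `IN (...)` fragment sits before `LIMIT`, the limit counts only in-scope rows; an empty allowlist short-circuits rather than producing an unfiltered query.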
@ -217,23 +233,28 @@ export class SqliteKnowledgeIndex {
|
|||
);
|
||||
}
|
||||
|
||||
searchLexicalCandidates(input: { queryText: string; limit: number }): WikiSqliteLaneCandidate[] {
|
||||
searchLexicalCandidates(input: {
|
||||
queryText: string;
|
||||
limit: number;
|
||||
allowedPaths?: readonly string[];
|
||||
}): WikiSqliteLaneCandidate[] {
|
||||
const ftsQuery = normalizeFtsQuery(input.queryText);
|
||||
if (!ftsQuery) {
|
||||
if (!ftsQuery || isEmptyAllowlist(input.allowedPaths)) {
|
||||
return [];
|
||||
}
|
||||
|
||||
const pathFilter = pathInClause('AND', input.allowedPaths);
|
||||
const rows = this.db
|
||||
.prepare(
|
||||
`
|
||||
SELECT path, bm25(knowledge_pages_fts) AS rank
|
||||
FROM knowledge_pages_fts
|
||||
WHERE knowledge_pages_fts MATCH ?
|
||||
WHERE knowledge_pages_fts MATCH ?${pathFilter.sql}
|
||||
ORDER BY rank ASC, path ASC
|
||||
LIMIT ?
|
||||
`,
|
||||
)
|
||||
.all(ftsQuery, Math.max(1, input.limit)) as SearchRow[];
|
||||
.all(ftsQuery, ...pathFilter.params, Math.max(1, input.limit)) as SearchRow[];
|
||||
|
||||
return rows.map((row, index) => ({
|
||||
id: row.path,
|
||||
|
|
@ -243,16 +264,25 @@ export class SqliteKnowledgeIndex {
|
|||
}));
|
||||
}
|
||||
|
||||
searchSemanticCandidates(input: { queryEmbedding: number[]; limit: number }): WikiSqliteLaneCandidate[] {
|
||||
searchSemanticCandidates(input: {
|
||||
queryEmbedding: number[];
|
||||
limit: number;
|
||||
allowedPaths?: readonly string[];
|
||||
}): WikiSqliteLaneCandidate[] {
|
||||
if (isEmptyAllowlist(input.allowedPaths)) {
|
||||
return [];
|
||||
}
|
||||
|
||||
const pathFilter = pathInClause('WHERE', input.allowedPaths);
|
||||
const rows = this.db
|
||||
.prepare(
|
||||
`
|
||||
SELECT path, embedding_json
|
||||
FROM knowledge_pages
|
||||
FROM knowledge_pages${pathFilter.sql}
|
||||
ORDER BY path ASC
|
||||
`,
|
||||
)
|
||||
.all() as IndexedPageRow[];
|
||||
.all(...pathFilter.params) as IndexedPageRow[];
|
||||
|
||||
return rows
|
||||
.flatMap((row) => {
|
||||
|
|
|
|||
|
|
@ -35,6 +35,12 @@ const wikiWriteInputSchema = z.object({
|
|||
tags: z.array(z.string()).optional(),
|
||||
refs: z.array(z.string()).optional(),
|
||||
sl_refs: z.array(z.string()).optional(),
|
||||
connections: z
|
||||
.union([z.string(), z.array(z.string())])
|
||||
.optional()
|
||||
.describe(
|
||||
'Connection ids this page applies to. Set [connectionId] on database-specific pages (with a connection-distinctive key); omit or leave empty for org-wide content. REPLACE semantics like tags.',
|
||||
),
|
||||
source: z.string().optional(),
|
||||
intent: z.string().optional(),
|
||||
tables: z.array(z.string()).optional(),
|
||||
|
|
@ -150,6 +156,33 @@ Keys must be flat file names, not directory paths. Use tags/source frontmatter f
|
|||
const resolvedTags = input.tags === undefined ? existingFm?.tags : input.tags;
|
||||
const resolvedRefs = input.refs === undefined ? existingFm?.refs : input.refs;
|
||||
const resolvedSlRefs = input.sl_refs === undefined ? existingFm?.sl_refs : input.sl_refs;
|
||||
const incomingConnections =
|
||||
input.connections === undefined
|
||||
? undefined
|
||||
: typeof input.connections === 'string'
|
||||
? [input.connections]
|
||||
: input.connections;
|
||||
const resolvedConnections = incomingConnections === undefined ? existingFm?.connections : incomingConnections;
|
||||
|
||||
// Data-loss guard: page keys are a flat global namespace, so a write whose
|
||||
// incoming connection scope is disjoint from an existing same-key page would
|
||||
// silently overwrite a different connection's page. Surface it instead.
|
||||
const existingConnections = existingFm?.connections ?? [];
|
||||
if (
|
||||
existing &&
|
||||
incomingConnections !== undefined &&
|
||||
incomingConnections.length > 0 &&
|
||||
existingConnections.length > 0 &&
|
||||
!incomingConnections.some((id) => existingConnections.includes(id))
|
||||
) {
|
||||
return {
|
||||
markdown:
|
||||
`Error: page "${input.key}" already exists scoped to a different connection ` +
|
||||
`(connections: ${existingConnections.join(', ')}); writing it for ${incomingConnections.join(', ')} ` +
|
||||
`would overwrite that page. Use a connection-distinctive key (e.g. "${input.key}_${incomingConnections[0]}").`,
|
||||
structured: { success: false, key: input.key },
|
||||
};
|
||||
}
|
||||
|
||||
let finalContent: string;
|
||||
const finalFm: WikiFrontmatter = {
|
||||
|
|
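
The data-loss guard in the hunk above reduces to a small predicate. The sketch below is one possible reading of that condition, not the shipped code: `isDisjointScopedOverwrite` is a hypothetical name, `null` models "no existing page", and the connection ids are invented.

```typescript
/**
 * True when a write should be rejected: the page exists, both the existing and
 * the incoming scope are non-empty, and they share no connection id.
 */
function isDisjointScopedOverwrite(
  existingConnections: string[] | null, // null: no page with this key yet
  incomingConnections: string[] | undefined, // undefined: caller did not pass `connections`
): boolean {
  if (existingConnections === null || incomingConnections === undefined) {
    return false;
  }
  return (
    incomingConnections.length > 0 &&
    existingConnections.length > 0 &&
    !incomingConnections.some((id) => existingConnections.includes(id))
  );
}

const rejected = isDisjointScopedOverwrite(['warehouse'], ['crm']); // disjoint scopes
const overlapping = isDisjointScopedOverwrite(['warehouse', 'crm'], ['crm']); // shared id
const unscopedExisting = isDisjointScopedOverwrite([], ['crm']); // existing page is org-wide
const newPage = isDisjointScopedOverwrite(null, ['crm']); // nothing to clobber
const fieldOmitted = isDisjointScopedOverwrite(['warehouse'], undefined); // existing scope is kept
```

Only the first case trips the guard; the others fall through to the normal REPLACE behaviour for `connections`.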
@ -159,6 +192,7 @@ Keys must be flat file names, not directory paths. Use tags/source frontmatter f
|
|||
tags: resolvedTags,
|
||||
refs: resolvedRefs,
|
||||
sl_refs: resolvedSlRefs,
|
||||
connections: resolvedConnections,
|
||||
source: input.source === undefined ? existingFm?.source : input.source,
|
||||
intent: input.intent === undefined ? existingFm?.intent : input.intent,
|
||||
tables: input.tables === undefined ? existingFm?.tables : input.tables,
|
||||
|
|
|
|||
|
|
@ -16,6 +16,12 @@ export interface WikiFrontmatter {
|
|||
tags?: string[];
|
||||
refs?: string[];
|
||||
sl_refs?: string[];
|
||||
/**
|
||||
* Connection ids this page applies to. Absent or empty ⇒ unscoped: the page
|
||||
* applies to all connections. Additive metadata, orthogonal to GLOBAL/USER
|
||||
* scope; it does not namespace page keys.
|
||||
*/
|
||||
connections?: string[];
|
||||
usage_mode: 'always' | 'auto' | 'never';
|
||||
sort_order?: number;
|
||||
source?: string;
|
||||
|
|
|
|||
|
|
@ -1,6 +1,7 @@
|
|||
import { KtxIngestEmbeddingPortAdapter } from './context/llm/embedding-port.js';
|
||||
import type { KtxEmbeddingPort } from './context/core/embedding.js';
|
||||
import { loadKtxProject } from './context/project/project.js';
|
||||
import { assertConfiguredConnectionId } from './context/connections/configured-connections.js';
|
||||
import {
|
||||
type LocalKnowledgeSearchResult,
|
||||
type LocalKnowledgeSummary,
|
||||
|
|
@ -17,12 +18,21 @@ import { createRankBadgeFormatter, printList, type PrintListColumn } from './io/
|
|||
import { emitTelemetryEvent } from './telemetry/index.js';
|
||||
|
||||
export type KtxKnowledgeArgs =
|
||||
| { command: 'list'; projectDir: string; userId: string; output?: string; json?: boolean; cliVersion: string }
|
||||
| {
|
||||
command: 'list';
|
||||
projectDir: string;
|
||||
userId: string;
|
||||
connectionId?: string;
|
||||
output?: string;
|
||||
json?: boolean;
|
||||
cliVersion: string;
|
||||
}
|
||||
| {
|
||||
command: 'search';
|
||||
projectDir: string;
|
||||
query: string;
|
||||
userId: string;
|
||||
connectionId?: string;
|
||||
output?: string;
|
||||
json?: boolean;
|
||||
limit?: number;
|
||||
|
|
@@ -120,7 +130,14 @@ export async function runKtxKnowledge(
   try {
     const project = await loadKtxProject({ projectDir: args.projectDir });
     if (args.command === 'list') {
-      const pages = await listLocalKnowledgePages(project, { userId: args.userId });
+      const connectionId =
+        args.connectionId === undefined
+          ? undefined
+          : assertConfiguredConnectionId(project.config.connections, args.connectionId);
+      const pages = await listLocalKnowledgePages(project, {
+        userId: args.userId,
+        ...(connectionId !== undefined ? { connectionId } : {}),
+      });
       const mode = resolveOutputMode({ explicit: args.output, json: args.json, io });
       printList<LocalKnowledgeSummary>({
         rows: pages,
@@ -145,6 +162,10 @@ export async function runKtxKnowledge(
       return 0;
     }
     if (args.command === 'search') {
+      const connectionId =
+        args.connectionId === undefined
+          ? undefined
+          : assertConfiguredConnectionId(project.config.connections, args.connectionId);
       const embeddingService = await wikiSearchEmbeddingService(project, deps, { cliVersion: args.cliVersion }, io);
       const search = deps.searchLocalKnowledgePages ?? defaultSearchLocalKnowledgePages;
       const results = await search(project, {
@@ -152,6 +173,7 @@ export async function runKtxKnowledge(
         userId: args.userId,
         embeddingService,
         limit: args.limit,
+        ...(connectionId !== undefined ? { connectionId } : {}),
       });
       await emitTelemetryEvent({
         name: 'wiki_query_completed',
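The validate-then-forward pattern used by the `list` and `search` branches can be sketched in isolation. This is a minimal sketch, not the project's implementation: `assertKnownConnectionId` and `buildSearchOptions` are hypothetical stand-ins (the diff shows only the call shape of `assertConfiguredConnectionId`, not its body), and modelling `connections` as a record keyed by id is an assumption.

```typescript
// Hypothetical stand-in for assertConfiguredConnectionId: returns the id when it is
// configured, otherwise throws an error that lists the configured ids.
function assertKnownConnectionId(connections: Record<string, unknown>, id: string): string {
  if (!Object.prototype.hasOwnProperty.call(connections, id)) {
    const known = Object.keys(connections).sort().join(', ') || '(none)';
    throw new Error(`Unknown connection id "${id}". Configured: ${known}`);
  }
  return id;
}

interface SearchOptions {
  userId: string;
  connectionId?: string;
}

// Validate only when the caller passed an id, then spread the key in conditionally so
// an absent id leaves the property off the object instead of setting it to undefined.
function buildSearchOptions(
  connections: Record<string, unknown>,
  userId: string,
  requested?: string,
): SearchOptions {
  const connectionId =
    requested === undefined ? undefined : assertKnownConnectionId(connections, requested);
  return {
    userId,
    ...(connectionId !== undefined ? { connectionId } : {}),
  };
}

// Illustrative connection ids, not taken from the project.
const configured = { sales_db: {}, events_db: {} };
const scoped = buildSearchOptions(configured, 'local', 'sales_db');
const unscoped = buildSearchOptions(configured, 'local');
```

Omitting the key rather than passing `connectionId: undefined` matters when TypeScript's `exactOptionalPropertyTypes` option is enabled, since an explicit `undefined` is then not assignable to `connectionId?: string`. Whether the project enables that option is not visible in this diff.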
|
|
|
|||
|
|
@@ -5,6 +5,7 @@ import type { McpServer } from '@modelcontextprotocol/sdk/server/mcp.js';
 import { StreamableHTTPServerTransport } from '@modelcontextprotocol/sdk/server/streamableHttp.js';
 import { isInitializeRequest } from '@modelcontextprotocol/sdk/types.js';
 import { getKtxCliPackageInfo, type KtxCliIo } from './cli-runtime.js';
+import { createMcpLogger, serializeMcpError } from './context/mcp/logger.js';
 import { createKtxMcpServerFactory } from './mcp-server-factory.js';
 
 const DEFAULT_ALLOWED_HOSTS = ['localhost', '127.0.0.1', '::1'] as const;
@@ -173,6 +174,9 @@ export async function runKtxMcpHttpServer(options: RunKtxMcpHttpServerOptions):
     options.createMcpServer === undefined
       ? await (options.loadProject ?? loadKtxProject)({ projectDir: options.projectDir })
       : undefined;
+  // One logger per process, shared by the tool layer (via the factory) and the
+  // transport lifecycle below. Falls back to a no-op sink for programmatic callers.
+  const logger = createMcpLogger(options.io ?? { stdout: { write() {} }, stderr: { write() {} } });
   const createMcpServer =
     options.createMcpServer ??
     (await createKtxMcpServerFactory({
@@ -180,6 +184,7 @@ export async function runKtxMcpHttpServer(options: RunKtxMcpHttpServerOptions):
       projectDir: options.projectDir,
       cliVersion: options.cliVersion ?? getKtxCliPackageInfo().version,
       io: options.io,
+      logger,
     }));
   const sessions = new Map<string, StreamableHTTPServerTransport>();
 
@@ -189,6 +194,7 @@ export async function runKtxMcpHttpServer(options: RunKtxMcpHttpServerOptions):
       sessionIdGenerator: () => randomUUID(),
       onsessioninitialized: (sessionId) => {
         sessions.set(sessionId, transport);
+        logger.info({ sessionId }, 'session.open');
       },
       onsessionclosed: (sessionId) => {
         sessions.delete(sessionId);
@@ -197,15 +203,25 @@ export async function runKtxMcpHttpServer(options: RunKtxMcpHttpServerOptions):
       allowedOrigins: config.allowedOrigins,
       enableDnsRebindingProtection: true,
     });
+    // onclose is the universal session-end signal (clean DELETE and dropped connection both
+    // close the transport), so session.close is logged here rather than in onsessionclosed.
     transport.onclose = () => {
       if (transport.sessionId) {
         sessions.delete(transport.sessionId);
+        logger.info({ sessionId: transport.sessionId }, 'session.close');
       }
     };
+    transport.onerror = (error) => {
+      logger.error(
+        { ...(transport.sessionId ? { sessionId: transport.sessionId } : {}), err: serializeMcpError(error) },
+        'transport.error',
+      );
+    };
     await createMcpServer().connect(transport);
     return transport;
   }
 
+  const startedAt = performance.now();
   const server = createServer(async (req, res) => {
     const path = requestPath(req);
     const auth = isMcpRequestAuthorized({ path, headers: req.headers }, config);
@@ -216,7 +232,8 @@ export async function runKtxMcpHttpServer(options: RunKtxMcpHttpServerOptions):
 
     if (path === '/health' && req.method === 'GET') {
       const port = listenerPort(server, config.port);
-      writeJson(res, 200, { status: 'ok', projectDir: options.projectDir, port });
+      const uptimeMs = Math.round(performance.now() - startedAt);
+      writeJson(res, 200, { status: 'ok', projectDir: options.projectDir, port, uptimeMs });
       return;
     }
 
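The `uptimeMs` field added to `/health` is the rounded difference between a start timestamp captured once and the clock at request time. A minimal sketch of that calculation follows; `createHealthReporter`, `HealthPayload`, the injected `now` clock, and port `8080` are hypothetical names and values introduced so the arithmetic can be exercised without a running server.

```typescript
interface HealthPayload {
  status: 'ok';
  port: number;
  uptimeMs: number;
}

// Captures a start timestamp once and reports whole milliseconds elapsed since then.
// `now` is injectable so the sketch is testable; a server would typically pass a
// monotonic clock such as `() => performance.now()`, as the diff does.
function createHealthReporter(now: () => number, port: number): () => HealthPayload {
  const startedAt = now();
  return () => ({
    status: 'ok',
    port,
    uptimeMs: Math.round(now() - startedAt),
  });
}

// Fake clock for illustration: starts at 1000.25 ms and is advanced manually.
let fakeTime = 1000.25;
const health = createHealthReporter(() => fakeTime, 8080);
fakeTime += 1500.4;
const snapshot = health();
```

A monotonic clock is the safer choice for durations because a wall clock such as `Date.now()` can jump when the system time is adjusted.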
|
|
|
|||
|
|
@@ -2,6 +2,9 @@ import { KtxIngestEmbeddingPortAdapter } from './context/llm/embedding-port.js';
 import { createDefaultKtxMcpServer } from './context/mcp/server.js';
 import { createLocalProjectMcpContextPorts } from './context/mcp/local-project-ports.js';
 import { createLocalProjectMemoryIngest } from './context/memory/local-memory.js';
+import { assertConfiguredConnectionId } from './context/connections/configured-connections.js';
+import type { KtxMcpLogger } from './context/mcp/logger.js';
+import type { MemoryIngestPort } from './context/mcp/types.js';
 import type { KtxLocalProject } from './context/project/project.js';
 import type { McpServer } from '@modelcontextprotocol/sdk/server/mcp.js';
 import type { KtxCliIo } from './cli-runtime.js';
@@ -23,6 +26,7 @@ export async function createKtxMcpServerFactory(input: {
   projectDir: string;
   cliVersion: string;
   io?: KtxCliIo;
+  logger?: KtxMcpLogger;
 }): Promise<() => McpServer> {
   const io = input.io ?? noopMcpIo();
   const queryExecutor = createKtxCliIngestQueryExecutor(input.project);
@@ -57,13 +61,25 @@ export async function createKtxMcpServerFactory(input: {
     },
   });
 
-  let memoryIngest: ReturnType<typeof createLocalProjectMemoryIngest> | undefined;
+  let memoryIngest: MemoryIngestPort | undefined;
   try {
-    memoryIngest = createLocalProjectMemoryIngest(input.project, {
+    const baseMemoryIngest = createLocalProjectMemoryIngest(input.project, {
       semanticLayerCompute,
       queryExecutor,
       embeddingProvider,
     });
+    // Validate the explicit connectionId argument here so a typo is rejected with the
+    // configured ids before the ingest run starts; persisted page scope is validated
+    // separately (warn-only) and must not fail.
+    memoryIngest = {
+      ingest: (ingestInput) => {
+        if (ingestInput.connectionId !== undefined) {
+          assertConfiguredConnectionId(input.project.config.connections, ingestInput.connectionId);
+        }
+        return baseMemoryIngest.ingest(ingestInput);
+      },
+      status: (runId) => baseMemoryIngest.status(runId),
+    };
   } catch (error) {
     io.stderr.write(`ktx MCP memory_ingest disabled: ${error instanceof Error ? error.message : String(error)}\n`);
   }
@@ -75,6 +91,7 @@ export async function createKtxMcpServerFactory(input: {
     userContext: { userId: 'local' },
     projectDir: input.projectDir,
     io,
+    ...(input.logger ? { logger: input.logger } : {}),
     contextTools: {
       ...contextTools,
       ...(memoryIngest ? { memoryIngest } : {}),
|
|
|
|||
|
|
@@ -4,6 +4,7 @@ import { loadKtxProject } from './context/project/project.js';
 import type { McpServer } from '@modelcontextprotocol/sdk/server/mcp.js';
 import { StdioServerTransport } from '@modelcontextprotocol/sdk/server/stdio.js';
 import { getKtxCliPackageInfo, type KtxCliIo } from './cli-runtime.js';
+import { createMcpLogger, serializeMcpError } from './context/mcp/logger.js';
 import { createKtxMcpServerFactory } from './mcp-server-factory.js';
 
 export interface RunKtxMcpStdioServerOptions {
@@ -25,6 +26,8 @@ export async function runKtxMcpStdioServer(options: RunKtxMcpStdioServerOptions)
     stdout: { write() {} },
     stderr: options.io?.stderr ?? process.stderr,
   };
+  // stdout is reserved for JSON-RPC, so the logger writes to stderr only.
+  const logger = createMcpLogger(protocolIo);
   const createMcpServer =
     options.createMcpServer ??
     (await createKtxMcpServerFactory({
@@ -32,6 +35,7 @@ export async function runKtxMcpStdioServer(options: RunKtxMcpStdioServerOptions)
       projectDir: options.projectDir,
       cliVersion: options.cliVersion ?? getKtxCliPackageInfo().version,
       io: protocolIo,
+      logger,
     }));
   const stdin = options.stdin ?? process.stdin;
   const transport = new StdioServerTransport(stdin, options.stdout);
@@ -50,13 +54,17 @@ export async function runKtxMcpStdioServer(options: RunKtxMcpStdioServerOptions)
         settle(() => reject(error instanceof Error ? error : new Error(String(error))));
       });
     };
-    transport.onclose = () => settle(resolve);
+    transport.onclose = () => {
+      logger.info({}, 'session.close');
+      settle(resolve);
+    };
     transport.onerror = (error) => {
-      options.io?.stderr.write(`ktx MCP stdio transport error: ${error.message}\n`);
+      logger.error({ err: serializeMcpError(error) }, 'transport.error');
       settle(() => reject(error));
     };
     stdin.once('end', closeTransport);
     stdin.once('close', closeTransport);
+    logger.info({}, 'session.open');
     createMcpServer().connect(transport).catch((error: unknown) => {
       settle(() => reject(error instanceof Error ? error : new Error(String(error))));
     });
|
|
|
|||
|
|
@@ -46,7 +46,7 @@ const NOTION_SCRIPTED_MODE_HINT =
   'Notion picker requires a TTY. Use --no-input --notion-root-page-id <UUID> for scripted mode.';
 
 function assertSafeNotionPickerConnectionId(connectionId: string): void {
-  if (!/^[a-zA-Z0-9][a-zA-Z0-9_-]*$/.test(connectionId)) {
+  if (!/^[a-zA-Z0-9_][a-zA-Z0-9_-]*$/.test(connectionId)) {
     throw new Error(`Unsafe connection id: ${connectionId}`);
   }
 }
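The only change in this hunk is the first character class of the pattern. The sketch below runs both patterns, copied from the hunk, over a handful of sample ids to show what the wider class admits. The sample ids and the `comparison` table are illustrative and not taken from the project.

```typescript
// The two patterns from the hunk above: the old one required an alphanumeric first
// character, the new one also accepts a leading underscore.
const OLD_SAFE_ID = /^[a-zA-Z0-9][a-zA-Z0-9_-]*$/;
const NEW_SAFE_ID = /^[a-zA-Z0-9_][a-zA-Z0-9_-]*$/;

// Sample ids are illustrative only.
const samples = ['sales_db', '_staging', '-leading-dash', '../escape', ''];

// One row per sample: which pattern accepts it.
const comparison = samples.map((id) => ({
  id,
  oldAccepts: OLD_SAFE_ID.test(id),
  newAccepts: NEW_SAFE_ID.test(id),
}));
```

Only `_staging` changes status between the two patterns. A leading dash, path separators, dots, and the empty string are rejected by both.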
|
|
|
|||
|
|
@@ -19,6 +19,8 @@ A single artifact typically produces multiple actions: one SL source per table/v
 
 <scope>
 All wiki writes go to the GLOBAL scope - they will be visible to every user of this ktx project. Phrase wiki pages as objective business knowledge, not personal preference. The `wiki_write` tool handles scope selection automatically for external ingest.
+
+When a `connectionId` is shown in the prompt context, tag database-specific pages with `connections: [<that id>]` and give them connection-distinctive keys (`orders_sales_db`, not `orders`) so same-concept pages from other databases do not collide or pollute each other's searches. Leave `connections` empty for org-wide knowledge that applies across every database. See the `wiki_capture` skill's "Connection scoping" section.
 </scope>
 
 <do_not>
|
|
|
|||
|
|
@@ -20,7 +20,7 @@ import {
 import { createAggregateProgressPort } from './progress-port-adapter.js';
 import { resolvePublicIngestRuntimeRequirements } from './runtime-requirements.js';
 import type { KtxScanArgs, KtxScanDeps } from './scan.js';
-import type { KtxTableRef } from './context/scan/types.js';
+import type { KtxScanEnrichmentStage, KtxTableRef } from './context/scan/types.js';
 import { profileMark } from './startup-profile.js';
 import { isDemoConnection } from './telemetry/demo-detect.js';
 import { emitProjectStackSnapshot, emitTelemetryEvent, reportException } from './telemetry/index.js';
@@ -46,6 +46,7 @@ export type KtxPublicIngestArgs =
       queryHistory?: KtxPublicIngestQueryHistoryFlag;
       queryHistoryWindowDays?: number;
       scanMode?: Extract<KtxScanArgs, { command: 'run' }>['mode'];
+      stages?: KtxScanEnrichmentStage[];
       detectRelationships?: boolean;
       cliVersion?: string;
       runtimeInstallPolicy?: KtxManagedPythonInstallPolicy;
@@ -123,6 +124,7 @@ interface KtxPublicContextBuildArgs {
   queryHistory?: KtxPublicIngestQueryHistoryFlag;
   queryHistoryWindowDays?: number;
   scanMode?: Extract<KtxScanArgs, { command: 'run' }>['mode'];
+  stages?: KtxScanEnrichmentStage[];
   detectRelationships?: boolean;
   cliVersion?: string;
   runtimeInstallPolicy?: KtxManagedPythonInstallPolicy;
@@ -974,6 +976,7 @@ async function runIngestTargetSteps(
       mode: 'enriched',
       detectRelationships: target.detectRelationships === true,
       dryRun: false,
+      ...(args.stages ? { stages: args.stages } : {}),
       ...(args.cliVersion ? { cliVersion: args.cliVersion } : {}),
       ...(args.runtimeInstallPolicy ? { runtimeInstallPolicy: args.runtimeInstallPolicy } : {}),
     };
@@ -1153,6 +1156,7 @@ export async function runKtxPublicIngest(
       ...(args.queryHistory ? { queryHistory: args.queryHistory } : {}),
       ...(args.queryHistoryWindowDays !== undefined ? { queryHistoryWindowDays: args.queryHistoryWindowDays } : {}),
       ...(args.scanMode ? { scanMode: args.scanMode } : {}),
+      ...(args.stages ? { stages: args.stages } : {}),
       ...(args.detectRelationships !== undefined ? { detectRelationships: args.detectRelationships } : {}),
       ...(args.cliVersion ? { cliVersion: args.cliVersion } : {}),
       ...(args.runtimeInstallPolicy ? { runtimeInstallPolicy: args.runtimeInstallPolicy } : {}),
|
|
|
|||
|
|
@@ -1,4 +1,10 @@
-import type { KtxProgressPort, KtxScanMode, KtxScanReport, KtxScanWarning } from './context/scan/types.js';
+import type {
+  KtxProgressPort,
+  KtxScanEnrichmentStage,
+  KtxScanMode,
+  KtxScanReport,
+  KtxScanWarning,
+} from './context/scan/types.js';
 import { runLocalScan } from './context/scan/local-scan.js';
 import { loadKtxProject, type KtxLocalProject } from './context/project/project.js';
 import { getKtxCliPackageInfo } from './cli-runtime.js';
@@ -21,6 +27,8 @@ export interface KtxScanArgs {
   mode: KtxScanMode;
   detectRelationships: boolean;
   dryRun: boolean;
+  /** Enrichment stages to (re)run; omit to run all eligible stages. */
+  stages?: KtxScanEnrichmentStage[];
   databaseIntrospectionUrl?: string;
   cliVersion?: string;
   runtimeInstallPolicy?: KtxManagedPythonInstallPolicy;
|
|
@ -180,8 +188,14 @@ function describeWarningGroup(code: string, count: number): string {
|
|||
return `${count} LLM relationship ${plural(count, 'proposal')} failed.`;
|
||||
case 'scan_enrichment_backend_not_configured':
|
||||
return 'Scan enrichment backend is not configured; AI stages were skipped.';
|
||||
case 'enrichment_stage_skipped':
|
||||
return `${count} requested ${plural(count, 'enrichment stage')} could not run (prerequisite missing).`;
|
||||
case 'enrichment_stage_stale':
|
||||
return `${count} enrichment ${plural(count, 'stage')} are stale after a selective run; re-run them to refresh.`;
|
||||
case 'credential_redacted':
|
||||
return `${count} ${plural(count, 'credential')} were redacted from scan output.`;
|
||||
case 'object_introspection_failed':
|
||||
return `${count} ${plural(count, 'object')} skipped during introspection (broken or inaccessible objects were excluded; the rest were ingested).`;
|
||||
default:
|
||||
return `${count} ${plural(count, 'warning')} (${code})`;
|
||||
}
|
||||
|
|
@@ -348,6 +362,7 @@ export async function runKtxScan(args: KtxScanArgs, io: KtxCliIo = process, deps
       connectionId: args.connectionId,
       mode: args.mode,
       detectRelationships: args.detectRelationships,
+      ...(args.stages ? { stages: args.stages } : {}),
       dryRun: args.dryRun,
       trigger: 'cli',
       databaseIntrospectionUrl: args.databaseIntrospectionUrl,
|
|
|
|||
|
|
@@ -320,7 +320,7 @@ function unique(values: string[]): string[] {
 }
 
 function assertSafeDatabaseConnectionId(connectionId: string): void {
-  if (!/^[a-zA-Z0-9][a-zA-Z0-9_-]*$/.test(connectionId)) {
+  if (!/^[a-zA-Z0-9_][a-zA-Z0-9_-]*$/.test(connectionId)) {
     throw new Error(`Unsafe connection id: ${connectionId}`);
   }
 }
|
|
|
|||
|
|
@@ -13,12 +13,15 @@ You have access to ktx MCP tools for data discovery, semantic-layer analysis, ra
    - `kind: 'wiki'` -> `wiki_read`
    - `kind: 'sl_source'`, `kind: 'sl_measure'`, or `kind: 'sl_dimension'` -> `sl_read_source`
    - `kind: 'table'` or `kind: 'column'` -> `entity_details`
+   - For tables you intend to query, sample a few rows (`entity_details` plus a small `sql_execution` sample) to confirm date encoding, null prevalence in join/filter keys, and the real enum values — see the `<sql_craft>` Schema-discovery rules.
 3. **Resolve business values** - if the user named a value such as "Acme Corp", "enterprise", or "status=shipped", call `dictionary_search` to find which column holds it.
-4. **Plan the analysis** - identify the grain, metrics, dimensions, filters, time window, and expected row limits before querying.
+4. **Plan the analysis** - identify the grain, metrics, dimensions, filters, time window, and expected row limits before querying. Confirm each filter/join column's real type before comparing it (see the `<sql_craft>` Schema-discovery rules). **Write down the exact output-column list first** — enumerate, from the question, every column the answer must have (each requested metric/attribute; for every grouped or named entity BOTH its id and its name; every input to each derived value) and treat that list as the contract your final `SELECT` must match column-for-column. Decide this list *before* writing SQL, not after — building the projection to a pre-stated list is far more reliable than reviewing for omissions at the end.
 5. **Query** -
    - Prefer `sl_query` when the semantic layer covers the question.
    - Use `sql_execution` only for questions the semantic layer does not cover.
-6. **Validate and explain** - sanity-check totals, filters, null handling, and time zones. State the source tables or semantic-layer objects used.
+   - Before writing raw `sql_execution` SQL against a connection, call `sql_dialect_notes` with its connection id to get that engine's FQTN, identifier-quoting, date, top-N, series/calendar, rolling-window, safe-cast, and JSON conventions.
+   - When authoring raw SQL, apply the `<sql_craft>` rules: build incrementally, keep window ordering deterministic, compute at full precision, and match the answer's grain to the question.
+6. **Validate and explain** - sanity-check totals, filters, null handling, and time zones. **Always run the final completeness check before emitting:** re-read the question and confirm every requested output, each named entity's identity, each derived value's inputs, and the question's grain are all in the projection — see the `<sql_craft>` Final completeness check. If a result is unexpectedly empty or its grain looks wrong, work through the `<sql_craft>` Answer-completeness rules to diagnose. State the source tables or semantic-layer objects used.
 7. **Capture durable learnings** - call `memory_ingest` whenever a turn produces something worth remembering (business rules, metric definitions, schema gotchas, recurring findings) **or** whenever the user asks you to remember something. Pass markdown in `content` including any source context the memory agent should weigh. Each call is a feedback loop; better notes today mean smarter `discover_data` and `wiki_search` results tomorrow.
 </workflow>
 
|
|
@ -38,6 +41,201 @@ You have access to ktx MCP tools for data discovery, semantic-layer analysis, ra
|
|||
- Ask a concise clarification only when the metric, date range, entity, or grain is genuinely ambiguous and cannot be inferred from context.
|
||||
</rules>
|
||||
|
||||
<sql_craft>
|
||||
Heuristics for writing *correct* (not merely runnable) SQL. Each is a default plus the reason it holds on any database; apply judgment to the question and the data.
|
||||
|
||||
**Schema discovery before writing SQL**
|
||||
- **Sample before you compose.** Inspect representative rows of every table you will touch (`entity_details` plus a small `sql_execution` sample) to confirm date/time encoding (`YYYYMMDD` integer vs ISO text vs epoch), null prevalence in join/filter keys, and the real set of categorical/enum values. Assumptions about encoding and nullability are the most common source of silently-wrong filters.
|
||||
- **Cast to the real type before comparing.** Compare a column against a literal of its actual type in `WHERE`/`JOIN`. A string column compared to a numeric literal (or the reverse) can silently match nothing instead of raising an error.
|
||||
- **Parse text-encoded numerics before doing math on them.** When a column the question treats as a number is stored as text, sample its **distinct** values (the *Sample before you compose* habit) to learn the encodings actually present — unit suffixes (`K`/`M`/`B`), currency symbols, thousands separators, percent signs, and non-numeric sentinels (`-`, `N/A`, empty) — and never infer the format from the column name. *Why:* aggregated or compared as-is the text sorts lexically (`'100' < '9'`) and a naive cast collapses formatted values to `0`/NULL, so the query runs but the number is silently wrong instead of erroring.
|
||||
- **Strip, scale, and cast in one early CTE.** Strip currency/separator/percent characters, multiply by the suffix scale (`K`=10^3, `M`=10^6, `B`=10^9), map sentinels to `0` **or** `NULL` (by the *Default by additivity* rule below), then cast to a numeric type — all in a single early CTE so every layer above sees clean numbers. This is the *meaning-is-numeric* complement to *Cast to the real type before comparing*. *Why:* one clean conversion at the base keeps the lexical-sort-and-cast-to-0 failure out of every downstream layer.
|
||||
- **Confirm the parse covered every value.** After parsing, count the non-sentinel rows that failed to parse — a failed parse should surface as `NULL`, visible only with a **failure-detecting cast** from `sql_dialect_notes` (a plain `CAST` errors on some engines and on sqlite silently returns `0`/partial, so an `IS NULL` check is meaningless there). *Why:* an encoding the sample missed would otherwise vanish into `0`/NULL instead of being caught.
|
||||
- **Parse code/dependency text by its real grammar, not one broad regex.** When a question extracts imported/required/loaded packages or modules from stored source text or dependency manifests, parse by the *language or format*, not a single pattern: Java `import`/`import static` — drop the terminal class/member, keep the package path, and allow valid identifier segments with underscores and mixed case (e.g. com.planet_ink.coffee_mud); Python — handle both `import a, b as c` and `from a.b import c`, stripping aliases; R — handle `library(...)` and `require(...)`; notebooks (`.ipynb`) — parse the JSON and read each cell's `source` lines *before* applying the language rules (never regex the raw notebook file, whose prose contains the words "import"/"from"); JSON/manifest files — `PARSE_JSON` and flatten the dependency object's keys (e.g. `require`). Strip comments/prose lines first and split multi-import lines so each declared dependency is counted once. *Why:* a single lowercase-segment regex silently drops real identifiers and matches prose, so the ranking is wrong though the query runs.
|
||||
- **Decide the counting population explicitly when a table is deduplicated.** If the source table is de-duplicated and carries a documented copy/occurrence count (e.g. a `copies` column = "repositories sharing this exact content"), the count grain is a real modeling choice: weight by that column only when the question's population is clearly the represented files/repositories; otherwise count the distinct stored rows. State which population the question names and match it — do not default to one silently. *Why:* on a deduplicated table `COUNT(*)` and `SUM(copies)` give different rankings, so the right metric depends on the population the question asks about, not on which is larger.
|
||||
|
||||
```sql
-- "Total trade volume" where value_text holds '1.2K', '3M', '$1,200', '-'.
-- WRONG: a naive cast collapses the formatted values ('1.2K'->1.2, '$1,200'->0,
-- '-'->0) instead of erroring, so the SUM comes back silently far too low.
SELECT SUM(CAST(value_text AS REAL)) AS total_volume FROM metrics;

-- RIGHT: strip symbols/suffixes, scale by the K/M/B suffix, map sentinels to 0, and
-- cast once in an early CTE; the SUM then runs over clean numbers.
WITH parsed AS (
  SELECT CASE WHEN value_text IN ('-', 'N/A', '') THEN 0
              ELSE CAST(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(value_text,
                     '$', ''), ',', ''), 'K', ''), 'M', ''), 'B', '') AS DECIMAL(18, 4))
                   * CASE WHEN value_text LIKE '%K' THEN 1000
                          WHEN value_text LIKE '%M' THEN 1000000
                          WHEN value_text LIKE '%B' THEN 1000000000 ELSE 1 END
         END AS volume
  FROM metrics
)
SELECT SUM(volume) AS total_volume FROM parsed;
```
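The strip, scale, and cast rule and the "confirm the parse covered every value" rule can also be illustrated outside SQL. The following is a minimal TypeScript sketch under stated assumptions: `parseVolume`, the sentinel set, and the sample values are hypothetical, and sentinels are mapped to `0` on the assumption that the measure is additive. A value that matches no known encoding comes back as `null` so it can be counted instead of silently becoming `0`.

```typescript
const SENTINELS = new Set(['-', 'N/A', '']);
const SUFFIX_SCALE: Record<string, number> = { K: 1e3, M: 1e6, B: 1e9 };

// Returns a number for a parsed value, 0 for a sentinel (treating the measure as
// additive), and null when the text matches no known encoding.
function parseVolume(raw: string): number | null {
  const text = raw.trim();
  if (SENTINELS.has(text)) return 0;
  // Optional "$", digits with optional thousands separators, optional decimals,
  // optional K/M/B suffix. Anything else is a parse failure.
  const match = /^\$?(\d[\d,]*(?:\.\d+)?)([KMB])?$/.exec(text);
  if (match === null) return null;
  const base = Number(match[1].replace(/,/g, ''));
  const scale = match[2] === undefined ? 1 : SUFFIX_SCALE[match[2]];
  return base * scale;
}

// Illustrative values; '12%' is an encoding this parser deliberately does not know.
const rawValues = ['1.2K', '3M', '$1,200', '-', '12%'];
const parsed = rawValues.map(parseVolume);

// The coverage check: non-sentinel inputs that failed to parse.
const unparsed = rawValues.filter((_, i) => parsed[i] === null);
const total = parsed.reduce<number>((sum, v) => sum + (v ?? 0), 0);
```

Here `unparsed` surfaces `'12%'`, which is the signal to extend the parser before trusting `total`.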
|
||||
- **Canonicalize observed URL-path variants before page-level analysis.** When a question groups, filters, or sequences web pages by a `path`/`url` column, sample its distinct values first. If the data itself shows route-label variants — `/route` and `/route/` for the same page context — define a canonical page-path expression in an early CTE and use it everywhere above that CTE: preserve `/` as root, strip trailing slashes only from non-root paths, and map an observed empty path to `/` *only* when the column is a URL path and the sampled rows show blank root-page events. Do **not** merge different route names (`/input` ≠ `/regist/input`), strip query strings/fragments/host/scheme, lowercase paths, or canonicalize at all when the question asks for the raw stored URL/path or for slash-vs-no-slash differences. *Why:* raw request logs routinely store the same user-visible page both with and without a trailing slash, so grouping or sequencing the raw labels silently splits one page into several — but inventing aliases the data doesn't show would just as silently merge distinct pages.
|
||||
|
||||
**Composition**
|
||||
- **Build incrementally.** Assemble complex queries one CTE at a time, checking each layer's output on a small sample before stacking the next; a wrong intermediate layer is far cheaper to catch early than to debug in the final number.
|
||||
- **Avoid fan-out joins — the danger is cumulative.** Any one-to-many hop on the path between a measure's owning table and the aggregate inflates that measure, even when the offending join sits several hops below the `SUM`/`COUNT` and is easy to miss. The fix is the single-hop one applied per measure-owning table along the whole chain: pre-aggregate each coarse-grained measure to its own grain in a CTE, then join the already-aggregated result.
|
||||
- **Verify the grain holds across each join.** As you compose, confirm a join you intend to be one-to-one / many-to-one did not change the grain you aggregate at — e.g. the row count (or the count of the aggregate's key) is unchanged across it. When a join is genuinely one-to-many, reach for the default fix (pre-aggregate to grain); for a pure count, `COUNT(DISTINCT key)` is an acceptable escape hatch. A `SUM`/`AVG` of a fanned-out measure must pre-aggregate — `DISTINCT` cannot de-duplicate a sum.
|
||||
- **A join that only attaches a label must not drop rows — `LEFT JOIN` it, and key the aggregate on the fact column.** Fan-out's mirror image is just as silent: when you join a dimension table *only to fetch a display attribute* (a name for an id, a category for a product), an **incomplete** dimension — and dimensions are routinely incomplete: trimmed catalogs, late-arriving rows, slowly-changing-dimension gaps — makes a plain inner `JOIN` quietly **discard every fact row whose key has no parent**, shrinking the counts, sums, and the universe over which any share / average / median is computed (a measure halves with no error and no empty result). Two guards: (1) inner-join a dimension only when you *intend it as a filter* — you want exactly the rows that have a parent — never merely to read a column off it; for pure enrichment use `LEFT JOIN`. (2) Key the aggregation and `GROUP BY` on the **fact** column (`sales.prod_id`), not the dimension column (`products.prod_id`), so an unmatched key yields a `NULL` label on its own row rather than dropping or collapsing it. Use the same row-count check as above, but for an enrichment join confirm the fact row count is *unchanged* (not merely un-inflated); if a dimension you only wanted a name from removed rows, that is the bug.
|
||||
- **Source each filter, date, and measure from the table that OWNS it at the question's grain.** When two joined fact tables carry similarly-named columns at *different* grains — a parent (one row per order: its `status`, placement `created_at`, `num_of_item`) and its child (one row per line item: line `created_at`, `sale_price`, `cost`) — read each predicate/measure from the table whose grain the question names, not from whichever is in scope after the join. "Orders that are Complete", "for each month of the orders", "the order's creation date" are *order*-grain, so the status filter and the month bucket come from the parent order row, even though the child also has `status`/`created_at` columns; line price and cost come from the child. *Why:* the parent's and child's copies of a column diverge (an item's placement month or status can differ from its order's), so anchoring an order-grain filter or calendar on the line table silently buckets/filters the wrong rows. The mirror at metric grain: never combine a parent-grain count with child rows after the join (e.g. `num_of_item * SUM(line_price)` once per line) — compute each measure at its own grain (sum line prices to the order, take `num_of_item` once per order) before combining.
|
||||
|
||||
```sql
-- "How many orders per region contain a returned item?" — count each order once.
-- WRONG: order_lines is joined to apply the line-level filter, which multiplies
-- orders; an order with two returned lines is counted twice, three joins below
-- the COUNT, where the inflation is easy to miss.
SELECT r.region_id, COUNT(*) AS n_orders
FROM regions r
JOIN stores s ON s.region_id = r.region_id
JOIN orders o ON o.store_id = s.store_id
JOIN order_lines l ON l.order_id = o.order_id
WHERE l.status = 'returned'
GROUP BY r.region_id;

-- RIGHT: collapse order_lines to one row per qualifying order first, then join up
-- so each order contributes exactly once.
WITH returned_orders AS (
  SELECT order_id FROM order_lines WHERE status = 'returned' GROUP BY order_id
)
SELECT r.region_id, COUNT(*) AS n_orders
FROM regions r
JOIN stores s ON s.region_id = r.region_id
JOIN orders o ON o.store_id = s.store_id
JOIN returned_orders ro ON ro.order_id = o.order_id
GROUP BY r.region_id;
-- A pure count could also use COUNT(DISTINCT o.order_id); a SUM/AVG of an
-- order-level measure fanned out this way must pre-aggregate — DISTINCT can't
-- de-duplicate a sum.
```
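The same fan-out effect can be reproduced on in-memory arrays, which makes the inflation easy to see on three orders. This is an illustrative TypeScript sketch rather than anything from the project: the `OrderLine` shape and the sample rows are assumptions, and the array operations only emulate what the join and the pre-aggregating CTE do.

```typescript
interface OrderLine {
  orderId: number;
  status: 'shipped' | 'returned';
}

// Illustrative data: order 1 has two returned lines, order 2 has one, order 3 none.
const lines: OrderLine[] = [
  { orderId: 1, status: 'returned' },
  { orderId: 1, status: 'returned' },
  { orderId: 2, status: 'returned' },
  { orderId: 2, status: 'shipped' },
  { orderId: 3, status: 'shipped' },
];
const orderIds = [1, 2, 3];

// Join-then-count: one output row per (order, matching line) pair, so order 1 counts twice.
const joinedCount = orderIds.flatMap((id) =>
  lines.filter((l) => l.orderId === id && l.status === 'returned'),
).length;

// Pre-aggregate: collapse lines to the set of qualifying order ids, then count orders.
const returnedOrderIds = new Set(
  lines.filter((l) => l.status === 'returned').map((l) => l.orderId),
);
const preAggregatedCount = orderIds.filter((id) => returnedOrderIds.has(id)).length;
```

The joined count is 3 while only 2 orders contain a returned item, which is the gap the pre-aggregation closes.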
|
||||
**Ordering & aggregation determinism**
|
||||
- **Make the ordering deterministic.** Give every ranking/ordering window a complete tie-breaker by appending unique key column(s) to `ORDER BY`, so `RANK`/`ROW_NUMBER`/`LAG` results are stable instead of flickering between runs.
|
||||
- **Order inside string/array aggregation.** When concatenating rows into a delimited string or building an ordered array (`GROUP_CONCAT` / `string_agg` / `array_agg`), the element order is **undefined unless you specify it** — put an explicit `ORDER BY` on the aggregate. Be deliberate about collation: the default text sort is **binary/case-sensitive** (so `'BBQ'` sorts before `'Bacon'` because uppercase code points precede lowercase), which differs from a case-insensitive sort; pick the one the question implies and apply it consistently (`ORDER BY ... COLLATE NOCASE` for case-insensitive). *Why:* an unordered or differently-collated concatenation produces a string with the right elements in the wrong order — runnable but not matching the expected text.
|
||||
- **Emit a list-valued answer cell as a delimited STRING, not a raw ARRAY/repeated column.** When the answer needs several values in one cell (a set of names/codes/tags for an entity), build a delimited scalar with `STRING_AGG(x, ',' ORDER BY x)` (or `ARRAY_TO_STRING(ARRAY_AGG(x ORDER BY x), ',')`) — do not return a SQL `ARRAY`/repeated column. *Why:* an array column serializes to an engine-specific representation (e.g. `['a' 'b']` or `["a","b"]`) that won't compare equal to a plain delimited list (`a,b`), so a values-correct answer still mismatches when materialized to rows.
|
||||
- **Filter after the window, not before**, for sequence / "first" / "most recent" / "since" questions: compute the window over the full partition, then keep the rows you want. A pre-filter shrinks the partition the window ranks over, so "first"/"most recent" is measured against the wrong set.
|
||||
|
||||
```sql
-- "Each customer's first order, restricted to orders since 2024-01-01."
-- Wrong: the filter runs before the window, so it ranks only rows dated on or
-- after 2024-01-01 and reports a later order as seq = 1 for customers whose
-- true first order was earlier.
SELECT customer_id, order_id,
       ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date, order_id) AS seq
FROM orders
WHERE order_date >= '2024-01-01'; -- then keep seq = 1

-- Right: rank the full partition in a CTE, then filter in the outer query.
WITH ranked AS (
  SELECT customer_id, order_id, order_date,
         ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date, order_id) AS seq
  FROM orders
)
SELECT customer_id, order_id, order_date
FROM ranked
WHERE seq = 1 AND order_date >= '2024-01-01';
```
|
||||
|
||||
- **Cumulative / running total.** Use an explicit frame — `SUM(x) OVER (PARTITION BY k ORDER BY t ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)` — with a complete tie-breaker on the `ORDER BY` (per the deterministic-ordering rule above). *Why:* a bare `ORDER BY` defaults to a `RANGE`-based frame bounded at the current row, which on ties in the order key folds every tied peer into one cumulative value — it runs and looks plausible, but the running total jumps at each tie boundary.
|
||||
- **Rolling window over calendar time, plus minimum periods.** "Rolling N days/months" spans a *calendar range*, not a fixed row count: a `ROWS BETWEEN n-1 PRECEDING` frame silently measures the wrong span when days are missing. Two sanctioned paths — (a) build a gap-free date spine first (the **Series** idiom from `sql_dialect_notes`) so one row exists per calendar unit, then a `ROWS BETWEEN n-1 PRECEDING AND CURRENT ROW` frame equals the intended span (fully portable); or (b) where the engine supports it, a native calendar range frame — or a date-keyed self-join — expresses the window directly: get the rolling-window idiom from `sql_dialect_notes`, do not inline it. For **minimum periods** ("only after N periods of data"), emit `NULL` until the window is full — guard on `COUNT(*) OVER (<same frame>) = N`, counting non-null observations instead when "N periods" means N data points rather than N calendar slots. *Why:* a row-count frame over missing dates measures the wrong span, and a partial early window is not the requested metric.
|
||||
- **Period-over-period.** Compare against the prior period with `LAG(metric) OVER (PARTITION BY k ORDER BY period)`; compute growth as `(cur - prev) / prev` at full precision, rounding only in the final projection (per the round-at-the-end rule below), and guard the divide against a zero or absent prior — e.g. `… / NULLIF(prev, 0)`. *Why:* without `LAG`, or ordered against the wrong neighbor, the comparison lands on the wrong period, and an unguarded ratio errors or returns garbage when the prior period is zero or missing.
|
||||
|
||||
```sql
-- "Each account's running balance over time": a cumulative sum of net per
-- account, in date order.
-- WRONG: a bare ORDER BY defaults to a RANGE-based frame, so two txns dated the
-- same day share one inflated balance (every tied peer folds into that value).
SELECT account_id, txn_date, net,
       SUM(net) OVER (PARTITION BY account_id ORDER BY txn_date) AS running_balance
FROM account_txns;

-- RIGHT: an explicit ROWS frame accumulates row by row, and a complete tie-breaker
-- (txn_id) makes the order, and with it the running total, deterministic across ties.
SELECT account_id, txn_date, net,
       SUM(net) OVER (PARTITION BY account_id ORDER BY txn_date, txn_id
                      ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_balance
FROM account_txns;
```
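The tie behaviour described above can be checked outside a database. The following is a minimal TypeScript sketch, not part of the prompt or the ktx code: the `Txn` shape, the function names, and the sample rows are all hypothetical, and it models a single account so the `PARTITION BY` is implicit.

```typescript
// Hypothetical transaction shape mirroring the account_txns example (one account only).
interface Txn {
  txnId: number;
  txnDate: string; // ISO date, so string comparison matches date order
  net: number;
}

// ORDER BY txn_date, txn_id: a complete tie-breaker.
function byDateThenId(a: Txn, b: Txn): number {
  if (a.txnDate !== b.txnDate) return a.txnDate < b.txnDate ? -1 : 1;
  return a.txnId - b.txnId;
}

// ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW: accumulate row by row.
function runningTotalRows(txns: Txn[]): number[] {
  let total = 0;
  return [...txns].sort(byDateThenId).map((t: Txn) => {
    total += t.net;
    return total;
  });
}

// Default RANGE frame with ORDER BY txn_date only: each row sees the sum of
// every row whose date is <= its own, so same-day peers share one value.
function runningTotalRange(txns: Txn[]): number[] {
  const ordered = [...txns].sort(byDateThenId);
  return ordered.map((t: Txn) =>
    ordered
      .filter((p: Txn) => p.txnDate <= t.txnDate)
      .reduce((sum: number, p: Txn) => sum + p.net, 0),
  );
}

// Illustrative data: two transactions fall on 2024-01-02.
const sampleTxns: Txn[] = [
  { txnId: 1, txnDate: '2024-01-01', net: 100 },
  { txnId: 2, txnDate: '2024-01-02', net: 50 },
  { txnId: 3, txnDate: '2024-01-02', net: -20 },
  { txnId: 4, txnDate: '2024-01-03', net: 10 },
];

console.log(runningTotalRows(sampleTxns));  // [100, 150, 130, 140]
console.log(runningTotalRange(sampleTxns)); // [100, 130, 130, 140]
```

With these rows the two frames disagree only on the first of the two same-day transactions, which is why the bare `ORDER BY` version looks plausible at a glance.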
|
||||
|
||||
**Numeric precision**
|
||||
- **Integer division truncates on postgres/sqlite/tsql.** The `/` operator between two integers does integer division on **postgres, sqlite, and SQL Server** — `5 / 2` is `2`, `wins / games` is `0` — so a rate, share, or `SUM(a) / COUNT(*)` silently floors to an integer. Cast one operand to a fractional type before dividing: `wins * 1.0 / games`, `CAST(wins AS REAL) / games`, or `SUM(a)::numeric / COUNT(*)`, then round at the end. mysql and bigquery already return a fractional result from `/` (on bigquery prefer `SAFE_DIVIDE` to also guard a zero denominator).
|
||||
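- **Round only at the end.** Compute at full precision and round in the final projection, never inside intermediate CTEs. Be explicit about truncation: what an integer cast (`CAST(x AS INT)`) does to the fractional part is engine-dependent (sqlite and SQL Server truncate toward zero, while postgres, mysql, and bigquery round), so call an explicit rounding or truncation function for whichever one you mean.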
|
||||
- **Macro vs micro average.** Match the average to the wording. "Average of per-group averages" is `AVG(group_metric)`; an "overall" or "weighted" average is `SUM(numerator) / SUM(denominator)`. The two diverge whenever group sizes differ.
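
As a worked illustration of the numeric-precision rules above, here is a small TypeScript sketch. The `GroupStat` type, the function names, and the numbers are made up for the example; `Math.trunc` stands in for what `/` does between two integers on an engine with integer division.

```typescript
// Hypothetical per-group totals; names and numbers are illustrative only.
interface GroupStat {
  group: string;
  numerator: number;   // e.g. wins
  denominator: number; // e.g. games
}

const stats: GroupStat[] = [
  { group: 'a', numerator: 1, denominator: 2 },
  { group: 'b', numerator: 9, denominator: 10 },
];

// What `a / b` yields under integer division: the fractional part is dropped.
function integerDivide(a: number, b: number): number {
  return Math.trunc(a / b);
}

// Macro average, AVG(group_metric): every group weighs the same.
function macroAverage(rows: GroupStat[]): number {
  const rates = rows.map((r: GroupStat) => r.numerator / r.denominator);
  return rates.reduce((sum: number, r: number) => sum + r, 0) / rates.length;
}

// Micro average, SUM(numerator) / SUM(denominator): larger groups weigh more.
function microAverage(rows: GroupStat[]): number {
  const num = rows.reduce((sum: number, r: GroupStat) => sum + r.numerator, 0);
  const den = rows.reduce((sum: number, r: GroupStat) => sum + r.denominator, 0);
  return num / den;
}

console.log(integerDivide(1, 2));            // 0: a 50% rate floors to zero
console.log(macroAverage(stats).toFixed(4)); // "0.7000"
console.log(microAverage(stats).toFixed(4)); // "0.8333"
```

The two averages differ here because group `b` has five times as many observations as group `a`; with equal group sizes they would coincide.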
|
||||
|
||||
**Answer completeness / interpretation**
|
||||
- **"Top / highest / most / lowest"** returns only the winning row(s) — keep the top-ranked row from the window result — not the full ranked list, unless the question asks for a list.
|
||||
- **"For each X / per X / by X"** returns exactly one row per X. Do not collapse to a single value unless the question says "overall" or "total across X".
|
||||
- **A named business measure means its amount, not a row count.** When a question asks for "sales", "revenue", "spend", "value", or "volume" of money/goods without an explicit "number / count of", aggregate the monetary/quantity **amount** (`SUM(price)` / `SUM(amount)`), not `COUNT(*)` of rows. *Why:* "toy sales" reads as sales revenue; counting order rows silently answers a different question.
|
||||
- **Answer literally — do not add unrequested transformations.** Apply exactly the filters, joins, grouping, and computation the question (and any `external_knowledge` doc) states; do not add "helpful" extras the task never asked for — extra status/category predicates, area/residential *weighting* of an average the question states plainly, entity-name *normalization* that forces joins the source leaves unmatched, or a re-derived value where the question names a specific stored measure/column. When the wording bounds an **aggregate** ("committees whose *total* is between $0 and $200", "entities with 5+ orders"), filter the aggregate with `HAVING`, not each row with `WHERE`. When an `external_knowledge` doc gives an explicit formula or function/UDF definition, implement it **verbatim** — same operators, constants, and ordering — rather than substituting your own "more correct" math. *Why:* each unrequested predicate silently drops valid rows, each unrequested weighting/normalization or re-derivation changes the value, and a row-level filter for an aggregate bound answers a different question — so a more-sophisticated-looking query is wrong against the literal ask. Prefer the simplest reading that satisfies the question.
|
||||
- **Don't project free-text columns the question didn't ask for.** A description/body/comment/notes column whose values contain commas or newlines corrupts the row-delimited output and is almost never the requested value — leave it out of the final projection unless the question explicitly asks for it.
|
||||
- **"Inter-event duration / gap / interval" is the time between consecutive events, not a magnitude.** When the question asks the typical gap/interval/time *between* occurrences (releases, visits, orders), order rows by the event timestamp and take `LEAD`/`LAG` date differences, then aggregate — never a duration/length/runtime *column*.
|
||||
- **Anchor a period bucket to the lifecycle event the wording names.** When a record carries several lifecycle timestamps (created/placed, approved, shipped, delivered, completed, settled) and the question counts/measures records in a *named completed state* by period ("delivered orders by month", "shipped items per week", "completed payments by day"), bucket the period by that named event's own timestamp (`order_delivered_customer_date`, `shipped_at`, `settled_at`) — the state value is the qualifying filter, the matching timestamp is the time anchor. Use the creation/placed/purchased/submitted timestamp only when the question names that *start* event (purchased, placed, created, ordered, submitted) or no matching event timestamp exists. If several timestamps fit, pick the one for the event as experienced by the question's subject (customer delivery = the customer-receipt date, not the carrier-handoff or estimated date). If the named state is used only as a non-temporal filter (counts by customer/city/seller with no period bucket), it is just a filter — introduce no date anchor. Confirm each timestamp's meaning from column names, semantic-layer descriptions, and sample rows first. *Why:* bucketing a completed-state count by the record's creation date silently answers a different question — "records that later reached that state, grouped by when they started" — than the one asked.
|
||||
- **"Highest / most across several achievements" aggregates per metric over the whole history.** When a question asks for top values across multiple metrics or a career/lifetime total ("most runs, most wickets, longest span"), emit one row per metric with that metric summed/maxed over all the entity's records — not a single top-season or top-row snapshot.
|
||||
- **An aggregate scoped to a per-entity selected set is computed across that set.** "The average revenue per actor **in those top-3 films**", "the mean order value over each customer's **last 5 orders**" means, per entity, the aggregate over the items it selected — one value per entity spanning its chosen items — NOT the per-item value. The per-item formula the question gives ("divide film revenue among its actors") computes each item's contribution; the average/total then spans the selected items. When the question states both a per-item computation AND an aggregate over the items, compute and project BOTH (the per-item value and the across-set aggregate, e.g. `AVG(item_value) OVER (PARTITION BY entity)`). The set is chosen by the ranking measure the question names — "top-N **revenue-generating** films" ranks each entity's items by the item's **own total revenue** — and that ranking is independent of the per-item value (the share), which feeds only the aggregate, never the top-N selection.
|
||||
- **Coverage over a selected group is a set-membership aggregate (one value for the whole group), not a per-entity metric.** When a question first selects a group of entities ("the top 5 actors", "these products", "the eligible stores") and then asks what count/share/percentage of a **different** subject domain has any relationship to *these* selected entities ("what % of **customers** rented films featuring these actors"), the subject set is the **UNION across the whole group**: select the entity ids in a CTE, join to the subject facts, `COUNT(DISTINCT subject_id)` **once** across the group, and return one aggregate at the subject-domain grain (with the numerator/denominator projected if the question states a ratio). Counting the subject per selected entity and reporting N rows answers a different question and double-counts subjects that relate to more than one entity. This is the **collective-coverage** cousin of the per-entity rule above: emit one row per selected entity **only** when the wording says "for each / per / by / list" or asks for each entity's *own* metric ("top 5 players **and their** batting averages"); a bare "what share … of these" is one collective value.
|
||||
- **Complete the panel for "each / every / all / per <period or category>".** These cues mean the answer's rows should be the *full expected domain* — every month in the asked range, every region in the dimension — not only the groups that happen to have fact rows; a plain inner `GROUP BY` emits only non-empty groups, so empty periods/categories silently drop and a "12 months" answer comes back short. Build the full set of groups (the **spine**), `LEFT JOIN` the aggregated facts onto it, then default the gaps:
|
||||
- **Spine source.** For a category, take the distinct domain from the **dimension/entity table** (e.g. every region from `regions`) — not `SELECT DISTINCT` over the facts, which can only list categories that already occur; with no dimension table, distinct values from the *unfiltered* facts are the best available domain. For a period or number range, generate the series across the question's stated range (when the range is "all periods present", derive its bounds from `MIN`/`MAX` over the *unfiltered* facts). Series syntax is engine-specific — get the series/calendar idiom from `sql_dialect_notes` rather than inlining one dialect's generator.
|
||||
- **Default by additivity.** `COALESCE(metric, 0)` only for **additive** measures (a `COUNT`/`SUM` of events or amounts, where "no activity" genuinely reads as 0); leave **non-additive** measures (`AVG`, a rate, a ratio, a price, a running balance) as `NULL` — absence is "no data", and 0 would be a wrong reading.
|
||||
- **Don't over-apply.** *each / every / all* wants the complete domain; *which / that have* ("which months had orders") wants only the groups that exist — there the spine is wrong, so emit observed groups only.
|
||||
- **Selecting the extreme group needs the spine too.** When you pick the group with the highest/lowest count or total over a period/category domain ("the month with the **lowest** number of active customers", "the region with the **fewest** orders"), rank over the COMPLETE spine, not only groups that have fact rows — an empty period/category is a genuine 0 and is frequently the true minimum, yet ranking over observed groups alone silently makes it unselectable and returns the wrong extreme. A period with NO rows at all never appears in a `GROUP BY` of the facts: generate the full calendar of the stated range first ("each month of 2020" → all 12 months, even if only 4 have transactions), `LEFT JOIN` the per-group aggregates, `COALESCE` the count to 0, and only THEN rank — otherwise a zero-activity month that is the true lowest is invisible to the ranking.
|
||||
- **Answer every requested output.** When a question asks for several things — a list ("A, B, and C"), paired extremes ("the highest *and* the lowest"), or a value plus its components ("X, Y, and their ratio") — the projection needs one column per requested output, not just the first clause. *Why:* answering only the first clause is the most common way a runnable query is still wrong — the grain and methodology can be perfect yet the answer is short by columns. This is the umbrella over the next two rules: *keep the inputs* is its "value + components" case and *expose identity* is its "entity identity" case, so a **complete projection** is exactly every requested metric/attribute, plus the identifier of each named entity, plus the inputs to each derived value, at the question's grain. It governs *which columns* appear — distinct from *Top …* and *For each X* above, which govern *which rows* — and composes with them ("highest and lowest per region" needs one row per region and a column per clause).
|
||||
- **Keep the inputs to a derived value.** When the question asks for inputs and something derived from them ("X, Y, and their ratio"), project the inputs as columns alongside the derived value.
|
||||
- **A comparison BETWEEN two specific extremes is one wide row.** When the question asks for a single value derived by comparing two named extremes — "the **difference between** the highest and the lowest month", "the ratio of the best to the worst" — present BOTH extremes side by side in ONE row: each extreme's attributes as their own columns (e.g. `highest_month`, `highest_value`, `lowest_month`, `lowest_value`) plus the comparison as a column (`difference`). The comparison is a single fact about the pair, so the answer is one wide row — NOT one row per extreme with the comparison repeated. (Contrast: "report a metric **for each** group/category" — e.g. "a percentage for each helmet group", "the top player for each outcome" — has no cross-item comparison and stays long, one row per group.)
|
||||
- **Project BOTH identity and label.** When the result is per-entity, project the entity's **identifier and its human-readable name together** — whichever you grouped by, add the other. The id disambiguates duplicate names, and a consumer may legitimately expect either; supplying both is the safe, complete choice (a per-entity answer that gives only one is a frequent cause of an otherwise-correct result not matching).
|
||||
- **Diagnose empty results.** When a result is unexpectedly empty, relax filters one at a time to find which predicate removed the rows instead of guessing.
|
||||
- **Spatial predicates ("within area / within N meters / inside this polygon / nearest").** When a question filters or relates rows by geography, use the engine's geospatial functions — get the exact ones from `sql_dialect_notes` — rather than hand-rolling latitude/longitude `BETWEEN` boxes (which are wrong off the equator and ignore polygon shape). Recipe: (1) turn each location into a geography point with the point constructor — **mind argument order, most take longitude before latitude**; (2) for an area of interest build a polygon from its boundary/corner coordinates, closing the ring (first point repeated last); (3) test the relation with the engine's containment (`contains`/`within`), proximity (`dwithin(g1,g2,meters)`), or overlap (`intersects`) predicate. For "the features within the same area as entity X", first resolve X's own geometry in a CTE, then join candidates on the spatial predicate against it. *Why:* spatial relationships are not axis-aligned ranges; the geodesic predicates are both correct and index-assisted, while a raw coordinate box silently includes/excludes the wrong rows.
|
||||
- **Collapse a multi-valued attribute to one representative per entity before counting classes or a concentration metric.** When an entity carries a multi-valued classification array (IPC/CPC codes, tags, categories) and the methodology counts *entities per class* or computes a concentration/diversity measure (HHI, originality, a share), pick exactly **one representative value per entity** in a CTE first — use the array's `main`/`primary`/`first` flag when present, else a defined fallback (e.g. the most-frequent value) — then aggregate. Equally, when a metric's denominator is defined as a count of **entities** ("the number of patents cited"), use `COUNT(DISTINCT entity)`, not the count of exploded array rows. *Why:* `LATERAL FLATTEN`/unnest of the array multiplies an entity's weight by how many codes it has, inflating per-class frequencies and skewing any concentration metric — the query runs but the ranking/score is wrong. (Take the representative rule from the methodology/`external_knowledge` doc when it specifies one; do not invent a selection the source does not state.)
|
||||
- **Final completeness check.** Before emitting the final SQL, re-read the question and confirm the projection covers: (1) every named **metric / attribute** asked for (→ *answer every requested output*); (2) the **identifier** of each grouped or named entity (→ *expose identity*); (3) every **input** to each derived value (→ *keep the inputs*); (4) all at the **grain** the question specifies (→ *for each X* / *complete the panel*). Run this on every query, not only when a result looks off. **Don't over-project:** anything outside that set — a column the question never asked for, added "to be safe" — adds noise, misleads the reader into thinking it matters, and makes the result harder to consume. Match the request exactly: neither short nor padded.
|
||||
|
||||
```sql
-- "How many orders per region, including regions with no orders?" Every region
-- must appear, even one with zero orders.
-- WRONG: grouping the facts can only emit regions that have at least one order,
-- so a zero-order region silently drops and the panel comes back short a row.
SELECT region_id, COUNT(*) AS n_orders
FROM orders
GROUP BY region_id;

-- RIGHT: start from the full region domain (the dimension table), LEFT JOIN the
-- per-region counts onto it, and COALESCE the additive count to 0 so empty
-- regions read 0 instead of vanishing.
WITH region_domain AS (
  SELECT DISTINCT region_id FROM regions
),
region_orders AS (
  SELECT region_id, COUNT(*) AS n_orders
  FROM orders
  GROUP BY region_id
)
SELECT d.region_id, COALESCE(ro.n_orders, 0) AS n_orders
FROM region_domain d
LEFT JOIN region_orders ro ON ro.region_id = d.region_id;
```
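
The spine, `LEFT JOIN`, and `COALESCE` steps in the query above can be mirrored in a few lines of TypeScript. This is only a sketch under assumptions: the `Region` and `Order` shapes and the sample data are hypothetical, and region ids are taken to be unique in the dimension list.

```typescript
// Hypothetical row shapes; the real schema is not shown in this document.
interface Region {
  regionId: number;
}
interface Order {
  orderId: number;
  regionId: number;
}

function ordersPerRegion(
  regions: Region[],
  orders: Order[],
): Array<{ regionId: number; nOrders: number }> {
  // GROUP BY region_id over the facts: only regions with orders get an entry.
  const counts = new Map<number, number>();
  for (const o of orders) {
    counts.set(o.regionId, (counts.get(o.regionId) ?? 0) + 1);
  }
  // Spine LEFT JOIN counts, with COALESCE(n_orders, 0) for regions without a match.
  return regions.map((r: Region) => ({
    regionId: r.regionId,
    nOrders: counts.get(r.regionId) ?? 0,
  }));
}

// Illustrative data: region 2 has no orders.
const regions: Region[] = [{ regionId: 1 }, { regionId: 2 }, { regionId: 3 }];
const orders: Order[] = [
  { orderId: 10, regionId: 1 },
  { orderId: 11, regionId: 1 },
  { orderId: 12, regionId: 3 },
];

console.log(ordersPerRegion(regions, orders));
// region 1 -> 2, region 2 -> 0, region 3 -> 1
```

Iterating over `regions` rather than over the keys of `counts` is the whole point: the output domain comes from the dimension, not from the facts.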
|
||||
|
||||
```sql
-- "For each region, report the highest and the lowest monthly order count and the
-- difference between them." A complete answer is five columns: the region's id and
-- name, the highest, the lowest, and their difference.
-- WRONG: answers only the first clause and drops the region id, the lowest, and the
-- difference, so three of the five requested columns are missing.
SELECT region_name, MAX(monthly_orders) AS highest
FROM region_monthly
GROUP BY region_name;

-- RIGHT: one column per requested output plus the entity's identity, at the region
-- grain: id and name, the highest, the lowest, and their difference.
SELECT r.region_id, r.region_name,
       MAX(rm.monthly_orders) AS highest,
       MIN(rm.monthly_orders) AS lowest,
       MAX(rm.monthly_orders) - MIN(rm.monthly_orders) AS order_count_range
FROM regions r
JOIN region_monthly rm ON rm.region_id = r.region_id
GROUP BY r.region_id, r.region_name;
```
|
||||
</sql_craft>
|
||||
|
||||
<examples>
|
||||
**Input:** "How many orders did Acme Corp place last month?"
|
||||
|
||||
|
|
|
|||
|
|
@ -112,6 +112,30 @@ All three fields use REPLACE semantics on update:
|
|||
- Pass `[]` → field is cleared.
|
||||
- Pass `[values]` → replaces existing with exactly those values (no merging).
|
||||
|
||||
## Connection scoping
|
||||
|
||||
A project may have several databases whose schemas reuse the same concept names
|
||||
(two warehouses each with `orders`, `customers`, …). The `connections`
|
||||
frontmatter field keeps database-specific pages from polluting searches about
|
||||
other databases.
|
||||
|
||||
- The `wiki_write` tool accepts a `connections` field (list of connection ids,
|
||||
same REPLACE semantics as `tags`). Absent or empty ⇒ the page is **unscoped**
|
||||
and applies to every connection.
|
||||
- When this ingest/turn is scoped to a connection (its id appears in the prompt
|
||||
context — e.g. `connectionId: warehouse` in the SL Sources header or the
|
||||
`<context>` block), set `connections: [<that id>]` on pages whose content is
|
||||
**specific to that database** ("in this warehouse `user_id` is the device id,
|
||||
not the account id"). Pair this with a connection-distinctive key so two
|
||||
databases' same-concept pages can coexist: `orders_sales_db`, not `orders`.
|
||||
- Leave `connections` empty for clearly **org-wide** knowledge ("fiscal year
|
||||
starts in February") so it stays visible everywhere. Do not scope a page to a
|
||||
connection just because the turn happened to be connection-scoped.
|
||||
- Keys are still a flat, global namespace; `connections` does not namespace
|
||||
them. A connection-scoped write whose key already belongs to a page scoped to
|
||||
a *different* connection is rejected to prevent silently overwriting it — pick
|
||||
a connection-distinctive key instead.
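
The scoping rules above can be summarized in code. The sketch below is illustrative only: `WikiPageMeta`, `pagesForConnection`, and `collidesWithOtherConnection` are hypothetical names, not the ktx implementation, and treating an unscoped existing page as "no collision" is an assumption the text does not spell out.

```typescript
// Hypothetical page metadata; the real ktx types are not shown here.
interface WikiPageMeta {
  key: string;
  connections?: string[]; // absent or empty means unscoped
}

function isUnscoped(page: WikiPageMeta): boolean {
  return (page.connections ?? []).length === 0;
}

// Search pool for one connection: unscoped pages plus pages scoped to it.
function pagesForConnection(pages: WikiPageMeta[], connectionId: string): WikiPageMeta[] {
  return pages.filter(
    (p: WikiPageMeta) => isUnscoped(p) || (p.connections ?? []).indexOf(connectionId) !== -1,
  );
}

// Data-loss guard: a connection-scoped write is rejected when the key already
// belongs to a page scoped to a disjoint set of connections.
// Assumption: an unscoped existing page or an unscoped write is not a collision.
function collidesWithOtherConnection(
  existing: WikiPageMeta | undefined,
  incomingConnections: string[],
): boolean {
  if (existing === undefined || incomingConnections.length === 0 || isUnscoped(existing)) {
    return false;
  }
  const shared = (existing.connections ?? []).some(
    (id: string) => incomingConnections.indexOf(id) !== -1,
  );
  return !shared;
}

// Illustrative pages using connection-distinctive keys.
const samplePages: WikiPageMeta[] = [
  { key: 'orders_sales_db', connections: ['sales_db'] },
  { key: 'orders_ops_db', connections: ['ops_db'] },
  { key: 'fiscal_year' },
];

console.log(pagesForConnection(samplePages, 'sales_db').map((p) => p.key));
// ['orders_sales_db', 'fiscal_year']
```

Because keys stay a flat namespace, the guard compares connection sets rather than keys: the same key with an overlapping connection is an ordinary update, while a disjoint one is refused.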
|
||||
|
||||
## Editing existing pages
|
||||
|
||||
Two modes:
|
||||
|
|
|
|||
|
|
@ -9,6 +9,7 @@ import { runCodexAuthProbe } from './context/llm/codex-runtime.js';
|
|||
import type { KtxConfigIssue, KtxProjectConfig, KtxProjectConnectionConfig, KtxProjectEmbeddingConfig, KtxProjectLlmConfig } from './context/project/config.js';
|
||||
import type { KtxLocalProject } from './context/project/project.js';
|
||||
import { ktxLocalStateDbPath } from './context/project/local-state-db.js';
|
||||
import { listReferencedConnectionIds } from './context/wiki/local-knowledge.js';
|
||||
import {
|
||||
isQueryHistoryEnabled,
|
||||
queryHistoryDialectForConnection,
|
||||
|
|
@ -109,6 +110,7 @@ interface LocalStatsIngestPerConnection {
|
|||
connectionId: string;
|
||||
adapter: string;
|
||||
lastCompletedAt: string;
|
||||
skippedObjects: Array<{ name: string; reason: string }>;
|
||||
}
|
||||
|
||||
interface LocalStatsSemanticLayerEntry {
|
||||
|
|
@ -581,6 +583,29 @@ function buildStorageStatus(config: KtxProjectConfig): StorageStatus {
|
|||
};
|
||||
}
|
||||
|
||||
/**
|
||||
* Warn (never fail) when stored wiki pages reference connection ids that are no
|
||||
* longer in `ktx.yaml`. Config and page content evolve independently, so a
|
||||
* dangling reference is a soft condition — the pages still load, search, and
|
||||
* read; it just signals a typo or a removed connection.
|
||||
*/
|
||||
async function buildUnknownConnectionWarning(project: KtxLocalProject): Promise<WarningItem | null> {
|
||||
let referenced: string[];
|
||||
try {
|
||||
referenced = await listReferencedConnectionIds(project);
|
||||
} catch {
|
||||
return null;
|
||||
}
|
||||
const unknown = referenced.filter((id) => !Object.hasOwn(project.config.connections, id));
|
||||
if (unknown.length === 0) {
|
||||
return null;
|
||||
}
|
||||
return {
|
||||
message: `Wiki pages reference connection id(s) not in ktx.yaml: ${unknown.join(', ')}. Those pages still load and search.`,
|
||||
fix: 'Add the connection(s) via `ktx setup`, or update the pages’ `connections` frontmatter.',
|
||||
};
|
||||
}
|
||||
|
||||
function buildWarnings(
|
||||
config: KtxProjectConfig,
|
||||
connections: ConnectionStatus[],
|
||||
|
|
@ -782,6 +807,20 @@ function tryQuery<T>(run: () => T, fallback: T): T {
|
|||
}
|
||||
}
|
||||
|
||||
function skippedObjectsFromReportBody(bodyJson: string): Array<{ name: string; reason: string }> {
|
||||
try {
|
||||
const body = JSON.parse(bodyJson) as { fetch?: { skipped?: Array<{ entityId?: unknown; message?: unknown }> } };
|
||||
const skipped = body.fetch?.skipped;
|
||||
if (!Array.isArray(skipped)) return [];
|
||||
return skipped.map((issue) => ({
|
||||
name: typeof issue.entityId === 'string' && issue.entityId.length > 0 ? issue.entityId : 'object',
|
||||
reason: typeof issue.message === 'string' ? issue.message : 'introspection failed',
|
||||
}));
|
||||
} catch {
|
||||
return [];
|
||||
}
|
||||
}
|
||||
|
||||
/** @internal */
|
||||
export async function buildLocalStatsStatus(project: KtxLocalProject): Promise<LocalStatsStatus> {
|
||||
const dbPath = ktxLocalStateDbPath(project);
|
||||
|
|
@ -819,17 +858,19 @@ export async function buildLocalStatsStatus(project: KtxLocalProject): Promise<L
|
|||
       0,
     );

+    type IngestStatsRow = { connection_id: string; adapter: string; last_completed_at: string; body_json: string };
     const ingestRows = tryQuery(
       () =>
+        // SQLite returns body_json from the MAX(completed_at) row for each group.
         db
           .prepare(
-            `SELECT connection_id, adapter, MAX(completed_at) AS last_completed_at
+            `SELECT connection_id, adapter, MAX(completed_at) AS last_completed_at, body_json
                FROM local_ingest_reports
                WHERE status = 'done'
                GROUP BY connection_id, adapter`,
           )
-          .all() as Array<{ connection_id: string; adapter: string; last_completed_at: string }>,
-      [] as Array<{ connection_id: string; adapter: string; last_completed_at: string }>,
+          .all() as IngestStatsRow[],
+      [] as IngestStatsRow[],
     );
     const perConnectionMap = new Map<string, LocalStatsIngestPerConnection>();
     for (const row of ingestRows) {
|
|
@ -839,6 +880,7 @@ export async function buildLocalStatsStatus(project: KtxLocalProject): Promise<L
|
|||
connectionId: row.connection_id,
|
||||
adapter: row.adapter,
|
||||
lastCompletedAt: row.last_completed_at,
|
||||
skippedObjects: skippedObjectsFromReportBody(row.body_json),
|
||||
});
|
||||
}
|
||||
}
|
||||
|
|
@ -953,6 +995,10 @@ export async function buildProjectStatus(project: KtxLocalProject, options: Buil
|
|||
const queryHistory = await buildQueryHistoryStatus(project, options);
|
||||
const pipeline = buildPipelineStatus(config);
|
||||
const warnings = buildWarnings(config, connections, llm, embeddings);
|
||||
const unknownConnectionWarning = await buildUnknownConnectionWarning(project);
|
||||
if (unknownConnectionWarning) {
|
||||
warnings.push(unknownConnectionWarning);
|
||||
}
|
||||
const localStats = await buildLocalStatsStatus(project);
|
||||
const { verdict, reason, nextActions } = buildVerdict(llm, embeddings, connections, queryHistory, warnings);
|
||||
|
||||
|
|
@ -1084,6 +1130,14 @@ function renderLocalStats(
|
|||
lines.push(
|
||||
` ${entry.connectionId.padEnd(nameWidth)} ${dim(entry.adapter.padEnd(adapterWidth))} ${dim(`last ${formatRelativeFromNow(entry.lastCompletedAt)}`)}`,
|
||||
);
|
||||
if (entry.skippedObjects.length > 0) {
|
||||
const first = entry.skippedObjects[0]!;
|
||||
const extra = entry.skippedObjects.length - 1;
|
||||
const detail = `${first.name}: ${first.reason}${extra > 0 ? ` (+${extra} more)` : ''}`;
|
||||
lines.push(
|
||||
` ${' '.repeat(nameWidth)} ${dim(`${entry.skippedObjects.length} object${entry.skippedObjects.length === 1 ? '' : 's'} skipped — ${detail}`)}`,
|
||||
);
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
|
|
|
|||
|
|
@ -8,6 +8,13 @@ import type { KtxCliIo } from './cli-runtime.js';
|
|||
import { createRepainter, initViewState, renderContextBuildView, type ContextBuildTargetState } from './context-build-view.js';
|
||||
import { formatDuration } from './demo-metrics.js';
|
||||
import type { KtxPublicIngestPlanTarget } from './public-ingest.js';
|
||||
import {
|
||||
createLocalProjectVerbatimIngestor,
|
||||
type VerbatimIngestItem,
|
||||
type VerbatimIngestOrigin,
|
||||
type VerbatimIngestorPort,
|
||||
type VerbatimIngestResult,
|
||||
} from './verbatim-ingest.js';
|
||||
|
||||
export interface KtxTextIngestArgs {
|
||||
projectDir: string;
|
||||
|
|
@ -17,6 +24,8 @@ export interface KtxTextIngestArgs {
|
|||
userId: string;
|
||||
json: boolean;
|
||||
failFast: boolean;
|
||||
/** Code-driven verbatim ingest: store the document body unchanged, LLM derives metadata only. */
|
||||
verbatim?: boolean;
|
||||
}
|
||||
|
||||
/** @internal */
|
||||
|
|
@ -29,6 +38,7 @@ export interface TextMemoryIngestPort {
|
|||
interface TextIngestItem {
|
||||
label: string;
|
||||
content: string;
|
||||
origin: VerbatimIngestOrigin;
|
||||
}
|
||||
|
||||
interface TextIngestResult {
|
||||
|
|
@ -43,6 +53,7 @@ interface TextIngestResult {
|
|||
export interface KtxTextIngestDeps {
|
||||
loadProject?: (options: { projectDir: string }) => Promise<KtxLocalProject>;
|
||||
createMemoryIngest?: (project: KtxLocalProject) => TextMemoryIngestPort;
|
||||
createVerbatimIngestor?: (project: KtxLocalProject) => VerbatimIngestorPort;
|
||||
readFile?: (path: string) => Promise<string>;
|
||||
readStdin?: () => Promise<string>;
|
||||
now?: () => number;
|
||||
|
|
@ -55,6 +66,10 @@ function defaultCreateMemoryIngest(project: KtxLocalProject): TextMemoryIngestPo
|
|||
return createLocalProjectMemoryIngest(project);
|
||||
}
|
||||
|
||||
function defaultCreateVerbatimIngestor(project: KtxLocalProject): VerbatimIngestorPort {
|
||||
return createLocalProjectVerbatimIngestor(project);
|
||||
}
|
||||
|
||||
async function defaultReadStdin(): Promise<string> {
|
||||
const chunks: string[] = [];
|
||||
process.stdin.setEncoding('utf-8');
|
||||
|
|
@ -129,17 +144,17 @@ async function loadItems(args: KtxTextIngestArgs, deps: KtxTextIngestDeps): Prom
|
|||
args.texts.forEach((content, index) => {
|
||||
const label = textLabel(content, index, usedTextLabels);
|
||||
usedTextLabels.add(label);
|
||||
items.push({ label, content });
|
||||
items.push({ label, content, origin: { kind: 'text' } });
|
||||
});
|
||||
|
||||
const readFile = deps.readFile ?? defaultReadFile;
|
||||
const readStdin = deps.readStdin ?? defaultReadStdin;
|
||||
for (const file of args.files) {
|
||||
if (file === '-') {
|
||||
items.push({ label: stdinLabel(items), content: await readStdin() });
|
||||
items.push({ label: stdinLabel(items), content: await readStdin(), origin: { kind: 'stdin' } });
|
||||
} else {
|
||||
const path = resolve(file);
|
||||
items.push({ label: basename(path), content: await readFile(path) });
|
||||
items.push({ label: basename(path), content: await readFile(path), origin: { kind: 'file', path } });
|
||||
}
|
||||
}
|
||||
|
||||
|
|
@ -175,13 +190,13 @@ function allTargets(state: ReturnType<typeof initViewState>): ContextBuildTarget
|
|||
return [...state.primarySources, ...state.contextSources];
|
||||
}
|
||||
|
||||
function renderTextIngestView(state: ReturnType<typeof initViewState>, styled: boolean): string {
|
||||
function renderTextIngestView(state: ReturnType<typeof initViewState>, styled: boolean, verbatim: boolean): string {
|
||||
return renderContextBuildView(state, {
|
||||
styled,
|
||||
title: 'Ingesting text memory',
|
||||
contextGroupLabel: 'Texts',
|
||||
sourceIngestRunningText: 'capturing...',
|
||||
completedItemName: { singular: 'text', plural: 'texts' },
|
||||
title: verbatim ? 'Writing verbatim pages' : 'Ingesting text memory',
|
||||
contextGroupLabel: verbatim ? 'Documents' : 'Texts',
|
||||
sourceIngestRunningText: verbatim ? 'writing...' : 'capturing...',
|
||||
completedItemName: verbatim ? { singular: 'page', plural: 'pages' } : { singular: 'text', plural: 'texts' },
|
||||
});
|
||||
}
|
||||
|
||||
|
|
@ -254,7 +269,9 @@ export async function runKtxTextIngest(
|
|||
}
|
||||
|
||||
const project = await (deps.loadProject ?? loadKtxProject)({ projectDir: args.projectDir });
|
||||
const memoryIngest = (deps.createMemoryIngest ?? defaultCreateMemoryIngest)(project);
|
||||
const isVerbatim = args.verbatim === true;
|
||||
const verbatimIngestor = isVerbatim ? (deps.createVerbatimIngestor ?? defaultCreateVerbatimIngestor)(project) : null;
|
||||
const memoryIngest = isVerbatim ? null : (deps.createMemoryIngest ?? defaultCreateMemoryIngest)(project);
|
||||
const now = deps.now ?? (() => Date.now());
|
||||
const batchId = now();
|
||||
const state = initViewState(items.map((item) => makeTarget(item.label)));
|
||||
|
|
@ -264,7 +281,7 @@ export async function runKtxTextIngest(
|
|||
const results: TextIngestResult[] = [];
|
||||
|
||||
state.startedAt = now();
|
||||
const paint = () => repainter?.paint(renderTextIngestView(state, true));
|
||||
const paint = () => repainter?.paint(renderTextIngestView(state, true, isVerbatim));
|
||||
paint();
|
||||
|
||||
let spinnerInterval: ReturnType<typeof setInterval> | null = null;
|
||||
|
|
@ -288,29 +305,50 @@ export async function runKtxTextIngest(
|
|||
const target = targets[index]!;
|
||||
target.status = 'running';
|
||||
target.startedAt = now();
|
||||
target.detailLine = 'capturing...';
|
||||
target.detailLine = isVerbatim ? 'writing...' : 'capturing...';
|
||||
target.progressUpdatedAtMs = target.startedAt;
|
||||
paint();
|
||||
|
||||
let runId: string | null = null;
|
||||
let result: TextIngestResult;
|
||||
try {
|
||||
const ingestInput: MemoryAgentInput = {
|
||||
userId: args.userId,
|
||||
chatId: `cli-text-ingest-${batchId}-${index + 1}`,
|
||||
userMessage: `Ingest external text artifact ${artifactReference(item.label)} into ktx memory.`,
|
||||
assistantMessage: item.content.trim(),
|
||||
...(args.connectionId ? { connectionId: args.connectionId } : {}),
|
||||
sourceType: 'external_ingest',
|
||||
};
|
||||
const ingest = await memoryIngest.ingest(ingestInput);
|
||||
runId = ingest.runId;
|
||||
await memoryIngest.waitForRun(runId);
|
||||
const status = await memoryIngest.status(runId);
|
||||
if (!status) {
|
||||
throw new Error(`Memory ingest run "${runId}" was not found.`);
|
||||
if (verbatimIngestor) {
|
||||
const verbatimItem: VerbatimIngestItem = {
|
||||
origin: item.origin,
|
||||
content: item.content,
|
||||
...(args.connectionId ? { connectionId: args.connectionId } : {}),
|
||||
};
|
||||
const outcome: VerbatimIngestResult = await verbatimIngestor.ingest(verbatimItem);
|
||||
result = {
|
||||
label: item.label,
|
||||
runId: null,
|
||||
status: 'done',
|
||||
captured: { wiki: [outcome.pageKey], sl: [], xrefs: [] },
|
||||
commitHash: outcome.commitHash,
|
||||
error: null,
|
||||
};
|
||||
} else {
|
||||
// memoryIngest is set whenever verbatim is off — they are mutually exclusive.
|
||||
if (!memoryIngest) {
|
||||
throw new Error('Memory ingest was not initialized.');
|
||||
}
|
||||
const ingestInput: MemoryAgentInput = {
|
||||
userId: args.userId,
|
||||
chatId: `cli-text-ingest-${batchId}-${index + 1}`,
|
||||
userMessage: `Ingest external text artifact ${artifactReference(item.label)} into ktx memory.`,
|
||||
assistantMessage: item.content.trim(),
|
||||
...(args.connectionId ? { connectionId: args.connectionId } : {}),
|
||||
sourceType: 'external_ingest',
|
||||
};
|
||||
const ingest = await memoryIngest.ingest(ingestInput);
|
||||
runId = ingest.runId;
|
||||
await memoryIngest.waitForRun(runId);
|
||||
const status = await memoryIngest.status(runId);
|
||||
if (!status) {
|
||||
throw new Error(`Memory ingest run "${runId}" was not found.`);
|
||||
}
|
||||
result = resultFromStatus(item.label, status);
|
||||
}
|
||||
result = resultFromStatus(item.label, status);
|
||||
} catch (error) {
|
||||
result = errorResult(item.label, runId, error);
|
||||
}
|
||||
|
|
@ -340,17 +378,18 @@ export async function runKtxTextIngest(
|
|||
if (args.json) {
|
||||
writeJsonResult(args, results, io);
|
||||
} else if (repainter) {
|
||||
repainter.paint(renderTextIngestView(state, true));
|
||||
repainter.paint(renderTextIngestView(state, true, isVerbatim));
|
||||
writePlainFailures(results, io);
|
||||
} else {
|
||||
io.stdout.write(renderTextIngestView(state, false));
|
||||
io.stdout.write(renderTextIngestView(state, false, isVerbatim));
|
||||
writePlainFailures(results, io);
|
||||
}
|
||||
|
||||
if (!args.json && results.length > 0) {
|
||||
const duration = state.totalElapsedMs > 0 ? ` in ${formatDuration(state.totalElapsedMs)}` : '';
|
||||
const outcome = results.some((result) => result.status === 'error') ? 'finished with failures' : 'finished';
|
||||
io.stdout.write(`Text memory ingest ${outcome}${duration}.\n`);
|
||||
const label = isVerbatim ? 'Verbatim ingest' : 'Text memory ingest';
|
||||
io.stdout.write(`${label} ${outcome}${duration}.\n`);
|
||||
}
|
||||
|
||||
return results.some((result) => result.status === 'error') ? 1 : 0;
|
||||
|
|
|
|||
308
packages/cli/src/verbatim-ingest.ts
Normal file
|
|
@ -0,0 +1,308 @@
|
|||
import { basename, extname, join } from 'node:path';
|
||||
import YAML from 'yaml';
|
||||
import { z } from 'zod';
|
||||
import { noopLogger } from './context/core/config.js';
|
||||
import { assertConfiguredConnectionId } from './context/connections/configured-connections.js';
|
||||
import { KtxIngestEmbeddingPortAdapter } from './context/llm/embedding-port.js';
|
||||
import { createLocalKtxEmbeddingProviderFromConfig, createLocalKtxLlmRuntimeFromConfig } from './context/llm/local-config.js';
|
||||
import type { KtxLlmRuntimePort } from './context/llm/runtime-port.js';
|
||||
import type { KtxProjectConnectionConfig } from './context/project/config.js';
|
||||
import type { KtxLocalProject } from './context/project/project.js';
|
||||
import { KnowledgeWikiService } from './context/wiki/knowledge-wiki.service.js';
|
||||
import { suggestFlatWikiKey } from './context/wiki/keys.js';
|
||||
import { SqliteKnowledgeIndex } from './context/wiki/sqlite-knowledge-index.js';
|
||||
import type { WikiFrontmatter } from './context/wiki/types.js';
|
||||
import type { KtxEmbeddingProvider } from './llm/types.js';
|
||||
|
||||
const LOCAL_AUTHOR = 'ktx';
|
||||
const LOCAL_AUTHOR_EMAIL = 'ktx@example.com';
|
||||
|
||||
/** Only the prefix sent to the LLM for metadata is clipped; the stored body is never clipped. */
|
||||
const METADATA_CLIP_LENGTH = 48_000;
|
||||
|
||||
const VERBATIM_METADATA_SYSTEM_PROMPT = [
|
||||
'You generate search metadata for an authoritative document that ktx stores verbatim.',
|
||||
'You never rewrite, summarize into, or alter the document body — you only describe it.',
|
||||
'Return a concise one- or two-sentence summary, a few topical tags, and any semantic-layer',
|
||||
'source names the document is clearly about. Use empty arrays when none apply.',
|
||||
].join(' ');
|
||||
|
||||
const verbatimMetadataSchema = z.object({
|
||||
summary: z.string().min(1).describe('A one- or two-sentence description of what the document defines or specifies.'),
|
||||
tags: z.array(z.string()).default([]).describe('Short topical keywords that aid lexical and semantic recall.'),
|
||||
sl_refs: z
|
||||
.array(z.string())
|
||||
.default([])
|
||||
.describe('Semantic-layer source names the document is clearly about, if any are evident.'),
|
||||
});
|
||||
|
||||
type VerbatimMetadata = z.infer<typeof verbatimMetadataSchema>;
|
||||
|
||||
export interface VerbatimIngestOrigin {
|
||||
kind: 'file' | 'text' | 'stdin';
|
||||
/** Present only for `kind: 'file'`; the resolved path the key basename is derived from. */
|
||||
path?: string;
|
||||
}
|
||||
|
||||
const DEGRADED_SUMMARY_MAX_LENGTH = 200;
|
||||
const FRONTMATTER_PATTERN = /^---\n([\s\S]*?)\n---\n?([\s\S]*)$/;
|
||||
const HEADING_PATTERN = /^#{1,6}\s+(.+?)\s*#*\s*$/;
|
||||
|
||||
type UsageMode = WikiFrontmatter['usage_mode'];
|
||||
|
||||
function isUsageMode(value: unknown): value is UsageMode {
|
||||
return value === 'always' || value === 'auto' || value === 'never';
|
||||
}
|
||||
|
||||
function nonEmptyString(value: unknown): string | undefined {
|
||||
return typeof value === 'string' && value.trim().length > 0 ? value : undefined;
|
||||
}
|
||||
|
||||
function stringArray(value: unknown): string[] {
|
||||
return Array.isArray(value) ? value.filter((item): item is string => typeof item === 'string') : [];
|
||||
}
|
||||
|
||||
/** `connections` accepts a single id or a list in YAML; normalize either to a string list. */
|
||||
function stringList(value: unknown): string[] {
|
||||
if (typeof value === 'string') {
|
||||
return value.trim().length > 0 ? [value] : [];
|
||||
}
|
||||
return stringArray(value);
|
||||
}
|
||||
|
||||
function leadingHeadingText(body: string): string | null {
|
||||
const firstLine = body.trimStart().split('\n', 1)[0] ?? '';
|
||||
const match = firstLine.match(HEADING_PATTERN);
|
||||
return match ? match[1].trim() : null;
|
||||
}
|
||||
|
||||
/** @internal */
|
||||
export function splitInputDocument(raw: string): { frontmatter: Record<string, unknown>; body: string } {
|
||||
const match = raw.match(FRONTMATTER_PATTERN);
|
||||
if (!match) {
|
||||
return { frontmatter: {}, body: raw.trim() };
|
||||
}
|
||||
const parsed = YAML.parse(match[1]) as unknown;
|
||||
const frontmatter =
|
||||
parsed !== null && typeof parsed === 'object' && !Array.isArray(parsed)
|
||||
? (parsed as Record<string, unknown>)
|
||||
: {};
|
||||
return { frontmatter, body: match[2].trim() };
|
||||
}
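A minimal sketch of the frontmatter split performed by `splitInputDocument`, assuming the same pattern as `FRONTMATTER_PATTERN`. To stay runnable without the `yaml` package it returns the raw YAML text instead of parsing it. `splitRaw` is a hypothetical name used only for illustration.

```typescript
// Same pattern as FRONTMATTER_PATTERN in the change above.
const FRONTMATTER = /^---\n([\s\S]*?)\n---\n?([\s\S]*)$/;

// Hypothetical helper: returns the unparsed YAML block (or null) and the trimmed body.
function splitRaw(raw: string): { rawFrontmatter: string | null; body: string } {
  const match = raw.match(FRONTMATTER);
  if (!match) {
    // No leading frontmatter fence: the whole input is the body.
    return { rawFrontmatter: null, body: raw.trim() };
  }
  return { rawFrontmatter: match[1] ?? '', body: (match[2] ?? '').trim() };
}

const withMeta = splitRaw('---\nsummary: Revenue rules\n---\n# Revenue\n\nNet of refunds.\n');
const withoutMeta = splitRaw('  # Revenue\n');
```

Because the pattern matches a literal `\n` after each `---` fence, input with `\r\n` line endings would fall through to the no-frontmatter branch in this sketch.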
|
||||
|
||||
/** @internal */
|
||||
export function deriveVerbatimPageKey(origin: VerbatimIngestOrigin, body: string): string {
|
||||
if (origin.kind === 'file' && origin.path) {
|
||||
return suggestFlatWikiKey(basename(origin.path, extname(origin.path)));
|
||||
}
|
||||
const heading = leadingHeadingText(body);
|
||||
if (!heading) {
|
||||
throw new Error(
|
||||
'Verbatim inline text needs a leading Markdown heading to derive a stable page key. Add a "# Heading" line, or pass the content as --file <path>.',
|
||||
);
|
||||
}
|
||||
return suggestFlatWikiKey(heading);
|
||||
}
|
||||
|
||||
/** @internal */
|
||||
export function deriveDegradedSummary(body: string): string {
|
||||
const heading = leadingHeadingText(body);
|
||||
if (heading) {
|
||||
return heading;
|
||||
}
|
||||
const text = body.trim();
|
||||
const sentence = text.match(/^([\s\S]*?[.!?])(\s|$)/);
|
||||
const summary = sentence ? sentence[1].trim() : text;
|
||||
if (summary.length <= DEGRADED_SUMMARY_MAX_LENGTH) {
|
||||
return summary;
|
||||
}
|
||||
return `${summary.slice(0, DEGRADED_SUMMARY_MAX_LENGTH).trimEnd()}…`;
|
||||
}
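The degraded (offline) summary falls back in three steps: leading heading, then first sentence, then a clipped prefix. The standalone sketch below copies that logic so the fallbacks can be tried in isolation. `leadingHeading` and `degradedSummary` are illustrative names, not the exported functions.

```typescript
const HEADING = /^#{1,6}\s+(.+?)\s*#*\s*$/;
const MAX_LENGTH = 200; // mirrors DEGRADED_SUMMARY_MAX_LENGTH

// Returns the text of a Markdown ATX heading on the first non-blank line, if any.
function leadingHeading(body: string): string | null {
  const firstLine = body.trimStart().split('\n', 1)[0] ?? '';
  const match = firstLine.match(HEADING);
  return match ? (match[1] ?? '').trim() : null;
}

function degradedSummary(body: string): string {
  const heading = leadingHeading(body);
  if (heading) {
    return heading;
  }
  const text = body.trim();
  // First run of text ending in ., ! or ? that is followed by whitespace or end of input.
  const sentence = text.match(/^([\s\S]*?[.!?])(\s|$)/);
  const summary = sentence ? (sentence[1] ?? '').trim() : text;
  return summary.length <= MAX_LENGTH ? summary : `${summary.slice(0, MAX_LENGTH).trimEnd()}…`;
}

const fromHeading = degradedSummary('# Revenue rules\n\nNet revenue excludes refunds.');
const fromSentence = degradedSummary('Net revenue excludes refunds. Gross does not.');
const clipped = degradedSummary('a'.repeat(250));
```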
|
||||
|
||||
/** @internal */
|
||||
export function buildVerbatimFrontmatter(input: {
|
||||
inputFrontmatter: Record<string, unknown>;
|
||||
summary: string;
|
||||
tags: string[];
|
||||
slRefs: string[];
|
||||
connectionId?: string;
|
||||
}): WikiFrontmatter & Record<string, unknown> {
|
||||
const { inputFrontmatter } = input;
|
||||
|
||||
const inputConnections = stringList(inputFrontmatter.connections);
|
||||
const flagConnections = input.connectionId ? [input.connectionId] : [];
|
||||
if (
|
||||
inputConnections.length > 0 &&
|
||||
flagConnections.length > 0 &&
|
||||
!connectionSetsEqual(inputConnections, flagConnections)
|
||||
) {
|
||||
throw new Error(
|
||||
`Connection scope conflict: frontmatter declares connections [${inputConnections.join(
|
||||
', ',
|
||||
)}] but --connection-id is "${input.connectionId}". Remove one so the intent is unambiguous.`,
|
||||
);
|
||||
}
|
||||
const connections = inputConnections.length > 0 ? inputConnections : flagConnections;
|
||||
|
||||
const summary = nonEmptyString(inputFrontmatter.summary) ?? input.summary;
|
||||
const usageMode = isUsageMode(inputFrontmatter.usage_mode) ? inputFrontmatter.usage_mode : 'auto';
|
||||
const tags = inputFrontmatter.tags !== undefined ? stringArray(inputFrontmatter.tags) : input.tags;
|
||||
const slRefs = inputFrontmatter.sl_refs !== undefined ? stringArray(inputFrontmatter.sl_refs) : input.slRefs;
|
||||
|
||||
const passthrough = Object.fromEntries(
|
||||
Object.entries(inputFrontmatter).filter(
|
||||
([key]) => !['summary', 'usage_mode', 'tags', 'sl_refs', 'connections'].includes(key),
|
||||
),
|
||||
);
|
||||
|
||||
return {
|
||||
...passthrough,
|
||||
summary,
|
||||
usage_mode: usageMode,
|
||||
...(tags.length > 0 ? { tags } : {}),
|
||||
...(slRefs.length > 0 ? { sl_refs: slRefs } : {}),
|
||||
...(connections.length > 0 ? { connections } : {}),
|
||||
} satisfies WikiFrontmatter & Record<string, unknown>;
|
||||
}
|
||||
|
||||
function connectionSetsEqual(left: string[], right: string[]): boolean {
|
||||
if (left.length !== right.length) {
|
||||
return false;
|
||||
}
|
||||
const rightSet = new Set(right);
|
||||
return left.every((id) => rightSet.has(id));
|
||||
}
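The scope precedence in `buildVerbatimFrontmatter` can be sketched on its own: frontmatter `connections` win, the flag fills in when frontmatter is silent, and a disagreement is a hard error. `resolveConnections` and `sameSet` are hypothetical names, and the error text is shortened for the example.

```typescript
function sameSet(left: string[], right: string[]): boolean {
  if (left.length !== right.length) {
    return false;
  }
  const rightSet = new Set(right);
  return left.every((id) => rightSet.has(id));
}

// Hypothetical helper mirroring the precedence rules in the change above.
function resolveConnections(fromFrontmatter: string[], flagConnectionId?: string): string[] {
  const fromFlag = flagConnectionId ? [flagConnectionId] : [];
  if (fromFrontmatter.length > 0 && fromFlag.length > 0 && !sameSet(fromFrontmatter, fromFlag)) {
    throw new Error(
      `Connection scope conflict: frontmatter declares [${fromFrontmatter.join(', ')}] but the flag is "${flagConnectionId}".`,
    );
  }
  // Frontmatter takes precedence; an empty result means the page is unscoped.
  return fromFrontmatter.length > 0 ? fromFrontmatter : fromFlag;
}

const flagOnly = resolveConnections([], 'warehouse');
const agreeing = resolveConnections(['warehouse'], 'warehouse');
const unscoped = resolveConnections([]);
```

The length check plus a one-directional membership test is sufficient here because the flag side holds at most one id; comparing two arbitrary lists that may contain duplicates would need a stricter check.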
|
||||
|
||||
export interface VerbatimIngestItem {
|
||||
origin: VerbatimIngestOrigin;
|
||||
content: string;
|
||||
connectionId?: string;
|
||||
}
|
||||
|
||||
export interface VerbatimIngestResult {
|
||||
pageKey: string;
|
||||
outcome: 'written' | 'unchanged';
|
||||
connections: string[];
|
||||
commitHash: string | null;
|
||||
}
|
||||
|
||||
export interface VerbatimIngestorPort {
|
||||
ingest(item: VerbatimIngestItem): Promise<VerbatimIngestResult>;
|
||||
}
|
||||
|
||||
export interface CreateLocalProjectVerbatimIngestorDeps {
|
||||
/** `undefined` ⇒ resolve from project config; `null` ⇒ force degraded (offline) metadata. */
|
||||
llmRuntime?: KtxLlmRuntimePort | null;
|
||||
embeddingProvider?: KtxEmbeddingProvider | null;
|
||||
}
|
||||
|
||||
class LocalVerbatimIngestor implements VerbatimIngestorPort {
|
||||
constructor(
|
||||
private readonly deps: {
|
||||
wikiService: KnowledgeWikiService;
|
||||
llmRuntime: KtxLlmRuntimePort | null;
|
||||
configuredConnections: Record<string, KtxProjectConnectionConfig>;
|
||||
author: string;
|
||||
authorEmail: string;
|
||||
},
|
||||
) {}
|
||||
|
||||
async ingest(item: VerbatimIngestItem): Promise<VerbatimIngestResult> {
|
||||
if (item.connectionId) {
|
||||
assertConfiguredConnectionId(this.deps.configuredConnections, item.connectionId);
|
||||
}
|
||||
|
||||
const { frontmatter: inputFrontmatter, body } = splitInputDocument(item.content);
|
||||
const pageKey = deriveVerbatimPageKey(item.origin, body);
|
||||
|
||||
const generated = await this.resolveMetadata(inputFrontmatter, body);
|
||||
const frontmatter = buildVerbatimFrontmatter({
|
||||
inputFrontmatter,
|
||||
summary: generated.summary,
|
||||
tags: generated.tags,
|
||||
slRefs: generated.slRefs,
|
||||
...(item.connectionId ? { connectionId: item.connectionId } : {}),
|
||||
});
|
||||
const connections = Array.isArray(frontmatter.connections) ? frontmatter.connections : [];
|
||||
|
||||
const existing = await this.deps.wikiService.readPage('GLOBAL', null, pageKey);
|
||||
if (existing) {
|
||||
if (existing.content === body) {
|
||||
return { pageKey, outcome: 'unchanged', connections, commitHash: null };
|
||||
}
|
||||
throw new Error(
|
||||
`A different page already exists at key "${pageKey}". Re-run with a distinct document name or key, ` +
|
||||
'or remove the existing page first — verbatim ingest never overwrites a conflicting page.',
|
||||
);
|
||||
}
|
||||
|
||||
const writeResult = await this.deps.wikiService.writePageAndSync(
|
||||
'GLOBAL',
|
||||
null,
|
||||
pageKey,
|
||||
frontmatter,
|
||||
body,
|
||||
this.deps.author,
|
||||
this.deps.authorEmail,
|
||||
`Ingest verbatim document: ${pageKey}`,
|
||||
);
|
||||
|
||||
return { pageKey, outcome: 'written', connections, commitHash: writeResult.commitHash ?? null };
|
||||
}
|
||||
|
||||
/**
|
||||
* Generated metadata is only used to gap-fill absent frontmatter fields, so the LLM is
|
||||
* skipped entirely when summary, tags, and sl_refs are all explicit. A configured backend
|
||||
* that fails surfaces the error (the item fails); degraded derivation is reserved for
|
||||
* `backend: none`, never used as a silent fallback that would poison the idempotency check.
|
||||
*/
|
||||
private async resolveMetadata(
|
||||
inputFrontmatter: Record<string, unknown>,
|
||||
body: string,
|
||||
): Promise<{ summary: string; tags: string[]; slRefs: string[] }> {
|
||||
const needsGeneration =
|
||||
nonEmptyString(inputFrontmatter.summary) === undefined ||
|
||||
inputFrontmatter.tags === undefined ||
|
||||
inputFrontmatter.sl_refs === undefined;
|
||||
|
||||
if (this.deps.llmRuntime && needsGeneration) {
|
||||
const clipped = body.length > METADATA_CLIP_LENGTH ? body.slice(0, METADATA_CLIP_LENGTH) : body;
|
||||
const generated = await this.deps.llmRuntime.generateObject<VerbatimMetadata, typeof verbatimMetadataSchema>({
|
||||
role: 'triage',
|
||||
system: VERBATIM_METADATA_SYSTEM_PROMPT,
|
||||
prompt: clipped,
|
||||
schema: verbatimMetadataSchema,
|
||||
});
|
||||
return { summary: generated.summary, tags: generated.tags, slRefs: generated.sl_refs };
|
||||
}
|
||||
|
||||
return { summary: deriveDegradedSummary(body), tags: [], slRefs: [] };
|
||||
}
|
||||
}
|
||||
|
||||
export function createLocalProjectVerbatimIngestor(
|
||||
project: KtxLocalProject,
|
||||
deps: CreateLocalProjectVerbatimIngestorDeps = {},
|
||||
): VerbatimIngestorPort {
|
||||
const llmRuntime =
|
||||
deps.llmRuntime !== undefined
|
||||
? deps.llmRuntime
|
||||
: createLocalKtxLlmRuntimeFromConfig(project.config.llm, { projectDir: project.projectDir });
|
||||
|
||||
const embeddingProvider =
|
||||
deps.embeddingProvider !== undefined
|
||||
? deps.embeddingProvider
|
||||
: createLocalKtxEmbeddingProviderFromConfig(project.config.ingest.embeddings, { projectDir: project.projectDir });
|
||||
const embeddingPort = embeddingProvider ? new KtxIngestEmbeddingPortAdapter(embeddingProvider) : null;
|
||||
|
||||
const knowledgeIndex = new SqliteKnowledgeIndex({ dbPath: join(project.projectDir, '.ktx', 'db.sqlite') });
|
||||
const wikiService = new KnowledgeWikiService(project.fileStore, embeddingPort, knowledgeIndex, project.git, noopLogger);
|
||||
|
||||
return new LocalVerbatimIngestor({
|
||||
wikiService,
|
||||
llmRuntime,
|
||||
configuredConnections: project.config.connections,
|
||||
author: LOCAL_AUTHOR,
|
||||
authorEmail: LOCAL_AUTHOR_EMAIL,
|
||||
});
|
||||
}
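The write path in `LocalVerbatimIngestor.ingest` reduces to a three-way decision on the existing page. A minimal sketch of that decision, assuming the existing page is represented only by its stored body (`decideWrite` is a hypothetical name):

```typescript
type WriteDecision = 'written' | 'unchanged';

// existingBody is null when no page exists at the key yet.
function decideWrite(pageKey: string, existingBody: string | null, incomingBody: string): WriteDecision {
  if (existingBody === null) {
    return 'written';
  }
  if (existingBody === incomingBody) {
    // Re-ingesting the same document is a no-op, so the command stays idempotent.
    return 'unchanged';
  }
  // Never overwrite a page whose body differs from the incoming document.
  throw new Error(`A different page already exists at key "${pageKey}".`);
}

const firstRun = decideWrite('revenue-rules', null, 'Net of refunds.');
const secondRun = decideWrite('revenue-rules', 'Net of refunds.', 'Net of refunds.');
```

In the change itself the comparison is made on the body only, so this sketch likewise ignores frontmatter differences.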
|
||||
117
packages/cli/test/commands/ingest-commands.test.ts
Normal file
|
|
@ -0,0 +1,117 @@
|
|||
import { Command } from '@commander-js/extra-typings';
|
||||
import { describe, expect, it, vi } from 'vitest';
|
||||
import type { KtxCliCommandContext } from '../../src/cli-program.js';
|
||||
import { parseEnrichmentStagesOption, registerIngestCommands } from '../../src/commands/ingest-commands.js';
|
||||
|
||||
function makeContext(overrides: Partial<KtxCliCommandContext> = {}): KtxCliCommandContext {
|
||||
let exitCode = 0;
|
||||
return {
|
||||
io: {
|
||||
stdout: { write: vi.fn() },
|
||||
stderr: { write: vi.fn() },
|
||||
},
|
||||
deps: {},
|
||||
packageInfo: { name: '@kaelio/ktx', version: '0.0.0-test' },
|
||||
setExitCode: (code: number) => {
|
||||
exitCode = code;
|
||||
},
|
||||
runInit: vi.fn(),
|
||||
writeDebug: vi.fn(),
|
||||
...overrides,
|
||||
get exitCode() {
|
||||
return exitCode;
|
||||
},
|
||||
} as unknown as KtxCliCommandContext;
|
||||
}
|
||||
|
||||
function ingestProgram(context: KtxCliCommandContext): Command {
|
||||
const program = new Command().exitOverride().option('--project-dir <path>');
|
||||
registerIngestCommands(program, context, { runTextIngest: vi.fn(async () => 0) });
|
||||
return program;
|
||||
}
|
||||
|
||||
describe('parseEnrichmentStagesOption', () => {
|
||||
it('parses a single stage', () => {
|
||||
expect(parseEnrichmentStagesOption('relationships')).toEqual(['relationships']);
|
||||
});
|
||||
|
||||
it('orders and de-duplicates by the canonical registry order', () => {
|
||||
expect(parseEnrichmentStagesOption('embeddings,descriptions')).toEqual(['descriptions', 'embeddings']);
|
||||
expect(parseEnrichmentStagesOption('relationships,relationships,descriptions')).toEqual([
|
||||
'descriptions',
|
||||
'relationships',
|
||||
]);
|
||||
});
|
||||
|
||||
it('tolerates surrounding whitespace and empty segments', () => {
|
||||
expect(parseEnrichmentStagesOption(' descriptions , , embeddings ')).toEqual(['descriptions', 'embeddings']);
|
||||
});
|
||||
|
||||
it('rejects an empty list', () => {
|
||||
expect(() => parseEnrichmentStagesOption('')).toThrow(/non-empty/);
|
||||
expect(() => parseEnrichmentStagesOption(' , ')).toThrow(/non-empty/);
|
||||
});
|
||||
|
||||
it('rejects an unknown stage name', () => {
|
||||
expect(() => parseEnrichmentStagesOption('foo')).toThrow(/unknown stage "foo"/);
|
||||
expect(() => parseEnrichmentStagesOption('descriptions,foo')).toThrow(/unknown stage "foo"/);
|
||||
});
|
||||
});
|
||||
|
||||
describe('ktx ingest --stages', () => {
|
||||
it('threads a parsed stage set into the public ingest args', async () => {
|
||||
const publicIngest = vi.fn(async (_args: unknown) => 0);
|
||||
const context = makeContext({ deps: { publicIngest } });
|
||||
const program = ingestProgram(context);
|
||||
|
||||
await program.parseAsync(
|
||||
['--project-dir', '/tmp/ktx', 'ingest', 'warehouse', '--stages', 'descriptions,embeddings'],
|
||||
{ from: 'user' },
|
||||
);
|
||||
|
||||
expect(publicIngest).toHaveBeenCalledTimes(1);
|
||||
expect(publicIngest.mock.calls[0]?.[0]).toMatchObject({
|
||||
command: 'run',
|
||||
targetConnectionId: 'warehouse',
|
||||
stages: ['descriptions', 'embeddings'],
|
||||
});
|
||||
});
|
||||
|
||||
it('omits stages entirely when the flag is absent (default = all)', async () => {
|
||||
const publicIngest = vi.fn(async (_args: unknown) => 0);
|
||||
const context = makeContext({ deps: { publicIngest } });
|
||||
const program = ingestProgram(context);
|
||||
|
||||
await program.parseAsync(['--project-dir', '/tmp/ktx', 'ingest', 'warehouse'], { from: 'user' });
|
||||
|
||||
expect(publicIngest).toHaveBeenCalledTimes(1);
|
||||
expect(publicIngest.mock.calls[0]?.[0]).not.toHaveProperty('stages');
|
||||
});
|
||||
|
||||
it('rejects an unknown stage with a clear parse error', async () => {
|
||||
const publicIngest = vi.fn(async (_args: unknown) => 0);
|
||||
const context = makeContext({ deps: { publicIngest } });
|
||||
const program = ingestProgram(context);
|
||||
|
||||
await expect(
|
||||
program.parseAsync(['--project-dir', '/tmp/ktx', 'ingest', 'warehouse', '--stages', 'foo'], { from: 'user' }),
|
||||
).rejects.toThrow(/unknown stage "foo"/);
|
||||
expect(publicIngest).not.toHaveBeenCalled();
|
||||
});
|
||||
|
||||
it('rejects --stages combined with text capture', async () => {
|
||||
const publicIngest = vi.fn(async (_args: unknown) => 0);
|
||||
const runTextIngest = vi.fn(async () => 0);
|
||||
const context = makeContext({ deps: { publicIngest } });
|
||||
const program = new Command().exitOverride().option('--project-dir <path>');
|
||||
registerIngestCommands(program, context, { runTextIngest });
|
||||
|
||||
await expect(
|
||||
program.parseAsync(['--project-dir', '/tmp/ktx', 'ingest', '--text', 'hi', '--stages', 'descriptions'], {
|
||||
from: 'user',
|
||||
}),
|
||||
).rejects.toThrow(/--stages applies to database ingest only/);
|
||||
expect(publicIngest).not.toHaveBeenCalled();
|
||||
expect(runTextIngest).not.toHaveBeenCalled();
|
||||
});
|
||||
});
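The `parseEnrichmentStagesOption` tests above pin down the parser's contract without showing its body. One way to satisfy them is sketched below. The registry order in `STAGE_ORDER` is an assumption (the tests only establish that `descriptions` sorts before both `relationships` and `embeddings`), and `parseStages` is a hypothetical stand-in, not the real implementation.

```typescript
// Assumed canonical order; the real registry lives in the ktx source and may differ.
const STAGE_ORDER = ['descriptions', 'relationships', 'embeddings'] as const;
type Stage = (typeof STAGE_ORDER)[number];

function parseStages(raw: string): Stage[] {
  // Tolerate surrounding whitespace and empty segments such as "a, ,b".
  const requested = raw
    .split(',')
    .map((part) => part.trim())
    .filter((part) => part.length > 0);
  if (requested.length === 0) {
    throw new Error('--stages expects a non-empty comma-separated list');
  }
  for (const name of requested) {
    if (!(STAGE_ORDER as readonly string[]).includes(name)) {
      throw new Error(`unknown stage "${name}"`);
    }
  }
  // Filtering the registry both de-duplicates and imposes the canonical order.
  return STAGE_ORDER.filter((stage) => requested.includes(stage));
}

const ordered = parseStages('embeddings,descriptions');
const deduped = parseStages('relationships,relationships,descriptions');
const tolerant = parseStages(' descriptions , , embeddings ');
```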
|
||||
|
|
@ -1,4 +1,5 @@
|
|||
import { describe, expect, it, vi } from 'vitest';
|
||||
import { KtxQueryError } from '../../../src/errors.js';
|
||||
import { bigQueryConnectionConfigFromConfig, isKtxBigQueryConnectionConfig, type KtxBigQueryClient, KtxBigQueryScanConnector, type KtxBigQueryClientFactory, type KtxBigQueryDataset, type KtxBigQueryQueryJob, type KtxBigQueryTableRef, prepareBigQueryReadOnlyQuery } from '../../../src/connectors/bigquery/connector.js';
|
||||
import { createBigQueryLiveDatabaseIntrospection } from '../../../src/connectors/bigquery/live-database-introspection.js';
|
||||
import { tableRefSet } from '../../../src/context/scan/table-ref.js';
|
||||
|
|
@ -114,11 +115,40 @@ describe('KtxBigQueryScanConnector', () => {
|
|||
expect(isKtxBigQueryConnectionConfig({ driver: 'mysql' })).toBe(false);
|
||||
expect(bigQueryConnectionConfigFromConfig({ connectionId: 'warehouse', connection })).toMatchObject({
|
||||
projectId: 'project-1',
|
||||
datasetIds: ['analytics'],
|
||||
datasetIds: [{ project: 'project-1', dataset: 'analytics' }],
|
||||
location: 'US',
|
||||
});
|
||||
});
|
||||
|
||||
it('parses project.dataset entries to host-project pairs and rejects malformed entries', () => {
|
||||
expect(
|
||||
bigQueryConnectionConfigFromConfig({
|
||||
connectionId: 'warehouse',
|
||||
connection: {
|
||||
driver: 'bigquery',
|
||||
dataset_ids: ['bigquery-public-data.austin_311', 'analytics'],
|
||||
credentials_json: JSON.stringify({ project_id: 'project-1' }),
|
||||
},
|
||||
}).datasetIds,
|
||||
).toEqual([
|
||||
{ project: 'bigquery-public-data', dataset: 'austin_311' },
|
||||
{ project: 'project-1', dataset: 'analytics' },
|
||||
]);
|
||||
|
||||
for (const badEntry of ['proj.ds.table', 'proj.', '.ds']) {
|
||||
expect(() =>
|
||||
bigQueryConnectionConfigFromConfig({
|
||||
connectionId: 'warehouse',
|
||||
connection: {
|
||||
driver: 'bigquery',
|
||||
dataset_ids: [badEntry],
|
||||
credentials_json: JSON.stringify({ project_id: 'project-1' }),
|
||||
},
|
||||
}),
|
||||
).toThrow(/connections\.warehouse/);
|
||||
}
|
||||
});
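The `dataset_ids` test above implies a small parsing rule: a bare `dataset` inherits the credentials' project, `project.dataset` names its own host project, and anything else is rejected with an error that mentions the connection. A hedged sketch of such a rule follows; `parseDatasetId`, `DatasetRef`, and the error wording are hypothetical, and only the accepted and rejected shapes come from the test.

```typescript
interface DatasetRef {
  project: string;
  dataset: string;
}

function parseDatasetId(entry: string, defaultProject: string, connectionId: string): DatasetRef {
  const [first, second, ...rest] = entry.split('.');
  if (first && second === undefined) {
    // Bare dataset id: hosted in the connection's own (billing) project.
    return { project: defaultProject, dataset: first };
  }
  if (first && second && rest.length === 0) {
    // Explicit host project, e.g. a public dataset billed to the local project.
    return { project: first, dataset: second };
  }
  throw new Error(
    `connections.${connectionId}.dataset_ids: expected "dataset" or "project.dataset", got "${entry}"`,
  );
}

const foreign = parseDatasetId('bigquery-public-data.austin_311', 'project-1', 'warehouse');
const local = parseDatasetId('analytics', 'project-1', 'warehouse');
```

Keeping the host project on every entry is what lets two datasets with the same name in different projects stay distinct downstream.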
|
||||
|
||||
it('introspects datasets, table metadata, primary keys, and normalized types', async () => {
|
||||
const connector = new KtxBigQueryScanConnector({
|
||||
connectionId: 'warehouse',
|
||||
|
|
@ -184,6 +214,84 @@ describe('KtxBigQueryScanConnector', () => {
|
|||
]);
|
||||
});
|
||||
|
||||
it('introspects a foreign-hosted dataset under its own project while billing stays local', async () => {
|
||||
const clientFactory = fakeClientFactory();
|
||||
const connector = new KtxBigQueryScanConnector({
|
||||
connectionId: 'warehouse',
|
||||
connection: {
|
||||
driver: 'bigquery',
|
||||
dataset_ids: ['bigquery-public-data.austin_311'],
|
||||
credentials_json: JSON.stringify({ project_id: 'project-1' }),
|
||||
location: 'US',
|
||||
},
|
||||
clientFactory,
|
||||
});
|
||||
|
||||
const snapshot = await connector.introspect({ connectionId: 'warehouse', driver: 'bigquery' }, { runId: 'foreign' });
|
||||
|
||||
const client = vi.mocked(clientFactory.createClient).mock.results[0]?.value as KtxBigQueryClient;
|
||||
expect(client.dataset).toHaveBeenCalledWith('austin_311', 'bigquery-public-data');
|
||||
expect(clientFactory.createClient).toHaveBeenCalledWith(expect.objectContaining({ projectId: 'project-1' }));
|
||||
expect(snapshot.scope).toEqual({
|
||||
catalogs: ['bigquery-public-data'],
|
||||
      datasets: ['bigquery-public-data.austin_311'],
    });
    expect(snapshot.metadata.project_id).toBe('project-1');
    expect(snapshot.tables[0]).toMatchObject({
      catalog: 'bigquery-public-data',
      db: 'austin_311',
      name: 'orders',
    });
  });

  it('introspects datasets across multiple host projects, each under its own project', async () => {
    const clientFactory = fakeClientFactory();
    const connector = new KtxBigQueryScanConnector({
      connectionId: 'warehouse',
      connection: {
        driver: 'bigquery',
        dataset_ids: ['bigquery-public-data.austin_311', 'analytics'],
        credentials_json: JSON.stringify({ project_id: 'project-1' }),
        location: 'US',
      },
      clientFactory,
    });

    const snapshot = await connector.introspect({ connectionId: 'warehouse', driver: 'bigquery' }, { runId: 'multi' });

    const client = vi.mocked(clientFactory.createClient).mock.results[0]?.value as KtxBigQueryClient;
    expect(client.dataset).toHaveBeenCalledWith('austin_311', 'bigquery-public-data');
    expect(client.dataset).toHaveBeenCalledWith('analytics', 'project-1');
    expect(snapshot.scope.catalogs).toEqual(['bigquery-public-data', 'project-1']);
    expect(snapshot.scope.datasets).toEqual(['bigquery-public-data.austin_311', 'analytics']);
    expect(snapshot.tables.map((table) => ({ catalog: table.catalog, db: table.db, name: table.name }))).toEqual([
      { catalog: 'bigquery-public-data', db: 'austin_311', name: 'orders' },
      { catalog: 'project-1', db: 'analytics', name: 'orders' },
    ]);
  });

  it('keeps same-named datasets in different projects distinct', async () => {
    const clientFactory = fakeClientFactory();
    const connector = new KtxBigQueryScanConnector({
      connectionId: 'warehouse',
      connection: {
        driver: 'bigquery',
        dataset_ids: ['proj_a.shared', 'proj_b.shared'],
        credentials_json: JSON.stringify({ project_id: 'project-1' }),
      },
      clientFactory,
    });

    const snapshot = await connector.introspect({ connectionId: 'warehouse', driver: 'bigquery' }, { runId: 'same-name' });

    expect(snapshot.scope.catalogs).toEqual(['proj_a', 'proj_b']);
    expect(snapshot.scope.datasets).toEqual(['proj_a.shared', 'proj_b.shared']);
    expect(snapshot.tables.map((table) => `${table.catalog}.${table.db}.${table.name}`)).toEqual([
      'proj_a.shared.orders',
      'proj_b.shared.orders',
    ]);
  });

  it.each([
    Object.assign(new Error('Access Denied'), { code: 403 }),
    Object.assign(new Error('Not found'), { errors: [{ reason: 'notFound' }] }),
@@ -330,6 +438,50 @@ describe('KtxBigQueryScanConnector', () => {
    expect(skippedGet).not.toHaveBeenCalled();
  });

  it('skips a table that fails introspection and ingests its healthy siblings', async () => {
    const ordersGet = vi.fn(async (): ReturnType<KtxBigQueryTableRef['get']> => [
      { metadata: { type: 'TABLE', numRows: '5', schema: { fields: [{ name: 'id', type: 'INT64', mode: 'REQUIRED' }] } } },
    ]);
    const brokenGet = vi.fn(async (): ReturnType<KtxBigQueryTableRef['get']> => {
      throw new Error('Access Denied: Table project-1:analytics.locked');
    });
    const clientFactory: KtxBigQueryClientFactory = {
      createClient: vi.fn(() => ({
        getDatasets: vi.fn(async (): ReturnType<KtxBigQueryClient['getDatasets']> => [[{ id: 'analytics' }]]),
        dataset: vi.fn(
          (): KtxBigQueryDataset => ({
            get: vi.fn(async () => [{ id: 'analytics' }]),
            getTables: vi.fn(async (): ReturnType<KtxBigQueryDataset['getTables']> => [
              [
                { id: 'orders', get: ordersGet },
                { id: 'locked', get: brokenGet },
              ],
            ]),
          }),
        ),
        createQueryJob: vi.fn(async (): ReturnType<KtxBigQueryClient['createQueryJob']> => [
          {
            getQueryResults: async (): ReturnType<KtxBigQueryQueryJob['getQueryResults']> => [
              [],
              undefined,
              { schema: { fields: [{ name: 'table_name', type: 'STRING' }, { name: 'column_name', type: 'STRING' }] } },
            ],
          },
        ]),
      })),
    };
    const connector = new KtxBigQueryScanConnector({ connectionId: 'warehouse', connection, clientFactory });
    const snapshot = await connector.introspect({ connectionId: 'warehouse', driver: 'bigquery' }, { runId: 'skip-test' });

    expect(snapshot.tables.map((table) => table.name)).toEqual(['orders']);
    expect(snapshot.warnings).toHaveLength(1);
    expect(snapshot.warnings?.[0]).toMatchObject({
      code: 'object_introspection_failed',
      table: 'locked',
      metadata: { object: 'project-1.analytics.locked' },
    });
  });

  it('constructs for discovery without dataset scope and lists tables through one region information schema query', async () => {
    const createQueryJob = vi.fn(
      async (
@@ -441,7 +593,7 @@ describe('KtxBigQueryScanConnector', () => {
    const clientFactory = fakeClientFactory();
    const connector = new KtxBigQueryScanConnector({
      connectionId: 'warehouse',
-     connection: { ...connection, max_bytes_billed: '987654321', job_timeout_ms: 30_000 },
+     connection: { ...connection, max_bytes_billed: '987654321', query_timeout_ms: 30_000 },
      clientFactory,
    });
@@ -491,4 +643,35 @@ describe('KtxBigQueryScanConnector', () => {
      ]),
    });
  });

  it('maps a BigQuery job timeout to KtxQueryError', async () => {
    const timeoutError = new Error('Job execution was cancelled: Job timed out after 5000ms');
    const clientFactory: KtxBigQueryClientFactory = {
      createClient: vi.fn(() => ({
        getDatasets: vi.fn(async (): ReturnType<KtxBigQueryClient['getDatasets']> => [[{ id: 'analytics' }]]),
        dataset: vi.fn(
          (datasetId: string): KtxBigQueryDataset => ({
            get: vi.fn(async () => [{ id: datasetId }]),
            getTables: vi.fn(async (): ReturnType<KtxBigQueryDataset['getTables']> => [[]]),
          }),
        ),
        createQueryJob: vi.fn(async (): ReturnType<KtxBigQueryClient['createQueryJob']> => {
          throw timeoutError;
        }),
      })),
    };
    const connector = new KtxBigQueryScanConnector({
      connectionId: 'warehouse',
      connection: { ...connection, query_timeout_ms: 5_000 },
      clientFactory,
    });

    const execution = connector.executeReadOnly(
      { connectionId: 'warehouse', sql: 'select count(*) from `project-1`.`analytics`.`orders`' },
      { runId: 'scan-run-1' },
    );
    await expect(execution).rejects.toBeInstanceOf(KtxQueryError);
    await expect(execution).rejects.toThrow('query exceeded 5s');
    await expect(execution).rejects.toMatchObject({ cause: timeoutError });
  });
});
Some files were not shown because too many files have changed in this diff.