Mirror of https://github.com/Kaelio/ktx.git (synced 2026-07-01 08:59:39 +02:00)
feat: ktx batch — scan resilience, analytics SQL craft, connector hardening (#312)
* docs: add spider2-specs handoff directory for benchmark-driven feature specs
* feat(cli): connection-scoped wiki pages
Add an optional `connections` frontmatter field so database-specific wiki
knowledge can be scoped to a connection without polluting searches about other
databases, while page keys stay a flat, globally-unique namespace.
- connections: single string or list; absent/empty ⇒ unscoped (applies to all)
- wiki_search (MCP) and `ktx wiki --connection` return unscoped ∪ matching
pages, filtered at the disk-load seam so all three search lanes draw their
candidate pool from the already-scoped set (not a post-filter)
- wiki_write accepts connections with REPLACE semantics and rejects a
connection-scoped write whose key collides with a disjoint-connection page
(data-loss guard; hard error, no silent clobber)
- explicit connection-id args (wiki_search, memory_ingest, ktx wiki) are
validated against ktx.yaml via a shared assertConfiguredConnectionId, which
also closes the prior gap where memory_ingest's connectionId was unvalidated;
persisted ids absent from config warn (not fail) in `ktx status`
- prompt guidance in the wiki_capture skill and external-ingest prompt; the
session connectionId is surfaced to the memory agent and ingest work units
Implements spider2-specs/specs/01-connection-scoped-wiki.md; intake draft moved
to spider2-specs/done/.
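A minimal sketch of the "unscoped ∪ matching" rule described above, assuming a hypothetical in-memory page shape. Only the `connections` field name comes from this commit; `WikiPage`, `scopeOf`, and `pagesForConnection` are illustrative names, not the ktx implementation.

```typescript
// Hypothetical page shape; `connections` mirrors the frontmatter field.
interface WikiPage {
  key: string;
  connections?: string | string[];
}

// Normalize the single-string-or-list field; absent or empty means unscoped.
function scopeOf(page: WikiPage): string[] {
  const c = page.connections;
  if (c === undefined) return [];
  return (Array.isArray(c) ? c : [c]).filter((id) => id.length > 0);
}

// Keep unscoped pages plus pages scoped to the requested connection.
// Applied before search, this defines the candidate pool (not a post-filter).
function pagesForConnection(pages: WikiPage[], connectionId: string): WikiPage[] {
  return pages.filter((page) => {
    const scope = scopeOf(page);
    return scope.length === 0 || scope.indexOf(connectionId) !== -1;
  });
}

const pages: WikiPage[] = [
  { key: "glossary" },
  { key: "orders-notes", connections: "warehouse_a" },
  { key: "billing-notes", connections: ["warehouse_b", "warehouse_c"] },
  { key: "empty-scope", connections: [] },
];

const visibleKeys = pagesForConnection(pages, "warehouse_a").map((p) => p.key);
```

Here `visibleKeys` holds the two unscoped pages plus `orders-notes`; `billing-notes` never enters the pool.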
* docs(spider2-specs): add specs/ refinement stage and composite-key join spec
Describe the todo/ → specs/ → done/ pipeline in the README (refined specs are
the durable artifact; intake drafts move to done/ on ship) and add a
MEDIUM-priority spec for multi-column composite-key join detection found during
the first sqlite smoke test.
* feat(cli): add --verbatim ingest mode for authoritative documents
Store each --text/--file document body unchanged as a GLOBAL wiki page
instead of routing it through the memory agent, which may rewrite,
condense, or re-title it. The LLM derives only metadata (summary, tags,
sl_refs) and only for frontmatter fields the document does not already
set; the stored body is written by code and never edited.
- Deterministic page key: files derive it from the filename, inline
text from its leading Markdown heading (headless inline text is
rejected — pass it as --file instead).
- Idempotent: re-running the same body is a no-op; a different body at
the same key fails loudly rather than overwriting.
- Works with llm.provider.backend: none, deriving a degraded summary
from the heading or first sentence.
- Existing frontmatter (including unmodeled fields like effective_date)
passes through untouched; --connection-id scopes the page.
* feat(cli): SQL-authoring craft and per-dialect notes tool for the analytics skill
Spec 07: add a dialect-agnostic <sql_craft> block to the ktx-analytics skill (schema discovery, composition, window-function correctness, numeric precision, answer completeness) with one worked window-then-filter example. Workflow steps gain pointers into it; existing guidance is unchanged.
Spec 08: add a read-only sql_dialect_notes MCP tool returning a connection's engine SQL conventions (FQTN form, identifier quoting/case, date/time, top-N idiom, JSON access), resolved through the existing sqlAnalysisDialectForDriver path. Notes are per-dialect markdown files under context/sql-analysis/dialects, served by the tool and copied to dist (package-internal, never installed). Non-SQL connections return a clear KtxExpectedError. The flat skill gains a one-line pointer to the tool.
Both spider2-specs intake drafts move to done/ with implementation notes.
* feat(cli): tolerate objects that fail introspection during scan
Isolate per-object introspection failures so one broken or inaccessible object no longer zeroes out a connection's whole semantic layer: the sqlite and bigquery connectors introspect each object defensively (tryIntrospectObject), the live-database adapter records a scan outcome and fetch report, and enabled_tables accepts catalog.db.name, db.name, or bare names with a clear no-match error. Includes matching ktx-daemon introspection changes, docs, and tests.
* docs(spider2-specs): add 06-scan-tolerate-broken-objects spec
* feat(cli): generalize analytics fan-out rule to multi-hop join chains
The ktx-analytics skill's fan-out rule only reliably caught single-hop
inflation; agents still silently fanned out on multi-hop chains where the
offending one-to-many join sits several hops below the SUM/COUNT and is easy
to miss.
Rewrite the Composition rule so the danger reads as cumulative across the whole
chain (pre-aggregate per measure-owning table), add an affirmative
grain-verification habit (default: pre-aggregate to grain; escape hatch:
COUNT(DISTINCT key) for pure counts only; SUM/AVG of a fanned-out measure must
pre-aggregate), and add one generic wrong-vs-right worked example. Content-only
and dialect-agnostic; no new tool, flag, or config.
Implements spider2-specs/specs/09 and annotates spec 07's one-example
constraint as superseded.
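The cumulative fan-out can be reproduced on in-memory rows. This is an illustrative sketch with made-up tables (`orders`, `items`, `shipments`), not the worked example shipped in the skill: the measure lives on `orders`, and the one-to-many joins sit one and two hops below it.

```typescript
interface Order { orderId: number; amount: number }
interface OrderItem { orderId: number; sku: string }
interface Shipment { orderId: number; sku: string; box: number }

const orders: Order[] = [{ orderId: 1, amount: 100 }, { orderId: 2, amount: 50 }];
const items: OrderItem[] = [
  { orderId: 1, sku: "a" }, { orderId: 1, sku: "b" }, { orderId: 2, sku: "c" },
];
const shipments: Shipment[] = [
  { orderId: 1, sku: "a", box: 1 }, { orderId: 1, sku: "a", box: 2 },
  { orderId: 1, sku: "b", box: 1 }, { orderId: 2, sku: "c", box: 1 },
];

// Wrong: join orders -> items -> shipments, then SUM(amount) over the joined rows.
// Each order's amount is repeated once per shipment row.
let fannedOutTotal = 0;
for (const o of orders) {
  for (const i of items) {
    if (i.orderId !== o.orderId) continue;
    for (const s of shipments) {
      if (s.orderId === i.orderId && s.sku === i.sku) fannedOutTotal += o.amount;
    }
  }
}

// Right: pre-aggregate the child table to the order grain, then join one row per order.
const boxesByOrder: Record<number, number> = {};
for (const s of shipments) boxesByOrder[s.orderId] = (boxesByOrder[s.orderId] || 0) + 1;

const perOrder = orders.map((o) => ({
  orderId: o.orderId,
  amount: o.amount,
  boxes: boxesByOrder[o.orderId] || 0,
}));
let correctTotal = 0;
for (const row of perOrder) correctTotal += row.amount;
```

The joined total comes out at 350 against a true 150, and the inflation comes from the second hop, which is the case the single-hop wording missed.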
* feat(cli): add panel-completeness, time-series window, and text-encoded numeric SQL craft
Extend the analytics skill's <sql_craft> with three correctness habits and
route the dialect-specific halves through sql_dialect_notes:
- Panel completeness (spec 10): full-domain spine -> LEFT JOIN -> COALESCE for
"each/every/all/per" questions, defaulted by measure additivity.
- Time-series windows (spec 11): explicit cumulative frames, calendar-range
rolling windows with minimum-periods guards, and period-over-period via LAG.
- Text-encoded numerics (spec 12): sample distinct values, strip/scale/cast in
one early CTE, and confirm coverage with a failure-detecting cast.
Add per-dialect Series, Rolling window, and Safe cast notes to all seven
dialect files so the skill stays dialect-agnostic while the engine-specific
syntax lives in sql_dialect_notes. Tests updated and passing (19).
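The spine -> LEFT JOIN -> COALESCE habit from spec 10, sketched over in-memory rows. The month values and the `orders` measure are invented for illustration; the 0 default assumes an additive count, as the rule's additivity caveat requires.

```typescript
// Full domain the question asks about ("orders for each month").
const spine: string[] = ["2024-01", "2024-02", "2024-03"];

// Fact rows: February has no activity, so a GROUP BY over facts alone would omit it.
const facts = [
  { month: "2024-01", orders: 5 },
  { month: "2024-03", orders: 2 },
];

// Aggregate facts by month (the GROUP BY side of the join).
const ordersByMonth: Record<string, number> = {};
for (const f of facts) ordersByMonth[f.month] = (ordersByMonth[f.month] || 0) + f.orders;

// LEFT JOIN the aggregate onto the spine and COALESCE missing months to 0.
const panel = spine.map((month) => ({
  month,
  orders: ordersByMonth[month] !== undefined ? ordersByMonth[month] : 0,
}));
```

`panel` keeps all three months, with February present as an explicit 0 rather than a missing row.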
* docs(spider2-specs): add specs 10-12 for analytics SQL-craft additions
Refined specs and completion records for the panel-completeness spine (10),
time-series window recipes (11), and text-encoded numeric parsing (12)
implemented in the preceding commit.
* docs(spider2-specs): add backlog intake drafts 13-14
- 13: canonical authoritative-source measures
- 14: output-completeness final check
* skill(analytics): spec 14 output-completeness + iter1 (active column planning)
Bundles two changes (entangled in SKILL.md; future spider2 iterations land as
separate commits):
- spec 14 (output-completeness): multi-part "answer every requested output" rule
+ a "Final completeness check" in workflow Step 6 and <sql_craft>; analytics
skill-content test updated; intake draft -> done/, refined spec added.
- iter1 experiment: spec 14's passive end-check did not change behavior on the
benchmark's output-completeness failures, so (a) the Plan step now writes the
exact output-column list UP FRONT as a contract the final SELECT must match,
and (b) "expose identity" -> "project BOTH the entity id and its name" (covers
both omission directions). All generic craft.
Driven by the Spider 2.0-Lite failure analysis (incomplete output was the
largest failure bucket); benchmark only as motivation.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* skill(analytics): iter2 — deterministic order in string/array aggregation
GROUP_CONCAT/string_agg/array_agg element order is undefined without an explicit
ORDER BY; also note that SQLite's default text sort is binary and case-sensitive
(uppercase before lowercase), while COLLATE NOCASE sorts case-insensitively. Generic SQLite craft.
Spider 2.0-Lite motivation: an ordered-ingredient-list question failed only on the
within-string element order (right elements, wrong order); benchmark as motivation only.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(mcp): structured, leveled logging for the MCP server
Add one synchronous pino logger per MCP server process, written through the
io.stderr sink: plain JSON when stderr is not a TTY, colorized pino-pretty
(sync, in-process) when it is. Every tool call logs tool.start with its raw
params BEFORE the handler runs and tool.end after (info; warn when the call ran past
KTX_MCP_SLOW_TOOL_MS; error on failure), correlated by callId plus sessionId, so a
runaway sql_execution leaves a recoverable start line with its exact SQL and
no matching end. HTTP logs session.open/close and wires the previously-dead
transport.onerror to transport.error; stdio routes its transport error
through the logger. Level via KTX_MCP_LOG_LEVEL (default info). Existing
mcp_request_completed telemetry and registerParsedTool are unchanged; no
worker/async transport and no redaction in v1 (logs are local-only).
Implements spider2-specs/specs/15-mcp-server-structured-logging.md and moves
the intake draft to done/.
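One way the start/end pairing could be used when reading the log afterwards: a sketch that finds tool.start lines with no matching tool.end. `callId` and the two event names come from this commit; the `event` and `tool` field names and the `unfinishedCalls` helper are assumptions for illustration, not the logger's actual record shape.

```typescript
// Assumed shape of a parsed JSON log line.
interface ToolLogLine {
  event: "tool.start" | "tool.end";
  callId: string;
  tool: string;
}

// Return every start line whose callId never produced an end line.
function unfinishedCalls(lines: ToolLogLine[]): ToolLogLine[] {
  const ended: Record<string, boolean> = {};
  for (const line of lines) {
    if (line.event === "tool.end") ended[line.callId] = true;
  }
  return lines.filter((line) => line.event === "tool.start" && !ended[line.callId]);
}

const sampleLog: ToolLogLine[] = [
  { event: "tool.start", callId: "c1", tool: "wiki_search" },
  { event: "tool.end", callId: "c1", tool: "wiki_search" },
  { event: "tool.start", callId: "c2", tool: "sql_execution" },
];

const stuck = unfinishedCalls(sampleLog);
```

In this sample only the `sql_execution` call `c2` is reported, which is the runaway-query case the start-before-handler ordering is meant to make recoverable.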
* feat(mcp): report uptimeMs in MCP server /health
The /health endpoint now includes uptimeMs (monotonic elapsed time since
the server started), mirroring the Python daemon's uptime_ms telemetry
field.
* feat(cli): bound read-query execution with a per-connection deadline
Enforce one shared query deadline (default 30s, overridable per connection via
query_timeout_ms) on every executeReadOnly path, so an accidentally-expensive
LLM-authored query returns a fast "query exceeded Ns" KtxQueryError instead of
hanging the MCP server.
- New shared contract context/connections/query-deadline.ts
(resolveQueryDeadlineMs, queryDeadlineExceededError); query_timeout_ms added to
the shared warehouse schema; BigQuery's job_timeout_ms removed.
- SQLite runs the read query in a short-lived forked child process and enforces
the deadline with SIGKILL. worker_threads + terminate() was tried first but
cannot interrupt a synchronous better-sqlite3 scan (the native loop never
yields); SIGKILL reclaims the process in ~2ms and keeps the event loop free.
- Remote connectors apply a real server-side statement timeout and re-wrap their
own timeout signal as KtxQueryError: Postgres statement_timeout/57014, MySQL
max_execution_time/3024, Snowflake STATEMENT_TIMEOUT_IN_SECONDS/604, ClickHouse
max_execution_time + aligned request_timeout/159, SQL Server requestTimeout/
ETIMEOUT, BigQuery jobTimeoutMs.
- Relationship validation skips a candidate to review on a deadline timeout
instead of aborting the pass; the deadline surfaces through the existing MCP
pino logger as a matched tool.start/tool.end(error) pair (no new logging code).
Also fixes a pre-existing, unrelated invalid cast in mcp-server-factory.test.ts
that was breaking tsc -p tsconfig.test.json.
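A rough sketch of the deadline resolution described above, under stated assumptions. The 30s default, the `query_timeout_ms` override, and the "query exceeded Ns" wording come from this commit; the function names, the `ConnectionConfig` shape, and the fallback for non-positive values are hypothetical and not the actual `query-deadline.ts` contract.

```typescript
const DEFAULT_QUERY_DEADLINE_MS = 30000;

// Hypothetical slice of a connection's config.
interface ConnectionConfig {
  query_timeout_ms?: number;
}

// Use the per-connection override when it is a positive finite number,
// otherwise fall back to the shared default (assumed validation rule).
function sketchResolveDeadlineMs(connection: ConnectionConfig): number {
  const override = connection.query_timeout_ms;
  if (typeof override === "number" && isFinite(override) && override > 0) {
    return override;
  }
  return DEFAULT_QUERY_DEADLINE_MS;
}

// Build the user-facing message for a query that ran past its deadline.
function sketchDeadlineMessage(deadlineMs: number): string {
  return "query exceeded " + deadlineMs / 1000 + "s";
}

const defaultDeadline = sketchResolveDeadlineMs({});
const customDeadline = sketchResolveDeadlineMs({ query_timeout_ms: 5000 });
const message = sketchDeadlineMessage(defaultDeadline);
```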
* docs(spider2-specs): mark spec 16 (bounded query execution) done
Append Implementation notes to the refined spec (what shipped, where, and the
worker-thread -> child-process+SIGKILL deviation with its evidence) and move the
intake draft from todo/ to done/.
* skill(analytics): iter3 — measure-as-amount, inter-event gap, top-per-metric career
Three generic interpretation rules: a named business measure (sales/revenue/spend)
means its amount not a row count; "inter-event duration/gap" is LAG/LEAD time-between
events not a magnitude column; "highest across several achievements" aggregates per
metric over the whole history. All three demonstrably FIRE (verified on local008/003/152
SQL). local008 flips to correct (mechanism-aligned). 003/152 still fail on a different
axis (source-column / grouping). Generic craft; benchmark only as motivation.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* skill(analytics): spine-for-extreme-selection + aggregate-over-selected-set
Two generic answer-completeness refinements:
- Selecting the extreme group (lowest/highest count over a period/category
domain) must rank over the COMPLETE spine, not only groups with fact rows —
an empty period is a genuine 0 and often the true minimum.
- An aggregate scoped to a per-entity selected set ('avg revenue per actor in
those top-3 films') is computed ACROSS that set, distinct from the per-item
value; project both.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* skill(analytics): iter2 — sharpen extreme-selection spine + top-N ranking-measure
- spine-for-extreme: concrete cue that a zero-row period never appears in a
GROUP BY of the facts; generate the full calendar, LEFT JOIN, COALESCE, then rank.
- aggregate-over-selected-set: top-N selection ranks by the named ranking measure
(the item's own revenue), independent of the per-item share that feeds the aggregate.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* skill(analytics): iter3 — comparison-between-two-extremes is one wide row
Distinguishes a cross-item comparison ('the difference between the highest and
lowest month' -> single wide row, both extremes side by side + the comparison
column) from 'report a metric for each group' (-> stays long). Generic, question-
derived; targets the wide-vs-long shape gap without affecting per-group long output.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* skill(analytics): iter4 — anchor a period bucket to the named lifecycle event
When a record carries multiple lifecycle timestamps (created/placed, approved,
shipped, delivered, completed, settled) and the question counts/measures records
in a named *completed state* by period ("delivered orders by month", "shipped
items per week"), bucket the period by that named event's own timestamp, not the
record-creation timestamp; the state value is the qualifying filter, the matching
timestamp is the time anchor. Wording priority is explicit — purchased/placed/
created/submitted/ordered keep the start-event timestamp — and a non-temporal
state filter (counts by customer/city/seller with no period) introduces no anchor.
Generic analytics craft: counting completed-state records by their creation date
silently answers "records that later reached that state, grouped by when they
started" instead of the question asked. Surfaced via the spider2-autofix loop;
FAIR_PRODUCT (adversary-screened, restatable from question wording + schema/
semantic-layer lifecycle descriptions, no gold dependency).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* skill(analytics): iter5 — canonicalize observed URL-path variants before page-level analysis
When a question groups/filters/sequences web pages by a path/url column, sample
its distinct values; if the data itself shows /route and /route/ variants for the
same page context, canonicalize in an early CTE (preserve / as root, strip trailing
slashes from non-root paths, map an observed empty path to / only when the column is
a URL path with blank root-page events) and use the canonical path everywhere above.
Explicitly forbids inventing aliases the data doesn't show: no merging different
route names, no stripping query/fragment/host/scheme, no lowercasing, and no
canonicalization when the question asks for raw URL/path or slash-vs-no-slash diffs.
Generic web-analytics craft: raw request logs routinely store the same user-visible
page with and without a trailing slash, so grouping raw labels silently splits one
page into several. Surfaced via the spider2-autofix loop (Codex runner, round r2);
FAIR_PRODUCT (adversary-screened, restatable from URL-path semantics + page-grain
question wording + solver-observed distinct values, no gold dependency). The rule
fired mechanism-aligned on both targets: it flipped local330 (landing/exit page counts), while the
local331 residual is a separate sequence-semantics axis beyond canonicalization.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
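The canonicalization rule, sketched as a plain function rather than SQL. `canonicalPath` and its `blankMeansRoot` flag are illustrative names; the flag stands in for the "only when the column is a URL path with blank root-page events" condition, which the caller has to establish from the sampled values.

```typescript
// Canonicalize a URL path: keep "/" as root, strip trailing slashes from
// non-root paths, and map an empty path to "/" only when the caller has
// confirmed that blanks are root-page events.
// Deliberately does NOT lowercase, drop query/fragment, or merge route names.
function canonicalPath(path: string, blankMeansRoot: boolean): string {
  if (path === "") return blankMeansRoot ? "/" : path;
  if (path === "/") return "/";
  const stripped = path.replace(/\/+$/, "");
  return stripped === "" ? "/" : stripped;
}

const withSlash = canonicalPath("/route/", false);
const withoutSlash = canonicalPath("/route", false);
const root = canonicalPath("/", false);
const blankAsRoot = canonicalPath("", true);
const blankKept = canonicalPath("", false);
const caseKept = canonicalPath("/Route/", false);
```

Grouping on the returned value puts `/route` and `/route/` in the same bucket, while `/Route` stays distinct because case is preserved.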
* skill(analytics): iter6 — coverage over a selected group is a set-membership aggregate
When a question first selects a group of entities ("the top 5 actors", "these
products") and then asks what count/share/percentage of a DIFFERENT subject domain
relates to *these* selected entities ("what % of customers rented films featuring
these actors"), the subject set is the UNION across the whole group: count DISTINCT
subject ids once across the selected entities and return one collective value at the
subject-domain grain — not one row per selected entity (which double-counts subjects
related to more than one entity and answers a different question). Narrowly guarded:
emit one row per entity only when the wording says "for each / per / by / list" or
asks for each entity's own metric ("top 5 players and their batting averages").
The collective-coverage cousin of the existing per-entity selected-set rule. Generic
analytics craft (per-entity metric vs set-level coverage). Surfaced via the
spider2-autofix loop (Codex runner, round r3); FAIR_PRODUCT (adversary-screened,
restatable from wording alone, no gold dependency). Flipped local195 mechanism-aligned
(union COUNT(DISTINCT customer)/total, one scalar); 0 regression across 5 passing
per-entity top-N guards (local023/024/029/212/221 stayed long).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* skill(analytics): label-only joins must LEFT JOIN — incomplete dims silently drop fact rows
Mirror of the existing fan-out rule for the DROP direction: an inner JOIN to a
dimension table used only to attach a display attribute silently discards every
fact row whose key has no parent when the dimension is incomplete (trimmed
catalogs, late-arriving / SCD-gap rows), shrinking counts/sums and the universe
over which shares/averages/medians are computed. Guidance: LEFT JOIN pure
enrichment; inner-join a dimension only when intended as a filter; key the
aggregate/GROUP BY on the fact column, not the dimension column.
Spider2 autofix round 'joindim': flips complex_oracle local050 (FAIL->PASS,
official scorer) — solver dropped the gratuitous products inner-join and
recovered the exact gold. local060/063 also adopt LEFT JOIN (rule fires) but
remain gold-convention-blocked. Guards local061/067 held.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* docs(spider2-specs): add todo/17 — lifecycle-event metrics (semantic-layer)
Draft intake spec surfaced by the spider2-autofix loop (round r1): the model-layer
form of the shipped iter4 lifecycle-date-anchoring skill rule — infer per-state
lifecycle-event metrics (e.g. delivered_orders with defaultTimeDimension = the
delivery timestamp) during enrichment so the correct time anchor is the default for
any consumer, not only an agent that loaded the skill. Generic; FAIR_PRODUCT.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* fix(connectors): accept leading underscore in connection/identifier ids
The safe-identifier validator regex /^[a-zA-Z0-9][a-zA-Z0-9_-]*$/ allowed an
underscore everywhere except the first character, so a connection id / database
name that legitimately starts with '_' (valid in Snowflake, e.g. _1000_GENOMES)
could never be ingested or queried. Allow a leading underscore across all 16
duplicated validators (connection ids, source ids, page/wiki keys, warehouse-
verification tool schemas). Path-safety is unaffected — '.' and '/' remain
excluded, and assertSafePathToken still blocks traversal.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
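The before/after behavior can be checked directly. The first pattern is the one quoted above; the second is one plausible widening that admits a leading underscore, given here as an assumption because the commit message does not quote the new pattern.

```typescript
// Pattern quoted in the commit message: underscore allowed everywhere but first.
const oldSafeId = /^[a-zA-Z0-9][a-zA-Z0-9_-]*$/;

// Assumed widened pattern: underscore also allowed as the first character.
// "." and "/" are still excluded, so path traversal tokens do not match.
const newSafeId = /^[a-zA-Z0-9_][a-zA-Z0-9_-]*$/;

const snowflakeDb = "_1000_GENOMES";
const rejectedBefore = !oldSafeId.test(snowflakeDb);
const acceptedAfter = newSafeId.test(snowflakeDb);
const traversalStillBlocked = !newSafeId.test("../etc") && !newSafeId.test("a/b");
```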
* feat(analytics): generic geospatial query guidance
Add a Snowflake ST_* dialect note (ST_MAKEPOINT lon-first, ST_DWITHIN/ST_CONTAINS/
ST_WITHIN/ST_INTERSECTS, bbox->polygon via ST_MAKEPOLYGON/ST_MAKELINE) and a
dialect-agnostic 'Spatial predicates' recipe in the analytics skill (resolve the
entity geometry, build an area-of-interest polygon, test with the engine's
containment/proximity/overlap predicate; mind lon/lat argument order). Steers the
solver off hand-rolled lat/lon BETWEEN boxes toward correct, index-assisted
geospatial predicates.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* feat(analytics): parse code/dependency text by language grammar
Add two generic <sql_craft> rules: (1) parse imported/required/loaded packages by
the language or manifest format (Java import keep-package-path allowing underscores/
mixed-case; Python import/from + alias stripping; R library/require; .ipynb parse
JSON cell source before language rules; JSON manifests flatten the dependency object
keys), stripping comments/prose and splitting multi-import lines; (2) on a
de-duplicated table with a documented copy/occurrence count, choose COUNT(*) vs the
weight column from the population the question names, not silently. Steers off one
broad regex that drops valid identifiers and matches prose.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* feat(analytics): source filters/dates/measures from the owning fact grain
Add a <sql_craft> rule for joined fact tables at different grains (parent order
vs child line item): read each predicate, calendar bucket, and measure from the
table whose grain the question names, not whichever is in scope post-join. An
order-grain filter ("orders that are Complete", "the order's creation date")
must come from the parent even though the child carries its own status/created_at;
line price/cost come from the child. Mirror at metric grain: don't combine a
parent-grain count with child rows (num_of_item * SUM(line_price) per line) —
aggregate each measure at its own grain before combining.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* feat(analytics): collapse multi-valued classes to one representative per entity before counting/concentration
When an entity carries a multi-valued classification array (IPC/CPC codes, tags)
and the methodology counts entities-per-class or a concentration/diversity metric
(HHI, originality, share), pick ONE representative per entity first (the array's
main/primary/first flag, else a defined fallback like most-frequent), then
aggregate; and use COUNT(DISTINCT entity) when the denominator is defined as a
count of entities. Unnesting the array otherwise multiplies an entity's weight by
its code count, inflating per-class frequencies and skewing the ranking/score.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* feat(connectors): introspect BigQuery datasets hosted in foreign projects
A dataset_ids/dataset_id entry may now be written `project.dataset` to
introspect a dataset hosted in another project while query jobs still bill to
credentials.project_id. Entries are parsed once at the config boundary into
canonical {project, dataset} pairs; introspection, primary-key discovery,
testConnection, getTableRowCount, and listTables (grouped per project) all
resolve in the dataset's own project, and scanned tables are labeled with that
project so sampling, distinct-value, and read queries resolve. Bare entries are
unchanged.
Implements spider2-specs/specs/18-bigquery-cross-project-datasets.md.
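A sketch of parsing an entry into a canonical pair, assuming bare entries resolve to the billing project. `parseDatasetEntry`, `DatasetRef`, the sample project names, and the error for malformed entries are hypothetical; the real parser may validate differently.

```typescript
interface DatasetRef {
  project: string;
  dataset: string;
}

// Parse "dataset" or "project.dataset" into a canonical {project, dataset} pair.
// Bare entries keep the billing project (credentials.project_id in the config).
function parseDatasetEntry(entry: string, billingProject: string): DatasetRef {
  const parts = entry.split(".");
  if (parts.length === 1 && parts[0] !== "") {
    return { project: billingProject, dataset: parts[0] };
  }
  if (parts.length === 2 && parts[0] !== "" && parts[1] !== "") {
    return { project: parts[0], dataset: parts[1] };
  }
  // Assumed behavior: reject anything else instead of guessing.
  throw new Error("invalid dataset entry: " + entry);
}

const bare = parseDatasetEntry("analytics", "my-billing-project");
const foreign = parseDatasetEntry("shared-data.samples", "my-billing-project");
```

`foreign.project` is the hosting project used for introspection, while query jobs would still bill to `my-billing-project`.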
* feat(scan): durable, resumable, bounded relationship detection during enrichment
Move the enrichment persistence boundary to the cost boundary and bound the
open-ended relationship stage (spec 19).
- Checkpoint descriptions + embeddings into the queryable `_schema` manifest
(and the raw enrichment artifacts) before relationship detection runs, via a
new `onCheckpoint` hook + `writeLocalScanEnrichmentCheckpoint`. An interrupted,
budget-truncated, or failed relationship stage now degrades to "no joins",
never "no descriptions".
- Resume the enrichment cache by content identity: re-key the SQLite stage store
on `(connection_id, stage, input_hash)` so a re-run with a fresh runId resumes
finished descriptions/embeddings instead of re-paying for LLM work. The
disposable cache recreates its table if the on-disk key shape differs.
- Make the relationship stage observable and bounded: a sticky wall-clock budget
(`scan.relationships.detectionBudgetMs`, default 600000 ms) + per-unit progress
+ honored `ctx.signal`, threaded through profiling, validation, and composite
detection. On exhaustion/abort it stops scheduling, finalizes, and returns a
partial result instead of throwing or hanging.
- Mark a budget/abort-truncated result partial (diagnostics `partial`/`partialReason`
+ recoverable `relationship_detection_partial` warning). A graceful partial saves
as a completed stage and resumes cheaply; raising the budget changes inputHash
and forces a fresh, fuller run. A process killed mid-stage saves nothing.
Document `detectionBudgetMs` in the ktx.yaml reference. Append implementation
notes to specs/19 and move the intake draft to done/.
Also carries the in-tree per-table enrichment LLM timeout work it builds on
(`description-generation.ts` + the `enrichment_timeout` warning code), which is
intertwined in `local-enrichment.ts`/`types.ts` and cannot be split into a
separately-building commit.
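The stop-scheduling-and-return-partial behavior can be sketched as a synchronous loop with an injected clock. Everything here (`runBounded`, the result shape, the `partialReason` values) is illustrative; only the idea of a wall-clock budget, an abort signal, and a partial result instead of a throw comes from the description above.

```typescript
interface BoundedResult<T> {
  results: T[];
  partial: boolean;
  partialReason?: "budget_exhausted" | "aborted";
}

// Run work units in order until the budget is spent or the caller aborts.
// Once either check trips, no further unit is scheduled and the results
// gathered so far are returned as a partial result.
function runBounded<T>(
  units: Array<() => T>,
  budgetMs: number,
  now: () => number,
  isAborted: () => boolean,
): BoundedResult<T> {
  const startedAt = now();
  const results: T[] = [];
  for (const unit of units) {
    if (isAborted()) return { results, partial: true, partialReason: "aborted" };
    if (now() - startedAt >= budgetMs) {
      return { results, partial: true, partialReason: "budget_exhausted" };
    }
    results.push(unit());
  }
  return { results, partial: false };
}

// Fake clock: each unit "takes" 400 ms.
let fakeNow = 0;
const units = ["a", "b", "c", "d"].map((name) => () => {
  fakeNow += 400;
  return name;
});

const bounded = runBounded(units, 1000, () => fakeNow, () => false);
```

With a 1000 ms budget the first three units run (the clock reads 1200 afterwards) and the fourth is never started, so the caller gets three results flagged partial.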
* feat(scan): bound + retry the per-table enrichment LLM call
The batched table-description call had no retry (sampleTable retried 3x, this did
not), so a single transient backend error (e.g. an overloaded/burst rejection when
many tables enrich concurrently) silently nulled a whole table's descriptions —
observed dropping ~70% of a db's tables during a bad window despite ample quota.
- Wrap generateObject in retryAsync (3 attempts + backoff; KTX_ENRICH_LLM_ATTEMPTS).
- Fresh per-attempt timeout (KTX_ENRICH_LLM_TIMEOUT_MS, default 120s) still bounds a
wedged wide table; a timeout is surfaced as KtxAbortedError so it is NOT retried
(one wedge stays one timeout, not 3x).
- Granular per-table progress + start/done/retry/timeout logging.
Composes with spec 19 (its non-goal #1): spec 19 makes completed descriptions durable;
this makes more of them complete.
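The retry policy (retry transient errors, never retry a timeout) reduced to a synchronous sketch that records the backoff delays it would wait instead of sleeping. `retrySketch`, the `Outcome` type, and the 500 ms doubling backoff are assumptions; the real code wraps an async call in `retryAsync`, whose signature is not shown in this commit message.

```typescript
type Outcome<T> =
  | { ok: true; value: T }
  | { ok: false; kind: "transient" | "timeout" };

interface RetryReport<T> {
  outcome: Outcome<T>;
  attempts: number;
  backoffsMs: number[];
}

// Retry transient failures up to maxAttempts; stop at once on success or timeout,
// so one wedged call costs one timeout rather than maxAttempts of them.
function retrySketch<T>(attempt: (n: number) => Outcome<T>, maxAttempts: number): RetryReport<T> {
  const backoffsMs: number[] = [];
  let last: Outcome<T> = { ok: false, kind: "transient" };
  for (let n = 1; n <= maxAttempts; n++) {
    last = attempt(n);
    if (last.ok || last.kind === "timeout") {
      return { outcome: last, attempts: n, backoffsMs };
    }
    if (n < maxAttempts) backoffsMs.push(500 * Math.pow(2, n - 1));
  }
  return { outcome: last, attempts: maxAttempts, backoffsMs };
}

// Two transient errors, then success on the third attempt.
const flaky = retrySketch<string>(
  (n) => (n < 3 ? { ok: false, kind: "transient" } : { ok: true, value: "described" }),
  3,
);

// A timeout on the first attempt is surfaced immediately.
const wedged = retrySketch<string>(() => ({ ok: false, kind: "timeout" }), 3);
```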
* feat(scan): survive a hung LLM enrichment backend and resume descriptions
Two compounding failure modes on the per-table description-enrichment path (spec 20):
- Enforced per-table timeout for subprocess backends: the runtime declares whether
it owns an SDK subprocess (subprocessForkSpec on KtxLlmRuntimePort); codex/claude-code
calls run behind a ktx-owned detached child that is tree-killed (SIGKILL of the
process group on POSIX, taskkill /T on Windows) on the deadline or ctx.signal,
reaping the wedged model grandchild. HTTP backends keep native fetch abort. Default
stays 120s, one-wedge-one-timeout.
- Incremental, resumable descriptions persistence: generateDescriptions flushes
enriched tables per batch to an inputHash-tagged durable record (at a stable,
non-syncId path) plus only the changed manifest shards, skips already-enriched
tables on resume, and never lets one table's failure discard the stage (a skipped
table costs one missing description, not the whole stage's output).
Spec 20 refined + intake draft moved to done/.
* feat(scan): selective enrichment stages (--stages) + per-stage cache keys
Split the single coarse enrichment cache key into per-stage hashes
(descriptions <- snapshot + LLM identity; embeddings <- snapshot + embedding
identity + description digest; relationships <- snapshot + relationship settings
+ LLM identity), so changing one stage's inputs invalidates only that stage and
never throws away the expensive per-table descriptions on an unrelated edit.
Add `ktx ingest --stages <list>` to force-re-run a chosen subset on an
already-ingested connection: a named stage bypasses the completed-stage
short-circuit while the per-table descriptions resume record still skips
already-enriched tables, and unselected stages are left untouched on disk. Feed
embeddings + relationships their description context from the on-disk _schema
when descriptions do not run this invocation, and carry descriptions into the
llmProposals evidence packet (closing a latent gap on the full-run path too).
Surface an enrichment_stage_stale warning when an unselected stage's inputs have
drifted, rather than silently cascading the work.
Implements spider2-specs/specs/21-selective-enrichment-stages.md.
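A sketch of how per-stage keys isolate invalidation, using the input lists given above. The `StageInputs` field names and the small non-cryptographic `hashText` helper are stand-ins chosen to keep the example dependency-free; the real code presumably uses a proper digest.

```typescript
// Small deterministic string hash (djb2-xor); a stand-in for a real digest.
function hashText(text: string): string {
  let h = 5381;
  for (let i = 0; i < text.length; i++) {
    h = (((h << 5) + h) ^ text.charCodeAt(i)) >>> 0;
  }
  return h.toString(16);
}

// Hypothetical bundle of everything the three stages depend on.
interface StageInputs {
  snapshot: string;
  llmIdentity: string;
  embeddingIdentity: string;
  descriptionDigest: string;
  relationshipSettings: string;
}

// Each stage hashes only its own inputs.
function stageKeys(i: StageInputs) {
  return {
    descriptions: hashText([i.snapshot, i.llmIdentity].join("|")),
    embeddings: hashText([i.snapshot, i.embeddingIdentity, i.descriptionDigest].join("|")),
    relationships: hashText([i.snapshot, i.relationshipSettings, i.llmIdentity].join("|")),
  };
}

const base: StageInputs = {
  snapshot: "snap-1",
  llmIdentity: "llm-a",
  embeddingIdentity: "embed-v1",
  descriptionDigest: "digest-9",
  relationshipSettings: "budget=600000",
};

const before = stageKeys(base);
// Swap only the embedding model identity.
const after = stageKeys({
  snapshot: base.snapshot,
  llmIdentity: base.llmIdentity,
  embeddingIdentity: "embed-v2",
  descriptionDigest: base.descriptionDigest,
  relationshipSettings: base.relationshipSettings,
});
```

Changing the embedding identity changes only the embeddings key, so the descriptions and relationships stages still hit their cached results.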
* test(analytics): realign SKILL.md acceptance test with the evolved skill
Three assertions in analytics-skill-content.test.ts drifted from the analytics
SKILL.md as later iterations edited the skill without updating the test:
- the sub-heading was renamed Window functions -> Ordering & aggregation
determinism (iter2), so follow the source name;
- the rule "Expose identity, not just the label" was renamed to "Project BOTH
identity and label" (spec 14), so match the new wording;
- the dialect-FQTN guard false-positived on the Java package example
com.planet_ink.coffee_mud, whose backticks made a 3-segment package path read
as a BigQuery/Snowflake `a.b.c` table reference. Drop the backticks so the
guard stays at full strength without weakening it.
* fix(scan): --stages subset must not delete unselected stages' on-disk artifacts
A --stages subset that omitted descriptions wiped all on-disk ai/db descriptions
from the written _schema. runLocalScan writes the structural manifest shard from
the bare snapshot BEFORE enrichment runs, and the shard merge treats ai/db as
scan-managed and overwrites them with whatever the run emits — none, on a subset
that skips descriptions. Enrichment then read the already-wiped shard via
loadPriorDescriptions and had nothing to restore.
runLocalScanEnrichment now returns the best-available descriptions (fresh-this-run
if descriptions ran, else loaded from the on-disk _schema) instead of [], and
runLocalScan captures the prior descriptions before the structural write and feeds
them to both the structural write and enrichment, so an unselected stage's
artifacts survive. Joins were already preserved for --stages descriptions via the
manual/inferred preservedJoins path.
Tests: a full runLocalScan --stages relationships path test (RED without the fix,
GREEN with it — the earlier unit test missed the structural-pre-write ordering),
plus enrichment-layer contract tests for both directions. Validated live on
northwind: --stages relationships keeps all 110 descriptions + 22 joins (was
wiping to 0); --stages descriptions restores descriptions from the spec-20 resume
record (no LLM calls) while keeping joins.
* feat(dialects): bigquery nested-data (ARRAY/STRUCT/UNNEST), geospatial (GEOGRAPHY), SAFE_DIVIDE
bigquery.md lacked the two sections that define BigQuery analytics (present in snowflake.md):
- Nested & repeated data: UNNEST to flatten arrays of STRUCTs (GA360 hits, GA4 event_params),
dot-notation field access, key-value param scalar-subquery extraction, fan-out/COUNT(DISTINCT) guard.
- Geospatial (GEOGRAPHY): ST_GEOGPOINT (lon-first), containment/proximity/distance/intersection
predicates, areal allocation via ST_AREA(ST_INTERSECTION()).
- SAFE_DIVIDE for zero-denominator-safe rates; sharded-table shard-presence note.
Generic BigQuery craft surfaced by sql_dialect_notes; product-completeness (any BQ analyst benefits).
* feat(dialects): sqlite ROUND half-up FP-underflow note (+1e-9 before ROUND)
SQLite ROUND(x,n) rounds half-away-from-zero, but binary floating point often stores
a decimal half-way value slightly below the true midpoint, so ROUND(6.475,2) returns
6.47, not 6.48. Add a dialect note: add a tiny epsilon (1e-9, well below display
precision) before rounding for deterministic half-up, leaving non-boundary values unchanged.
Generic SQLite craft surfaced by sql_dialect_notes (any analyst rounding a
displayed average/rate/price benefits).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
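The same floating-point effect is visible outside SQLite. This sketch uses JavaScript's `Number.prototype.toFixed` as an analogy (it is not SQLite's ROUND), because both operate on the same IEEE 754 double, in which 6.475 is stored slightly below the decimal midpoint. `roundHalfUpDisplay` is an illustrative helper and assumes positive values.

```typescript
// Add a tiny epsilon, far below display precision, before rounding.
// Assumes x is positive; a negative value would need the nudge subtracted.
function roundHalfUpDisplay(x: number, digits: number): string {
  return (x + 1e-9).toFixed(digits);
}

// 6.475 is stored as roughly 6.47499999999999964, so plain rounding goes down.
const plain = (6.475).toFixed(2);

// The nudge lifts the stored value past the midpoint.
const nudged = roundHalfUpDisplay(6.475, 2);

// A non-boundary value is unaffected by the nudge.
const untouched = roundHalfUpDisplay(6.474, 2);
```

Here `plain` is "6.47", `nudged` is "6.48", and `untouched` stays "6.47".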
* docs(analytics): list-as-delimited-string, answer-literally, drop free-text columns
Add SKILL.md guidance to emit list-valued answer cells as delimited
STRING (not ARRAY/repeated column), answer the literal ask without
unrequested transformations (HAVING for aggregate bounds), and avoid
projecting unrequested free-text columns that corrupt row-delimited output.
* fix(scan,mcp): gitignore runtime logs, budget-guard LLM proposal, validate enrich timeout
- gitignore `.ktx/logs/` in both scaffold + setup-merge lists: the managed MCP
daemon writes raw tool params (SQL, memory_ingest content) to mcp.log under a
version-controlled `.ktx/`, and snowflake.log already sat there unprotected.
- gate the LLM relationship proposal on the detection budget/abort signal so an
exhausted or aborted stage cannot start a fresh LLM call; document the boundary.
- validate KTX_ENRICH_LLM_TIMEOUT_MS (NaN/0 → 120s default) like enrichAttempts,
so a bad value no longer times out every table immediately.
- daemon introspection now warns on malformed column/FK rows instead of dropping
them silently, matching the table-row path and the "surface broken objects" goal.
- docs: document `ktx wiki -c/--connection`; fix the SQLite query-deadline schema
doc (forked-subprocess SIGKILL, not worker-thread termination).
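As an illustration of the `KTX_ENRICH_LLM_TIMEOUT_MS` validation described above, here is a minimal sketch. The function name is hypothetical, and treating negative or empty values as invalid is an assumption; the commit message only names NaN and 0 as falling back to the 120s default.

```typescript
const DEFAULT_ENRICH_LLM_TIMEOUT_MS = 120_000;

// Hypothetical helper: resolve the raw env value to a usable timeout.
function resolveEnrichTimeoutMs(raw: string | undefined): number {
  const parsed = Number(raw);
  // Unset, empty, non-numeric, zero, and (by assumption) negative values
  // all fall back to the default instead of timing out every table at once.
  if (raw === undefined || raw.trim() === '' || !Number.isFinite(parsed) || parsed <= 0) {
    return DEFAULT_ENRICH_LLM_TIMEOUT_MS;
  }
  return parsed;
}

console.log(resolveEnrichTimeoutMs('abc')); // 120000
console.log(resolveEnrichTimeoutMs('0')); // 120000
console.log(resolveEnrichTimeoutMs('300000')); // 300000
```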
* fix(scan,wiki,mcp): address PR #312 review findings
- scan: key the description pipeline (resume map, enriched-schema and
embedding-text lookups, manifest write/read) by full table identity via
tableRefKey/buildTableRef, so two same-named tables in different schemas no
longer cross-assign descriptions or skip a sibling on resume
- scan: re-throw a genuine context cancel during the batched description LLM
call so Ctrl-C resumes the stage instead of nulling tables and recording it
completed; per-table timeouts still degrade (context.signal not aborted)
- scan: report statisticalValidation 'skipped' (not 'completed') when a
budget/abort stop leaves relationship profiling partial
- wiki: sync the full page corpus into the sqlite index and filter only the
candidate/result set, so a connection-scoped search no longer prunes other
connections' pages and cached embeddings from the shared index
- wiki: route verbatim ingest through the canonical writePageAndSync so
contentHash is set and later syncs can short-circuit
- mcp: drop the as-unknown-as cast in serializeMcpError
- dialects/analytics: document the integer-division trap on postgres/sqlite/tsql
Adds regression tests for each behavior change.
* fix(wiki): scope connection filter before SQLite lane limit
Connection-scoped wiki search applied the connectionId allowlist after
the lexical/semantic lanes had already truncated to laneCandidatePoolLimit
over the full (connection-agnostic) corpus. When the requested connection
was a minority of a large corpus, its pages were crowded out of the
candidate pool before filtering, so a semantic-only match could be missed
outright and lexical hits under-ranked.
Push the path allowlist into searchLexicalCandidates/searchSemanticCandidates
so LIMIT applies to in-scope rows, matching what the token lane already did,
and drop the now-redundant post-limit JS filters.
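The crowding-out effect described above can be sketched with an in-memory stand-in for a search lane. This is not the project's SQLite query; the `Candidate` shape and both function names are hypothetical, and the point is only the ordering of the filter and the limit.

```typescript
interface Candidate {
  path: string;
  connection: string | null; // null means unscoped
  score: number;
}

const inScope = (row: Candidate, connectionId: string): boolean =>
  row.connection === null || row.connection === connectionId;

const byScoreDesc = (rows: Candidate[]): Candidate[] =>
  [...rows].sort((a, b) => b.score - a.score);

// Old behavior: truncate over the whole corpus, then filter.
function limitThenFilter(rows: Candidate[], connectionId: string, limit: number): Candidate[] {
  return byScoreDesc(rows).slice(0, limit).filter((row) => inScope(row, connectionId));
}

// Fixed behavior: restrict to in-scope rows first, so the limit applies to them.
function filterThenLimit(rows: Candidate[], connectionId: string, limit: number): Candidate[] {
  return byScoreDesc(rows.filter((row) => inScope(row, connectionId))).slice(0, limit);
}

const corpus: Candidate[] = [
  { path: 'crm/a.md', connection: 'crm', score: 0.9 },
  { path: 'crm/b.md', connection: 'crm', score: 0.8 },
  { path: 'crm/c.md', connection: 'crm', score: 0.7 },
  { path: 'warehouse/mrr.md', connection: 'warehouse', score: 0.5 },
];

console.log(limitThenFilter(corpus, 'warehouse', 3).length); // 0
console.log(filterThenLimit(corpus, 'warehouse', 3).length); // 1
```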
---------
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
parent 2afab61417
commit f65a5b0e2e
200 changed files with 17780 additions and 672 deletions
@ -34,8 +34,10 @@ connection is selected.
|
||||||
| `--query-history` | Include database query-history usage patterns | Stored connection default |
|
| `--query-history` | Include database query-history usage patterns | Stored connection default |
|
||||||
| `--no-query-history` | Skip database query-history usage patterns for this run | Stored connection default |
|
| `--no-query-history` | Skip database query-history usage patterns for this run | Stored connection default |
|
||||||
| `--query-history-window-days <days>` | BigQuery/Snowflake query-history lookback window for this run | Stored connection default |
|
| `--query-history-window-days <days>` | BigQuery/Snowflake query-history lookback window for this run | Stored connection default |
|
||||||
|
| `--stages <list>` | Comma-separated enrichment stages to (re)run: `descriptions`, `embeddings`, `relationships` | All three |
|
||||||
| `--text <content>` | Capture inline text into **ktx** memory; repeatable | `[]` |
|
| `--text <content>` | Capture inline text into **ktx** memory; repeatable | `[]` |
|
||||||
| `--file <path>` | Capture a text file into **ktx** memory; use `-` for stdin; repeatable | `[]` |
|
| `--file <path>` | Capture a text file into **ktx** memory; use `-` for stdin; repeatable | `[]` |
|
||||||
|
| `--verbatim` | Store each `--text`/`--file` document body unchanged as a `GLOBAL` wiki page; the LLM derives metadata only | `false` |
|
||||||
| `--connection-id <connectionId>` | **ktx** connection id to tag captured text/file notes | - |
|
| `--connection-id <connectionId>` | **ktx** connection id to tag captured text/file notes | - |
|
||||||
| `--user-id <id>` | Memory user id for text/file capture attribution | `local-cli` |
|
| `--user-id <id>` | Memory user id for text/file capture attribution | `local-cli` |
|
||||||
| `--fail-fast` | Stop after the first failed text/file item | `false` |
|
| `--fail-fast` | Stop after the first failed text/file item | `false` |
|
||||||
|
|
@ -63,6 +65,65 @@ use `--no-input` to fail fast with install guidance.
|
||||||
`--text` and `--file` cannot be combined with a positional `connectionId` or
|
`--text` and `--file` cannot be combined with a positional `connectionId` or
|
||||||
`--all`; pass `--connection-id <id>` instead to tag captured notes.
|
`--all`; pass `--connection-id <id>` instead to tag captured notes.
|
||||||
|
|
||||||
|
### Verbatim ingest

By default, captured text is routed through the memory agent, which decides what
to persist and may rewrite, condense, split, or re-title it. For *authoritative*
documents (metric definitions, formula specs, runbooks, compliance text) that
paraphrasing is a defect. Add `--verbatim` to store each `--text`/`--file`
document body **unchanged** as a `GLOBAL` wiki page:

- The stored body is the input document, written by code; the LLM never edits it.
  It is used only to derive page metadata (`summary`, `tags`, `sl_refs`), and even
  that is skipped for fields the document's own frontmatter already sets.
- The page key is deterministic: a `--file` derives it from the filename, inline
  `--text` from the document's leading Markdown heading (inline text without a
  heading is rejected; pass it as `--file` instead).
- Ingest is idempotent. Re-running the same document is a safe no-op; a different
  body at the same key fails loudly rather than overwriting.
- `--verbatim` works with `llm.provider.backend: none`, and is the only ingest
  path that does. With no backend the `summary` is derived from the heading or
  first sentence and `tags`/`sl_refs` are left empty; the full body is still stored.
- Existing frontmatter passes through untouched (including fields **ktx** does not
  model, such as `effective_date` or `version`); generated metadata only fills
  absent fields. `--connection-id <id>` scopes the page to that connection by
  setting its `connections` frontmatter.
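The idempotency rule in the list above (same body is a no-op, a different body at the same key fails) can be sketched as follows. This is an illustration over an in-memory `Map`, not the project's page writer; `writeVerbatim` and `WriteResult` are hypothetical names, and the error text borrows the wording of the troubleshooting entry.

```typescript
type WriteResult = 'created' | 'unchanged';

// Hypothetical store: page key -> stored body.
function writeVerbatim(store: Map<string, string>, key: string, body: string): WriteResult {
  const existing = store.get(key);
  if (existing === undefined) {
    store.set(key, body); // first ingest: store the body exactly as given
    return 'created';
  }
  if (existing === body) {
    return 'unchanged'; // re-running the same document is a safe no-op
  }
  // Never overwrite silently: a different body at the same key is an error.
  throw new Error(`A different page already exists at key "${key}"`);
}

const pages = new Map<string, string>();
console.log(writeVerbatim(pages, 'rfm-bucket-definitions', '# RFM buckets\n...')); // "created"
console.log(writeVerbatim(pages, 'rfm-bucket-definitions', '# RFM buckets\n...')); // "unchanged"
```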
|
||||||
|
|
||||||
|
### Selecting enrichment stages

Database enrichment runs three stages: `descriptions` (one LLM call per table),
`embeddings` (vectors over the schema and descriptions), and `relationships`
(join detection, optionally LLM-proposed). Each stage is cached on a **per-stage
hash of only its own inputs**, so changing one stage's inputs invalidates only
that stage. Switching the description LLM re-runs only `descriptions`; upgrading
the embeddings model re-runs only `embeddings`; turning on
`scan.relationships.llmProposals` re-runs only `relationships`. The expensive
per-table descriptions are never thrown away because an unrelated setting moved.
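A minimal sketch of per-stage input hashing, under stated assumptions: the settings fields are hypothetical, each stage is simplified to read exactly one setting, and FNV-1a stands in for whatever digest the project actually uses.

```typescript
type Stage = 'descriptions' | 'embeddings' | 'relationships';

// Hypothetical settings; the real per-stage inputs are not shown here.
interface EnrichmentSettings {
  descriptionModel: string;
  embeddingModel: string;
  llmProposals: boolean;
}

// FNV-1a, used only as a small dependency-free stand-in for a real digest.
function fingerprint(text: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return (hash >>> 0).toString(16);
}

// Each stage hashes only the settings it reads.
function stageHash(stage: Stage, s: EnrichmentSettings): string {
  const inputs =
    stage === 'descriptions'
      ? { model: s.descriptionModel }
      : stage === 'embeddings'
        ? { model: s.embeddingModel }
        : { llmProposals: s.llmProposals };
  return fingerprint(JSON.stringify({ stage, inputs }));
}

// Stages whose hash changed between two settings snapshots must re-run.
function invalidatedStages(before: EnrichmentSettings, after: EnrichmentSettings): Stage[] {
  const stages: Stage[] = ['descriptions', 'embeddings', 'relationships'];
  return stages.filter((stage) => stageHash(stage, before) !== stageHash(stage, after));
}

const base: EnrichmentSettings = { descriptionModel: 'llm-a', embeddingModel: 'embed-v1', llmProposals: false };
console.log(invalidatedStages(base, { ...base, embeddingModel: 'embed-v2' })); // [ 'embeddings' ]
```

Because the hash for `descriptions` never sees the embeddings model, upgrading that model leaves the cached descriptions valid.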
|
||||||
|
|
||||||
|
`--stages <list>` re-runs a chosen subset on an already-ingested connection. A
named stage is **force-recomputed** (it bypasses the completed-stage cache),
while unselected stages are left exactly as they are on disk:

- `ktx ingest warehouse --stages embeddings`: re-embed on a new model, keeping
  descriptions and joins.
- `ktx ingest --all --stages relationships --no-query-history`: backfill joins
  across every database after enabling `llmProposals`, without re-paying for
  descriptions.
- `ktx ingest warehouse --stages descriptions`: re-run thin descriptions (for
  example after raising `KTX_ENRICH_LLM_TIMEOUT_MS`). When nothing the
  descriptions depend on changed, the per-table resume record means only the
  tables that previously failed are re-sent to the LLM.

Stage names are validated: an unknown or empty name (`--stages foo`,
`--stages descriptions,foo`, `--stages ""`) is a hard parse error. Naming all three
(`--stages descriptions,embeddings,relationships`) forces a full enrichment
recompute, which is **not** the same as omitting the flag (omitting resumes
whatever is already done). After a selective run, **ktx** warns
(`enrichment_stage_stale`) when an unselected stage's inputs no longer match what
it was last built from; for example, re-running `descriptions` flags
`embeddings` as stale until you re-run `--stages embeddings`. The warning is
informational; **ktx** never silently cascades the extra work.
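The validation rules above can be sketched as a standalone parser. It follows the shape of the `parseEnrichmentStagesOption` helper added in this commit, but it is an illustration only: `STAGES` is an assumed stand-in for the canonical stage registry, and a plain `Error` replaces the CLI library's argument error to keep the block self-contained.

```typescript
// Assumed stage order, taken from the order the docs list them in.
const STAGES = ['descriptions', 'embeddings', 'relationships'] as const;
type Stage = (typeof STAGES)[number];

function parseStages(value: string): Stage[] {
  const names = value
    .split(',')
    .map((name) => name.trim())
    .filter((name) => name.length > 0);
  if (names.length === 0) {
    throw new Error(`must be a non-empty comma-separated list of stages (${STAGES.join(', ')})`);
  }
  const selected = new Set<Stage>();
  for (const name of names) {
    const match = STAGES.find((stage) => stage === name);
    if (match === undefined) {
      // A typo is a hard error; it never degrades to "run everything".
      throw new Error(`unknown stage "${name}"; valid stages are ${STAGES.join(', ')}`);
    }
    selected.add(match);
  }
  // Return in registry order, de-duplicated, regardless of input order.
  return STAGES.filter((stage) => selected.has(stage));
}

console.log(parseStages('relationships, descriptions,descriptions')); // [ 'descriptions', 'relationships' ]
```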
|
||||||
|
|
||||||
## Examples
|
## Examples
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
|
|
@ -77,6 +138,11 @@ ktx ingest warehouse --query-history
|
||||||
# Set the lookback window for BigQuery or Snowflake query history
|
# Set the lookback window for BigQuery or Snowflake query history
|
||||||
ktx ingest warehouse --query-history-window-days 30
|
ktx ingest warehouse --query-history-window-days 30
|
||||||
|
|
||||||
|
# Re-embed one connection on a new embeddings model (descriptions/joins untouched)
|
||||||
|
ktx ingest warehouse --stages embeddings
|
||||||
|
# Backfill LLM-proposed joins across every database without re-describing
|
||||||
|
ktx ingest --all --stages relationships --no-query-history
|
||||||
|
|
||||||
# Build a context-source connection
|
# Build a context-source connection
|
||||||
ktx ingest notion
|
ktx ingest notion
|
||||||
|
|
||||||
|
|
@ -91,6 +157,12 @@ ktx ingest --file docs/revenue-notes.md --connection-id warehouse
|
||||||
|
|
||||||
# Capture one stdin item
|
# Capture one stdin item
|
||||||
printf "Refunds are excluded from net revenue." | ktx ingest --file -
|
printf "Refunds are excluded from net revenue." | ktx ingest --file -
|
||||||
|
|
||||||
|
# Store an authoritative document verbatim (body preserved exactly)
|
||||||
|
ktx ingest --file docs/rfm-bucket-definitions.md --verbatim
|
||||||
|
|
||||||
|
# Store it verbatim and scope it to one connection
|
||||||
|
ktx ingest --file docs/haversine-formula.md --verbatim --connection-id warehouse
|
||||||
```
|
```
|
||||||
|
|
||||||
## Output
|
## Output
|
||||||
|
|
@ -191,3 +263,7 @@ according to `ingest.rateLimit`.
|
||||||
| Python runtime is missing | The selected ingest target needs runtime-backed SQL analysis or source parsing | Accept the interactive prompt, rerun with `--yes`, or run the suggested `ktx admin runtime install` command |
|
| Python runtime is missing | The selected ingest target needs runtime-backed SQL analysis or source parsing | Accept the interactive prompt, rerun with `--yes`, or run the suggested `ktx admin runtime install` command |
|
||||||
| Context-source options were ignored | Query-history flags were supplied for a context-source connection | Omit database-only flags when ingesting context-source connections |
|
| Context-source options were ignored | Query-history flags were supplied for a context-source connection | Omit database-only flags when ingesting context-source connections |
|
||||||
| Text ingest stops early | `--fail-fast` was used and one item failed | Fix the failed item or rerun without `--fail-fast` to collect all failures |
|
| Text ingest stops early | `--fail-fast` was used and one item failed | Fix the failed item or rerun without `--fail-fast` to collect all failures |
|
||||||
|
| `--verbatim requires --text or --file` | `--verbatim` was passed without a document to store | Add `--text` or `--file`, or drop `--verbatim` |
|
||||||
|
| Inline verbatim text needs a leading heading | `--text --verbatim` content has no `# Heading` to derive a stable key | Add a leading Markdown heading, or pass the content as `--file <path>` |
|
||||||
|
| A different page already exists at key | A verbatim re-run targeted an existing key with a different body | Use a distinct document name/key, or remove the existing page first |
|
||||||
|
| Connection scope conflict | Frontmatter `connections` disagrees with `--connection-id` | Remove one so the intended scope is unambiguous |
@ -134,6 +134,13 @@ incomplete.
|
||||||
MySQL, and SQL Server; `schema_names` for Snowflake; `dataset_ids` for
|
MySQL, and SQL Server; `schema_names` for Snowflake; `dataset_ids` for
|
||||||
BigQuery; and `databases` for ClickHouse.
|
BigQuery; and `databases` for ClickHouse.
|
||||||
|
|
||||||
|
A BigQuery `--database-schema` value may be qualified as `project.dataset` to
scan a dataset hosted in another project (such as
`bigquery-public-data.austin_311`); a bare value stays in the credentials'
project. Setup does not discover foreign-project datasets, so supply qualified
entries explicitly. See
[Primary sources → BigQuery](/docs/integrations/primary-sources#cross-project-datasets).
|
||||||
|
|
||||||
With `--no-input`, scope for a scope-bearing driver (PostgreSQL, MySQL,
|
With `--no-input`, scope for a scope-bearing driver (PostgreSQL, MySQL,
|
||||||
ClickHouse, SQL Server, BigQuery, Snowflake) must come from `--database-schema`
|
ClickHouse, SQL Server, BigQuery, Snowflake) must come from `--database-schema`
|
||||||
or from existing connection config in `ktx.yaml` (for example
|
or from existing connection config in `ktx.yaml` (for example
@ -28,10 +28,17 @@ Edit the Markdown files under `wiki/` directly, or ingest source content with
|
||||||
| Flag | Description | Default |
|
| Flag | Description | Default |
|
||||||
|------|-------------|---------|
|
|------|-------------|---------|
|
||||||
| `--user-id <id>` | Local user id | `local` |
|
| `--user-id <id>` | Local user id | `local` |
|
||||||
|
| `-c, --connection <id>` | Scope results to one connection: unscoped pages plus pages tagged with that connection | - |
|
||||||
| `--limit <number>` | Maximum search results (search mode only) | - |
|
| `--limit <number>` | Maximum search results (search mode only) | - |
|
||||||
| `--output <mode>` | Output mode: `pretty` (default in TTY), `plain` (TSV), or `json` | `pretty` |
|
| `--output <mode>` | Output mode: `pretty` (default in TTY), `plain` (TSV), or `json` | `pretty` |
|
||||||
| `--json` | Shortcut for `--output=json` (overrides `--output`) | `false` |
|
| `--json` | Shortcut for `--output=json` (overrides `--output`) | `false` |
|
||||||
|
|
||||||
|
`-c, --connection <id>` takes a connection id from the `connections` map in
`ktx.yaml` (an unknown id is rejected). It narrows both list and search to
pages that are not tied to any connection plus pages tagged with that
connection, so an agent working against one database sees only the wiki
knowledge relevant to it.
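The scoping rule (unscoped pages plus pages tagged with the requested connection) can be expressed as a small predicate. The `WikiPage` shape and function names are hypothetical; the handling of `connections` as a single string or a list, with absent or empty meaning unscoped, follows the frontmatter description in the commit message.

```typescript
interface WikiPage {
  key: string;
  connections?: string | string[]; // frontmatter: one id, a list, or absent
}

// Normalize the frontmatter field to a list; absent or empty means unscoped.
function pageConnections(page: WikiPage): string[] {
  if (page.connections === undefined) return [];
  return Array.isArray(page.connections) ? page.connections : [page.connections];
}

function isInScope(page: WikiPage, connectionId?: string): boolean {
  if (connectionId === undefined) return true; // no filter requested
  const tagged = pageConnections(page);
  return tagged.length === 0 || tagged.includes(connectionId);
}

const wiki: WikiPage[] = [
  { key: 'glossary' },
  { key: 'warehouse-joins', connections: 'warehouse' },
  { key: 'crm-quirks', connections: ['crm'] },
];

console.log(wiki.filter((page) => isInScope(page, 'warehouse')).map((page) => page.key));
// [ 'glossary', 'warehouse-joins' ]
```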
|
||||||
|
|
||||||
`ktx wiki <query>` uses hybrid search when `storage.search` is `sqlite-fts5`.
|
`ktx wiki <query>` uses hybrid search when `storage.search` is `sqlite-fts5`.
|
||||||
**ktx** combines lexical SQLite FTS5 matches, token matches, and semantic matches
|
**ktx** combines lexical SQLite FTS5 matches, token matches, and semantic matches
|
||||||
from wiki page embeddings stored in `.ktx/db.sqlite`. If embeddings are not
|
from wiki page embeddings stored in `.ktx/db.sqlite`. If embeddings are not
|
||||||
|
|
@ -50,6 +57,12 @@ ktx wiki --json
|
||||||
# Search wiki pages
|
# Search wiki pages
|
||||||
ktx wiki "monthly recurring revenue"
|
ktx wiki "monthly recurring revenue"
|
||||||
|
|
||||||
|
# List pages scoped to one connection (unscoped + connection-tagged)
|
||||||
|
ktx wiki --connection warehouse
|
||||||
|
|
||||||
|
# Search within one connection's scope
|
||||||
|
ktx wiki "monthly recurring revenue" -c warehouse
|
||||||
|
|
||||||
# Search wiki pages as JSON
|
# Search wiki pages as JSON
|
||||||
ktx wiki "monthly recurring revenue" --json --limit 10
|
ktx wiki "monthly recurring revenue" --json --limit 10
@ -124,8 +124,10 @@ context-source drivers share the map.
|
||||||
|
|
||||||
Warehouse connections are open objects: the listed fields are validated, and
|
Warehouse connections are open objects: the listed fields are validated, and
|
||||||
any other field is preserved and passed through to the connector. Use
|
any other field is preserved and passed through to the connector. Use
|
||||||
`enabled_tables` to scope ingest to a specific list of
|
`enabled_tables` to scope ingest to a specific list of objects - useful for
|
||||||
`schema.table` names - useful for smoke tests.
|
smoke tests. Each entry accepts a `catalog.db.name`, `db.name`, or bare `name`
|
||||||
|
qualifier. ktx restricts the scan to the listed objects and fails with a clear
|
||||||
|
error (naming the available objects) if none match.
|
||||||
|
|
||||||
```yaml
|
```yaml
|
||||||
connections:
|
connections:
|
||||||
|
|
@ -137,6 +139,18 @@ connections:
|
||||||
- public.customers
|
- public.customers
|
||||||
```
|
```
|
||||||
|
|
||||||
|
For SQLite, which exposes a single `main` schema, the qualified `main.<name>`
and the bare `<name>` forms select the same object:

```yaml
connections:
  local-db:
    driver: sqlite
    path: ./warehouse.db
    enabled_tables:
      - customers # equivalent to main.customers
```
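One way to read the `catalog.db.name`, `db.name`, and bare `name` qualifiers is right-to-left matching, where each extra segment narrows the match. The sketch below is an assumption about the matching rule, not the project's implementation: the `TableIdentity` shape is hypothetical, comparison is case-sensitive, and quoted names containing dots are not handled.

```typescript
interface TableIdentity {
  catalog?: string;
  db: string;
  name: string;
}

// Match an enabled_tables entry against a table, most specific segment last.
function matchesEnabledEntry(entry: string, table: TableIdentity): boolean {
  const parts = entry.split('.');
  if (parts.length > 3 || parts.some((part) => part.length === 0)) return false;
  const [name, db, catalog] = [...parts].reverse();
  if (name !== table.name) return false;
  if (db !== undefined && db !== table.db) return false;
  if (catalog !== undefined && catalog !== table.catalog) return false;
  return true;
}

const customers: TableIdentity = { db: 'main', name: 'customers' };
console.log(matchesEnabledEntry('customers', customers)); // true
console.log(matchesEnabledEntry('main.customers', customers)); // true
console.log(matchesEnabledEntry('other.customers', customers)); // false
```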
|
||||||
|
|
||||||
Connector-specific scope fields let setup and scan use the same warehouse
|
Connector-specific scope fields let setup and scan use the same warehouse
|
||||||
boundary:
|
boundary:
|
||||||
|
|
||||||
|
|
@ -158,6 +172,12 @@ connections:
|
||||||
dataset_ids: [analytics, mart]
|
dataset_ids: [analytics, mart]
|
||||||
```
|
```
|
||||||
|
|
||||||
|
A BigQuery `dataset_ids` / `dataset_id` entry may be written `project.dataset`
to introspect a dataset hosted in another project (for example
`bigquery-public-data.austin_311`); jobs still bill to the `project_id` in
`credentials_json`. A bare `dataset` keeps using your own project. See
[Primary sources → BigQuery](/docs/integrations/primary-sources#cross-project-datasets).
|
||||||
|
|
||||||
For Postgres, MySQL, SQL Server, and Snowflake connections, set
|
For Postgres, MySQL, SQL Server, and Snowflake connections, set
|
||||||
`maxConnections` when scan or ingest work needs to stay below the target's
|
`maxConnections` when scan or ingest work needs to stay below the target's
|
||||||
connection cap. Postgres, MySQL, and SQL Server default to `10`; Snowflake
|
connection cap. Postgres, MySQL, and SQL Server default to `10`; Snowflake
|
||||||
|
|
@ -554,6 +574,7 @@ scan:
|
||||||
profileConcurrency: 4
|
profileConcurrency: 4
|
||||||
validationConcurrency: 4
|
validationConcurrency: 4
|
||||||
validationBudget: all
|
validationBudget: all
|
||||||
|
detectionBudgetMs: 600000
|
||||||
```
|
```
|
||||||
|
|
||||||
### Enrichment
|
### Enrichment
|
||||||
|
|
@ -582,6 +603,7 @@ the manifest.
|
||||||
| `relationships.profileConcurrency` | `int > 0` | `4` | Parallel relationship-profile queries against the database. For pooled connectors, effective database concurrency is also bounded by the connection's `maxConnections`. |
|
| `relationships.profileConcurrency` | `int > 0` | `4` | Parallel relationship-profile queries against the database. For pooled connectors, effective database concurrency is also bounded by the connection's `maxConnections`. |
|
||||||
| `relationships.validationConcurrency` | `int > 0` | `4` | Parallel relationship validation queries against the database. |
|
| `relationships.validationConcurrency` | `int > 0` | `4` | Parallel relationship validation queries against the database. |
|
||||||
| `relationships.validationBudget` | `all` \| `int ≥ 0` | runtime default | Cap on validation queries per scan. `all` means unlimited. |
|
| `relationships.validationBudget` | `all` \| `int ≥ 0` | runtime default | Cap on validation queries per scan. `all` means unlimited. |
|
||||||
|
| `relationships.detectionBudgetMs` | `int > 0` | `600000` | Wall-clock budget (ms) for the whole relationship-detection stage, checked at table-profile, candidate-validation, and composite-probe boundaries. On exhaustion the stage stops scheduling new work and writes the joins found so far, marked partial; descriptions and embeddings are already durable. Sits above the per-query deadline. Raise it to trigger a fresher, fuller run. |
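The budget behavior described for `relationships.detectionBudgetMs` (check at work boundaries, stop scheduling, keep what was found, mark it partial) can be sketched as below. This is a simplified synchronous illustration with hypothetical names; the real stage also checks at candidate-validation and composite-probe boundaries, which are omitted here.

```typescript
interface DetectionResult {
  joins: string[];
  partial: boolean;
}

// Hypothetical detector: `profile` stands in for per-table profiling work,
// and `now` is injectable so the deadline logic can be exercised in tests.
function detectWithBudget(
  tables: string[],
  budgetMs: number,
  profile: (table: string) => string[],
  now: () => number = Date.now,
): DetectionResult {
  const deadline = now() + budgetMs;
  const joins: string[] = [];
  for (const table of tables) {
    // Boundary check: work already started is never interrupted, but no new
    // work is scheduled once the wall-clock budget is spent.
    if (now() >= deadline) {
      return { joins, partial: true };
    }
    joins.push(...profile(table));
  }
  return { joins, partial: false };
}
```

Checking only at boundaries means the stage can overrun the budget by at most one unit of work, which is why it sits above the per-query deadline instead of replacing it.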
|
||||||
|
|
||||||
## `agent`
|
## `agent`
|
||||||
|
|
||||||
|
|
|
||||||
|
|
@ -321,6 +321,23 @@ Useful frontmatter:
|
||||||
5. Add `sl_refs` for relevant semantic sources.
|
5. Add `sl_refs` for relevant semantic sources.
|
||||||
6. Search again with a user-like phrase.
|
6. Search again with a user-like phrase.
|
||||||
|
|
||||||
|
### Ingest an authoritative document verbatim

When the document is already the source of truth (a metric-definition sheet, a
formula spec, a runbook, compliance text), you want **ktx** to index and surface
it, not re-author it. Instead of hand-copying the file into `wiki/global/`, ingest
it verbatim:

```bash
ktx ingest --file docs/rfm-bucket-definitions.md --verbatim
```

The body is stored byte-for-byte (the LLM only derives `summary`, `tags`, and
`sl_refs` for the absent frontmatter fields), the page key is derived from the
filename, and re-running is a safe no-op. Existing frontmatter (including fields
**ktx** does not model, like `effective_date`) passes through unchanged. See
[`ktx ingest`](/docs/cli-reference/ktx-ingest) for the full flag reference.
|
||||||
|
|
||||||
## Review context changes
|
## Review context changes
|
||||||
|
|
||||||
Before accepting agent-written context:
|
Before accepting agent-written context:
|
||||||
|
|
|
||||||
|
|
@ -35,7 +35,7 @@ Agents should prefer environment or file references over literal secrets.
|
||||||
| `context.queryHistory` | No | PostgreSQL, Snowflake, BigQuery | Enables query-history ingestion when the warehouse supports it |
|
| `context.queryHistory` | No | PostgreSQL, Snowflake, BigQuery | Enables query-history ingestion when the warehouse supports it |
|
||||||
| `path` | Yes for path-style SQLite | SQLite | Local SQLite database path or `env:NAME` reference |
|
| `path` | Yes for path-style SQLite | SQLite | Local SQLite database path or `env:NAME` reference |
|
||||||
| `max_bytes_billed` | No | BigQuery | Maximum bytes billed per query job |
|
| `max_bytes_billed` | No | BigQuery | Maximum bytes billed per query job |
|
||||||
| `job_timeout_ms` | No | BigQuery | BigQuery query job timeout in milliseconds |
|
| `query_timeout_ms` | No | all warehouses | Maximum execution time for a single read-only query, in milliseconds (default 30000). A query exceeding it is cancelled server-side (or, for SQLite, by terminating the off-process executor) and returns a `query exceeded Ns` error so the agent can revise. |
|
||||||
| `project_id` | No | BigQuery | Optional local descriptor and mapping metadata; not used for BigQuery authentication |
|
| `project_id` | No | BigQuery | Optional local descriptor and mapping metadata; not used for BigQuery authentication |
|
||||||
|
|
||||||
## PostgreSQL
|
## PostgreSQL
|
||||||
|
|
@ -220,6 +220,37 @@ BigQuery dataset scope is stored in `connections.<id>.dataset_ids`. Interactive
|
||||||
setup discovers datasets from credentials plus location, then writes the chosen
|
setup discovers datasets from credentials plus location, then writes the chosen
|
||||||
dataset ids as the scan scope.
|
dataset ids as the scan scope.
|
||||||
|
|
||||||
|
### Cross-project datasets

To introspect a dataset hosted in a **different project** than the one your
credentials bill to (for example Google's `bigquery-public-data`, a partner's
shared project, or an organization's central data project), qualify the entry
as `project.dataset`:

```yaml title="ktx.yaml"
connections:
  public-bq:
    driver: bigquery
    credentials_json: file:~/.config/gcloud/bq-service-account.json
    location: US
    dataset_ids:
      - bigquery-public-data.austin_311
      - bigquery-public-data.census_bureau_usa
      - analytics
```

**ktx** introspects each dataset in its host project while every query job still
bills to the `project_id` inside your `credentials_json`. A bare `dataset` entry
(no prefix) is scanned in your own project, exactly as before. A single
connection may mix datasets from several projects, and two projects may host
datasets with the same name without colliding.
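A minimal sketch of how a `dataset_ids` entry could be resolved to a host project, assuming the entry is split at its last dot (BigQuery dataset ids contain only letters, digits, and underscores). The function and type names are hypothetical and do not come from the connector.

```typescript
interface DatasetRef {
  projectId: string; // project that hosts the dataset
  datasetId: string;
}

// Resolve `project.dataset` or a bare `dataset` against the credentials' project.
function parseDatasetEntry(entry: string, credentialsProjectId: string): DatasetRef {
  const dot = entry.lastIndexOf('.');
  if (dot === -1) {
    // Bare entry: the dataset lives in the credentials' own project.
    return { projectId: credentialsProjectId, datasetId: entry };
  }
  return { projectId: entry.slice(0, dot), datasetId: entry.slice(dot + 1) };
}

// Keying by project and dataset keeps same-named datasets from colliding.
const refKey = (ref: DatasetRef): string => `${ref.projectId}.${ref.datasetId}`;

console.log(refKey(parseDatasetEntry('bigquery-public-data.austin_311', 'my-project')));
// "bigquery-public-data.austin_311"
console.log(refKey(parseDatasetEntry('analytics', 'my-project')));
// "my-project.analytics"
```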
|
||||||
|
|
||||||
|
Interactive setup does not enumerate datasets in projects your credentials don't
own, so hand-write `project.dataset` entries for foreign datasets. The wizard's
table picker also only lists datasets in your connection's `location` region;
this affects table selection only: ingest and `discover_data` introspect a
cross-project dataset regardless of region.
|
||||||
|
|
||||||
### Authentication
|
### Authentication
|
||||||
|
|
||||||
| Method | Config |
|
| Method | Config |
|
||||||
|
|
@ -269,7 +300,7 @@ staged artifact shape as Postgres and Snowflake.
|
||||||
- Parameter binding uses named `@param` syntax
|
- Parameter binding uses named `@param` syntax
|
||||||
- Arrays flattened to comma-separated strings in results
|
- Arrays flattened to comma-separated strings in results
|
||||||
- Location specified at query execution time
|
- Location specified at query execution time
|
||||||
- Supports `max_bytes_billed` and `job_timeout_ms` limits from `ktx.yaml`
|
- Supports the `max_bytes_billed` limit from `ktx.yaml`; the shared `query_timeout_ms` field maps to the query job's `jobTimeoutMs`
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
|
|
|
||||||
|
|
@ -17,7 +17,9 @@
|
||||||
"test/**/*.test-utils.ts",
|
"test/**/*.test-utils.ts",
|
||||||
"test/**/acceptance-fixtures.ts",
|
"test/**/acceptance-fixtures.ts",
|
||||||
"src/context/scan/relationship-benchmarks.ts!",
|
"src/context/scan/relationship-benchmarks.ts!",
|
||||||
"src/context/scan/relationship-benchmark-report.ts!"
|
"src/context/scan/relationship-benchmark-report.ts!",
|
||||||
|
"src/connectors/sqlite/read-query-child.ts!",
|
||||||
|
"src/context/llm/subprocess-generate-object-child.ts!"
|
||||||
]
|
]
|
||||||
},
|
},
|
||||||
"docs-site": {
|
"docs-site": {
|
||||||
|
|
|
||||||
|
|
@ -78,6 +78,8 @@
|
||||||
"openai": "^6.38.0",
|
"openai": "^6.38.0",
|
||||||
"p-limit": "^7.3.0",
|
"p-limit": "^7.3.0",
|
||||||
"pg": "^8.21.0",
|
"pg": "^8.21.0",
|
||||||
|
"pino": "^10.3.1",
|
||||||
|
"pino-pretty": "^13.1.3",
|
||||||
"posthog-node": "^5.34.9",
|
"posthog-node": "^5.34.9",
|
||||||
"react": "^19.2.6",
|
"react": "^19.2.6",
|
||||||
"semver": "^7.8.1",
|
"semver": "^7.8.1",
|
||||||
|
|
|
||||||
|
|
@ -7,10 +7,17 @@ const promptsSource = join(packageRoot, 'src', 'prompts');
|
||||||
const promptsTarget = join(packageRoot, 'dist', 'prompts');
|
const promptsTarget = join(packageRoot, 'dist', 'prompts');
|
||||||
const skillsSource = join(packageRoot, 'src', 'skills');
|
const skillsSource = join(packageRoot, 'src', 'skills');
|
||||||
const skillsTarget = join(packageRoot, 'dist', 'skills');
|
const skillsTarget = join(packageRoot, 'dist', 'skills');
|
||||||
|
// Per-dialect SQL notes are markdown served by the sql_dialect_notes MCP tool;
|
||||||
|
// tsc does not emit non-.ts files, so copy them next to their compiled module.
|
||||||
|
const dialectNotesSource = join(packageRoot, 'src', 'context', 'sql-analysis', 'dialects');
|
||||||
|
const dialectNotesTarget = join(packageRoot, 'dist', 'context', 'sql-analysis', 'dialects');
|
||||||
|
|
||||||
await rm(promptsTarget, { recursive: true, force: true });
|
await rm(promptsTarget, { recursive: true, force: true });
|
||||||
await rm(skillsTarget, { recursive: true, force: true });
|
await rm(skillsTarget, { recursive: true, force: true });
|
||||||
|
await rm(dialectNotesTarget, { recursive: true, force: true });
|
||||||
await mkdir(dirname(promptsTarget), { recursive: true });
|
await mkdir(dirname(promptsTarget), { recursive: true });
|
||||||
await mkdir(dirname(skillsTarget), { recursive: true });
|
await mkdir(dirname(skillsTarget), { recursive: true });
|
||||||
|
await mkdir(dirname(dialectNotesTarget), { recursive: true });
|
||||||
await cp(promptsSource, promptsTarget, { recursive: true });
|
await cp(promptsSource, promptsTarget, { recursive: true });
|
||||||
await cp(skillsSource, skillsTarget, { recursive: true });
|
await cp(skillsSource, skillsTarget, { recursive: true });
|
||||||
|
await cp(dialectNotesSource, dialectNotesTarget, { recursive: true });
|
||||||
|
|
|
||||||
|
|
@ -133,7 +133,7 @@ export function parseBooleanStringOption(value: string): boolean {
|
||||||
}
|
}
|
||||||
|
|
||||||
export function parseSafeConnectionIdOption(value: string): string {
|
export function parseSafeConnectionIdOption(value: string): string {
|
||||||
if (!/^[a-zA-Z0-9][a-zA-Z0-9_-]*$/.test(value)) {
|
if (!/^[a-zA-Z0-9_][a-zA-Z0-9_-]*$/.test(value)) {
|
||||||
throw new InvalidArgumentError(`Unsafe connection id: ${value}`);
|
throw new InvalidArgumentError(`Unsafe connection id: ${value}`);
|
||||||
}
|
}
|
||||||
return value;
|
return value;
|
||||||
|
|
|
||||||
|
|
@ -1,10 +1,12 @@
|
||||||
import { type Command, Option } from '@commander-js/extra-typings';
|
import { type Command, InvalidArgumentError, Option } from '@commander-js/extra-typings';
|
||||||
import {
|
import {
|
||||||
collectOption,
|
collectOption,
|
||||||
type KtxCliCommandContext,
|
type KtxCliCommandContext,
|
||||||
parsePositiveIntegerOption,
|
parsePositiveIntegerOption,
|
||||||
resolveCommandProjectDir,
|
resolveCommandProjectDir,
|
||||||
} from '../cli-program.js';
|
} from '../cli-program.js';
|
||||||
|
import { KTX_SCAN_ENRICHMENT_STAGES } from '../context/scan/enrichment-state.js';
|
||||||
|
import type { KtxScanEnrichmentStage } from '../context/scan/types.js';
|
||||||
import type { KtxCliDeps, KtxCliIo } from '../index.js';
|
import type { KtxCliDeps, KtxCliIo } from '../index.js';
|
||||||
import { runtimeInstallPolicyFromFlags } from '../managed-python-command.js';
|
import { runtimeInstallPolicyFromFlags } from '../managed-python-command.js';
|
||||||
import type { KtxPublicIngestArgs } from '../public-ingest.js';
|
import type { KtxPublicIngestArgs } from '../public-ingest.js';
|
||||||
|
|
@ -14,6 +16,36 @@ import { resolveConnectionSelection } from './connection-selection.js';
|
||||||
|
|
||||||
profileMark('module:commands/ingest-commands');
|
profileMark('module:commands/ingest-commands');
|
||||||
|
|
||||||
|
/**
 * Parses `--stages` into an ordered, de-duplicated subset of the canonical
 * enrichment-stage registry. An unknown or empty name is a hard parse error so
 * a typo never silently degrades to "run everything."
 *
 * @internal
 */
export function parseEnrichmentStagesOption(value: string): KtxScanEnrichmentStage[] {
  const names = value
    .split(',')
    .map((name) => name.trim())
    .filter((name) => name.length > 0);
  if (names.length === 0) {
    throw new InvalidArgumentError(
      `must be a non-empty comma-separated list of stages (${KTX_SCAN_ENRICHMENT_STAGES.join(', ')})`,
    );
  }
  const valid = new Set<string>(KTX_SCAN_ENRICHMENT_STAGES);
  const selected = new Set<KtxScanEnrichmentStage>();
  for (const name of names) {
    if (!valid.has(name)) {
      throw new InvalidArgumentError(
        `unknown stage "${name}"; valid stages are ${KTX_SCAN_ENRICHMENT_STAGES.join(', ')}`,
      );
    }
    selected.add(name as KtxScanEnrichmentStage);
  }
  return KTX_SCAN_ENRICHMENT_STAGES.filter((stage) => selected.has(stage));
}
|
||||||
|
|
||||||
interface IngestCommandOptions {
|
interface IngestCommandOptions {
|
||||||
runTextIngest: (args: KtxTextIngestArgs, io: KtxCliIo, deps: KtxCliDeps) => Promise<number>;
|
runTextIngest: (args: KtxTextIngestArgs, io: KtxCliIo, deps: KtxCliDeps) => Promise<number>;
|
||||||
}
|
}
|
||||||
|
|
@ -32,8 +64,18 @@ export function registerIngestCommands(
|
||||||
.addOption(new Option('--query-history', 'Include database query-history usage patterns').conflicts('noQueryHistory'))
|
.addOption(new Option('--query-history', 'Include database query-history usage patterns').conflicts('noQueryHistory'))
|
||||||
.addOption(new Option('--no-query-history', 'Skip database query-history usage patterns'))
|
.addOption(new Option('--no-query-history', 'Skip database query-history usage patterns'))
|
||||||
.option('--query-history-window-days <days>', 'Query-history lookback window for this run', parsePositiveIntegerOption)
|
.option('--query-history-window-days <days>', 'Query-history lookback window for this run', parsePositiveIntegerOption)
|
||||||
|
.option(
|
||||||
|
'--stages <stages>',
|
||||||
|
'Comma-separated enrichment stages to (re)run (descriptions,embeddings,relationships); omit to run all',
|
||||||
|
parseEnrichmentStagesOption,
|
||||||
|
)
|
||||||
.option('--text <content>', 'Capture inline text into ktx memory; repeatable', collectOption, [])
|
.option('--text <content>', 'Capture inline text into ktx memory; repeatable', collectOption, [])
|
||||||
.option('--file <path>', 'Capture a text file into ktx memory; use - for stdin; repeatable', collectOption, [])
|
.option('--file <path>', 'Capture a text file into ktx memory; use - for stdin; repeatable', collectOption, [])
|
||||||
|
.option(
|
||||||
|
'--verbatim',
|
||||||
|
'Store each --text/--file document body unchanged as a GLOBAL wiki page; the LLM derives only metadata',
|
||||||
|
false,
|
||||||
|
)
|
||||||
.option('--connection-id <connectionId>', 'ktx connection id to tag captured text/file notes')
|
.option('--connection-id <connectionId>', 'ktx connection id to tag captured text/file notes')
|
||||||
.option('--user-id <id>', 'Memory user id for text/file capture attribution', 'local-cli')
|
.option('--user-id <id>', 'Memory user id for text/file capture attribution', 'local-cli')
|
||||||
.option('--fail-fast', 'Stop after the first failed text/file item', false)
|
.option('--fail-fast', 'Stop after the first failed text/file item', false)
|
||||||
|
|
@ -47,6 +89,14 @@ export function registerIngestCommands(
|
||||||
const projectDir = resolveCommandProjectDir(command);
|
const projectDir = resolveCommandProjectDir(command);
|
||||||
const hasTextCapture = options.text.length > 0 || options.file.length > 0;
|
const hasTextCapture = options.text.length > 0 || options.file.length > 0;
|
||||||
|
|
||||||
|
if (options.verbatim === true && !hasTextCapture) {
|
||||||
|
command.error('error: --verbatim requires --text or --file');
|
||||||
|
}
|
||||||
|
|
||||||
|
if (options.stages !== undefined && hasTextCapture) {
|
||||||
|
command.error('error: --stages applies to database ingest only; it cannot be combined with --text or --file');
|
||||||
|
}
|
||||||
|
|
||||||
if (hasTextCapture) {
|
if (hasTextCapture) {
|
||||||
if (connectionId !== undefined) {
|
if (connectionId !== undefined) {
|
||||||
command.error(
|
command.error(
|
||||||
|
|
@ -66,6 +116,7 @@ export function registerIngestCommands(
|
||||||
userId: options.userId,
|
userId: options.userId,
|
||||||
json: options.json === true,
|
json: options.json === true,
|
||||||
failFast: options.failFast === true,
|
failFast: options.failFast === true,
|
||||||
|
...(options.verbatim === true ? { verbatim: true } : {}),
|
||||||
},
|
},
|
||||||
context.io,
|
context.io,
|
||||||
context.deps,
|
context.deps,
|
||||||
|
|
@ -87,6 +138,7 @@ export function registerIngestCommands(
|
||||||
inputMode: options.input === false ? 'disabled' : 'auto',
|
inputMode: options.input === false ? 'disabled' : 'auto',
|
||||||
queryHistory,
|
queryHistory,
|
||||||
...(options.queryHistoryWindowDays !== undefined ? { queryHistoryWindowDays: options.queryHistoryWindowDays } : {}),
|
...(options.queryHistoryWindowDays !== undefined ? { queryHistoryWindowDays: options.queryHistoryWindowDays } : {}),
|
||||||
|
...(options.stages ? { stages: options.stages } : {}),
|
||||||
cliVersion: context.packageInfo.version,
|
cliVersion: context.packageInfo.version,
|
||||||
runtimeInstallPolicy: runtimeInstallPolicyFromFlags(options),
|
runtimeInstallPolicy: runtimeInstallPolicyFromFlags(options),
|
||||||
};
|
};
|
||||||
|
|
|
||||||
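The `--stages` parser added in this file can be exercised in isolation. The sketch below is a minimal standalone version, assuming a three-entry registry taken from the `--stages` help text (the canonical order of the real `KTX_SCAN_ENRICHMENT_STAGES` is an assumption here) and using a plain `Error` in place of commander's `InvalidArgumentError`. The names `STAGES` and `parseStages` are hypothetical.

```typescript
// Hypothetical registry: the names come from the --stages help text,
// but their canonical order is assumed for this sketch.
const STAGES = ['descriptions', 'embeddings', 'relationships'] as const;
type Stage = (typeof STAGES)[number];

// Standalone stand-in for parseEnrichmentStagesOption; throws a plain Error
// where the CLI code throws commander's InvalidArgumentError.
function parseStages(value: string): Stage[] {
  const names = value
    .split(',')
    .map((name) => name.trim())
    .filter((name) => name.length > 0);
  if (names.length === 0) {
    throw new Error(`must be a non-empty comma-separated list of stages (${STAGES.join(', ')})`);
  }
  const valid = new Set<string>(STAGES);
  const selected = new Set<string>();
  for (const name of names) {
    if (!valid.has(name)) {
      throw new Error(`unknown stage "${name}"; valid stages are ${STAGES.join(', ')}`);
    }
    selected.add(name);
  }
  // Filtering the registry, not the input, restores registry order
  // and drops duplicates in one pass.
  return STAGES.filter((stage) => selected.has(stage));
}
```

Because the result is derived from the registry, the order in which the user typed the stages does not affect the order in which they are returned.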
|
|
@ -27,6 +27,7 @@ export function registerWikiCommands(program: Command, context: KtxCliCommandCon
|
||||||
.usage('[options] [query...]')
|
.usage('[options] [query...]')
|
||||||
.argument('[query...]', 'Search query; omit to list all pages')
|
.argument('[query...]', 'Search query; omit to list all pages')
|
||||||
.option('--user-id <id>', 'Local user id', 'local')
|
.option('--user-id <id>', 'Local user id', 'local')
|
||||||
|
.option('-c, --connection <id>', 'Scope results to one connection (unscoped pages plus pages tagged with it)')
|
||||||
.option('--limit <number>', 'Maximum search results (search mode only)', parsePositiveIntegerOption)
|
.option('--limit <number>', 'Maximum search results (search mode only)', parsePositiveIntegerOption)
|
||||||
.addOption(
|
.addOption(
|
||||||
new Option('--output <mode>', 'Output mode: pretty (default in TTY), plain (TSV), or json').choices([
|
new Option('--output <mode>', 'Output mode: pretty (default in TTY), plain (TSV), or json').choices([
|
||||||
|
|
@ -46,6 +47,7 @@ export function registerWikiCommands(program: Command, context: KtxCliCommandCon
|
||||||
query: string[],
|
query: string[],
|
||||||
options: {
|
options: {
|
||||||
userId: string;
|
userId: string;
|
||||||
|
connection?: string;
|
||||||
limit?: number;
|
limit?: number;
|
||||||
output?: 'pretty' | 'plain' | 'json';
|
output?: 'pretty' | 'plain' | 'json';
|
||||||
json?: boolean;
|
json?: boolean;
|
||||||
|
|
@ -57,6 +59,7 @@ export function registerWikiCommands(program: Command, context: KtxCliCommandCon
|
||||||
command: 'list',
|
command: 'list',
|
||||||
projectDir: resolveCommandProjectDir(command),
|
projectDir: resolveCommandProjectDir(command),
|
||||||
userId: options.userId,
|
userId: options.userId,
|
||||||
|
...(options.connection !== undefined ? { connectionId: options.connection } : {}),
|
||||||
output: options.output,
|
output: options.output,
|
||||||
json: options.json,
|
json: options.json,
|
||||||
cliVersion: context.packageInfo.version,
|
cliVersion: context.packageInfo.version,
|
||||||
|
|
@ -68,6 +71,7 @@ export function registerWikiCommands(program: Command, context: KtxCliCommandCon
|
||||||
projectDir: resolveCommandProjectDir(command),
|
projectDir: resolveCommandProjectDir(command),
|
||||||
query: query.join(' '),
|
query: query.join(' '),
|
||||||
userId: options.userId,
|
userId: options.userId,
|
||||||
|
...(options.connection !== undefined ? { connectionId: options.connection } : {}),
|
||||||
output: options.output,
|
output: options.output,
|
||||||
json: options.json,
|
json: options.json,
|
||||||
...(isDebugEnabled(command) ? { debug: true } : {}),
|
...(isDebugEnabled(command) ? { debug: true } : {}),
|
||||||
|
|
|
||||||
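The `--connection` help text describes the scoped result set as unscoped pages plus pages tagged with the given connection. A minimal sketch of that filter follows, assuming a hypothetical `WikiPage` shape with an optional `connections` field (a single string or a list, as the commit message describes) and assuming that omitting the connection returns every page. This is not the repository's loader, only an illustration of the union semantics.

```typescript
// Hypothetical page shape; only the fields needed for scoping are modeled.
interface WikiPage {
  key: string;
  connections?: string | string[];
}

// Normalize the frontmatter value to a list; absent or empty means unscoped.
function pageConnections(page: WikiPage): string[] {
  if (page.connections === undefined) return [];
  return typeof page.connections === 'string' ? [page.connections] : page.connections;
}

// Returns unscoped pages plus pages tagged with connectionId.
// Assumption: with no connectionId, nothing is filtered out.
function pagesVisibleTo(pages: WikiPage[], connectionId?: string): WikiPage[] {
  if (connectionId === undefined) return pages;
  return pages.filter((page) => {
    const scopes = pageConnections(page);
    return scopes.length === 0 || scopes.some((scope) => scope === connectionId);
  });
}
```

Applying the filter before ranking, as the commit message notes, means every search lane draws candidates from the already-scoped set.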
|
|
@ -1,6 +1,7 @@
|
||||||
import type { KtxProjectConnectionConfig } from './context/project/config.js';
|
import type { KtxProjectConnectionConfig } from './context/project/config.js';
|
||||||
|
|
||||||
const KTX_DATABASE_DRIVER_IDS = new Set([
|
/** @internal Canonical SQL-warehouse driver ids; the dialect-notes coverage test derives its required coverage from this set. */
|
||||||
|
export const KTX_DATABASE_DRIVER_IDS = [
|
||||||
'sqlite',
|
'sqlite',
|
||||||
'postgres',
|
'postgres',
|
||||||
'mysql',
|
'mysql',
|
||||||
|
|
@ -8,8 +9,11 @@ const KTX_DATABASE_DRIVER_IDS = new Set([
|
||||||
'sqlserver',
|
'sqlserver',
|
||||||
'bigquery',
|
'bigquery',
|
||||||
'snowflake',
|
'snowflake',
|
||||||
'mongodb',
|
] as const;
|
||||||
]);
|
|
||||||
|
// mongodb is a database driver but has no SQL dialect, so it sits outside the
|
||||||
|
// dialect-notes coverage set above.
|
||||||
|
const databaseDriverIds = new Set<string>([...KTX_DATABASE_DRIVER_IDS, 'mongodb']);
|
||||||
|
|
||||||
export function normalizeConnectionDriver(connection: KtxProjectConnectionConfig): string {
|
export function normalizeConnectionDriver(connection: KtxProjectConnectionConfig): string {
|
||||||
return String(connection.driver ?? '')
|
return String(connection.driver ?? '')
|
||||||
|
|
@ -18,5 +22,5 @@ export function normalizeConnectionDriver(connection: KtxProjectConnectionConfig
|
||||||
}
|
}
|
||||||
|
|
||||||
export function isDatabaseDriver(driver: string): boolean {
|
export function isDatabaseDriver(driver: string): boolean {
|
||||||
return KTX_DATABASE_DRIVER_IDS.has(driver.trim().toLowerCase());
|
return databaseDriverIds.has(driver.trim().toLowerCase());
|
||||||
}
|
}
|
||||||
|
|
|
||||||
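The change above splits one `Set` into an exported `as const` array (the SQL drivers) plus a private superset that also contains `mongodb`. A small sketch of why that shape is convenient is shown below. The list is a deliberately shortened, illustrative subset, and `isSqlDriver` is a hypothetical helper that does not appear in the diff.

```typescript
// Illustrative subset only; the real list in the diff is longer.
const SQL_DRIVER_IDS = ['sqlite', 'postgres', 'mysql'] as const;

// The const assertion lets a literal union type be derived from the array.
type SqlDriverId = (typeof SQL_DRIVER_IDS)[number];

// Superset used for membership checks: SQL drivers plus the non-SQL one.
const databaseDriverIds = new Set<string>([...SQL_DRIVER_IDS, 'mongodb']);

function isDatabaseDriver(driver: string): boolean {
  return databaseDriverIds.has(driver.trim().toLowerCase());
}

// Hypothetical type guard: true only for drivers that have a SQL dialect.
function isSqlDriver(driver: string): driver is SqlDriverId {
  return SQL_DRIVER_IDS.some((id) => id === driver);
}
```

A coverage test can then iterate `SQL_DRIVER_IDS` directly, while runtime checks keep accepting `mongodb` through the superset.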
|
|
@ -1,8 +1,14 @@
|
||||||
import { BigQuery, type TableField } from '@google-cloud/bigquery';
|
import { BigQuery, type TableField } from '@google-cloud/bigquery';
|
||||||
import { normalizeBigQueryProjectId, normalizeBigQueryRegion } from '../../context/connections/bigquery-identifiers.js';
|
import {
|
||||||
|
normalizeBigQueryDatasetId,
|
||||||
|
normalizeBigQueryProjectId,
|
||||||
|
normalizeBigQueryRegion,
|
||||||
|
} from '../../context/connections/bigquery-identifiers.js';
|
||||||
import { getSqlDialectForDriver } from '../../context/connections/dialects.js';
|
import { getSqlDialectForDriver } from '../../context/connections/dialects.js';
|
||||||
|
import { resolveQueryDeadlineMs, queryDeadlineExceededError } from '../../context/connections/query-deadline.js';
|
||||||
import { assertReadOnlySql, limitSqlForExecution } from '../../context/connections/read-only-sql.js';
|
import { assertReadOnlySql, limitSqlForExecution } from '../../context/connections/read-only-sql.js';
|
||||||
import { tryConstraintQuery } from '../../context/scan/constraint-discovery.js';
|
import { tryConstraintQuery } from '../../context/scan/constraint-discovery.js';
|
||||||
|
import { tryIntrospectObject } from '../../context/scan/object-introspection.js';
|
||||||
import { scopedTableNames } from '../../context/scan/table-ref.js';
|
import { scopedTableNames } from '../../context/scan/table-ref.js';
|
||||||
import {
|
import {
|
||||||
connectorTestFailure,
|
connectorTestFailure,
|
||||||
|
|
@ -35,14 +41,25 @@ export interface KtxBigQueryConnectionConfig {
|
||||||
credentials_json?: string;
|
credentials_json?: string;
|
||||||
location?: string;
|
location?: string;
|
||||||
max_bytes_billed?: number | string;
|
max_bytes_billed?: number | string;
|
||||||
job_timeout_ms?: number;
|
query_timeout_ms?: number;
|
||||||
[key: string]: unknown;
|
[key: string]: unknown;
|
||||||
}
|
}
|
||||||
|
|
||||||
|
/**
|
||||||
|
* A dataset to introspect, paired with the project that hosts it. `project`
|
||||||
|
* defaults to the billing project (`credentials.project_id`) when an entry has
|
||||||
|
* no `project.` prefix; a fully-qualified `project.dataset` entry resolves to
|
||||||
|
* its own host project. Jobs always bill in `credentials.project_id`.
|
||||||
|
*/
|
||||||
|
export interface BigQueryDatasetRef {
|
||||||
|
project: string;
|
||||||
|
dataset: string;
|
||||||
|
}
|
||||||
|
|
||||||
export interface KtxBigQueryResolvedConnectionConfig {
|
export interface KtxBigQueryResolvedConnectionConfig {
|
||||||
projectId: string;
|
projectId: string;
|
||||||
credentials: Record<string, unknown>;
|
credentials: Record<string, unknown>;
|
||||||
datasetIds: string[];
|
datasetIds: BigQueryDatasetRef[];
|
||||||
location?: string;
|
location?: string;
|
||||||
}
|
}
|
||||||
|
|
||||||
|
|
@ -95,7 +112,7 @@ export interface KtxBigQueryDataset {
|
||||||
|
|
||||||
export interface KtxBigQueryClient {
|
export interface KtxBigQueryClient {
|
||||||
getDatasets(input?: { maxResults?: number }): Promise<[Array<{ id?: string }>, ...unknown[]]>;
|
getDatasets(input?: { maxResults?: number }): Promise<[Array<{ id?: string }>, ...unknown[]]>;
|
||||||
dataset(datasetId: string): KtxBigQueryDataset;
|
dataset(datasetId: string, projectId: string): KtxBigQueryDataset;
|
||||||
createQueryJob(input: {
|
createQueryJob(input: {
|
||||||
query: string;
|
query: string;
|
||||||
location?: string;
|
location?: string;
|
||||||
|
|
@ -116,7 +133,6 @@ export interface KtxBigQueryScanConnectorOptions {
|
||||||
env?: NodeJS.ProcessEnv;
|
env?: NodeJS.ProcessEnv;
|
||||||
now?: () => Date;
|
now?: () => Date;
|
||||||
maxBytesBilled?: number | string;
|
maxBytesBilled?: number | string;
|
||||||
queryTimeoutMs?: number;
|
|
||||||
}
|
}
|
||||||
|
|
||||||
class DefaultBigQueryClientFactory implements KtxBigQueryClientFactory {
|
class DefaultBigQueryClientFactory implements KtxBigQueryClientFactory {
|
||||||
|
|
@ -124,8 +140,8 @@ class DefaultBigQueryClientFactory implements KtxBigQueryClientFactory {
|
||||||
const client = new BigQuery(input);
|
const client = new BigQuery(input);
|
||||||
return {
|
return {
|
||||||
getDatasets: (options) => client.getDatasets(options) as Promise<[Array<{ id?: string }>, ...unknown[]]>,
|
getDatasets: (options) => client.getDatasets(options) as Promise<[Array<{ id?: string }>, ...unknown[]]>,
|
||||||
dataset: (datasetId) => {
|
dataset: (datasetId, projectId) => {
|
||||||
const dataset = client.dataset(datasetId);
|
const dataset = client.dataset(datasetId, { projectId });
|
||||||
return {
|
return {
|
||||||
get: () => dataset.get() as Promise<unknown>,
|
get: () => dataset.get() as Promise<unknown>,
|
||||||
getTables: () => dataset.getTables() as Promise<[KtxBigQueryTableRef[], ...unknown[]]>,
|
getTables: () => dataset.getTables() as Promise<[KtxBigQueryTableRef[], ...unknown[]]>,
|
||||||
|
|
@ -145,14 +161,48 @@ function stringConfigValue(
|
||||||
return typeof value === 'string' && value.trim().length > 0 ? resolveStringReference(value.trim(), env) : undefined;
|
return typeof value === 'string' && value.trim().length > 0 ? resolveStringReference(value.trim(), env) : undefined;
|
||||||
}
|
}
|
||||||
|
|
||||||
function datasetIds(connection: KtxBigQueryConnectionConfig, env: NodeJS.ProcessEnv): string[] {
|
/**
|
||||||
if (Array.isArray(connection.dataset_ids) && connection.dataset_ids.length > 0) {
|
* Parse one `dataset_ids` / `dataset_id` entry into a canonical
|
||||||
return connection.dataset_ids
|
* {@link BigQueryDatasetRef}. A `project.dataset` prefix selects the host
|
||||||
.filter((dataset) => dataset.trim().length > 0)
|
* project; a bare entry defaults to `defaultProject` (the billing project).
|
||||||
.map((dataset) => resolveStringReference(dataset, env));
|
* More than one dot, or an empty segment, is a config error naming the
|
||||||
|
* connection — never a silent mis-introspection at scan time.
|
||||||
|
*/
|
||||||
|
function parseBigQueryDatasetEntry(entry: string, defaultProject: string, connectionId: string): BigQueryDatasetRef {
|
||||||
|
const context = `connections.${connectionId}.dataset_ids entry "${entry}"`;
|
||||||
|
const parts = entry.split('.');
|
||||||
|
if (parts.length === 1) {
|
||||||
|
return { project: defaultProject, dataset: normalizeBigQueryDatasetId(parts[0]!, context) };
|
||||||
}
|
}
|
||||||
const datasetId = stringConfigValue(connection, 'dataset_id', env);
|
if (parts.length === 2) {
|
||||||
return datasetId ? [datasetId] : [];
|
const [project, dataset] = parts;
|
||||||
|
if (!project || !dataset) {
|
||||||
|
throw new Error(`Invalid BigQuery dataset entry for ${context}: empty project or dataset segment`);
|
||||||
|
}
|
||||||
|
return {
|
||||||
|
project: normalizeBigQueryProjectId(project, context),
|
||||||
|
dataset: normalizeBigQueryDatasetId(dataset, context),
|
||||||
|
};
|
||||||
|
}
|
||||||
|
throw new Error(
|
||||||
|
`Invalid BigQuery dataset entry for ${context}: expected "dataset" or "project.dataset", got more than one "."`,
|
||||||
|
);
|
||||||
|
}
|
||||||
|
|
||||||
|
function resolveDatasetRefs(
|
||||||
|
connection: KtxBigQueryConnectionConfig,
|
||||||
|
env: NodeJS.ProcessEnv,
|
||||||
|
defaultProject: string,
|
||||||
|
connectionId: string,
|
||||||
|
): BigQueryDatasetRef[] {
|
||||||
|
const rawEntries =
|
||||||
|
Array.isArray(connection.dataset_ids) && connection.dataset_ids.length > 0
|
||||||
|
? connection.dataset_ids.map((dataset) => resolveStringReference(dataset, env))
|
||||||
|
: [stringConfigValue(connection, 'dataset_id', env)].filter((value): value is string => Boolean(value));
|
||||||
|
return rawEntries
|
||||||
|
.map((entry) => entry.trim())
|
||||||
|
.filter((entry) => entry.length > 0)
|
||||||
|
.map((entry) => parseBigQueryDatasetEntry(entry, defaultProject, connectionId));
|
||||||
}
|
}
|
||||||
|
|
||||||
function bigQueryMaxBytesBilledFromConnection(
|
function bigQueryMaxBytesBilledFromConnection(
|
||||||
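The entry-parsing rule introduced by `parseBigQueryDatasetEntry` (bare `dataset` defaults to the billing project, `project.dataset` selects its own host project, anything else is a config error) can be sketched without the repository's normalizers. The version below is an assumption-laden simplification: it skips `normalizeBigQueryProjectId` and `normalizeBigQueryDatasetId` entirely, and the names `DatasetRef` and `parseDatasetEntry` are hypothetical.

```typescript
interface DatasetRef {
  project: string;
  dataset: string;
}

// Simplified stand-in: no identifier normalization, only the dot-count rule.
function parseDatasetEntry(entry: string, defaultProject: string): DatasetRef {
  const parts = entry.trim().split('.');
  if (parts.length === 1) {
    const dataset = parts[0]!;
    if (dataset.length === 0) {
      throw new Error(`Invalid BigQuery dataset entry "${entry}": empty dataset`);
    }
    // Bare entry: hosted in the billing project.
    return { project: defaultProject, dataset };
  }
  if (parts.length === 2) {
    const project = parts[0]!;
    const dataset = parts[1]!;
    if (!project || !dataset) {
      throw new Error(`Invalid BigQuery dataset entry "${entry}": empty project or dataset segment`);
    }
    // Qualified entry: hosted in its own project.
    return { project, dataset };
  }
  throw new Error(`Invalid BigQuery dataset entry "${entry}": expected "dataset" or "project.dataset"`);
}
```

Failing at config-resolution time, as the doc comment in the diff puts it, avoids introspecting the wrong dataset silently at scan time.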
|
|
@ -169,12 +219,25 @@ function bigQueryMaxBytesBilledFromConnection(
|
||||||
return undefined;
|
return undefined;
|
||||||
}
|
}
|
||||||
|
|
||||||
function bigQueryJobTimeoutMsFromConnection(connection: KtxBigQueryConnectionConfig | undefined): number | undefined {
|
// jobTimeoutMs cancels the job with a "Job timed out" message (or a timeout
|
||||||
const value = connection?.job_timeout_ms;
|
// reason in the errors array) once the deadline elapses.
|
||||||
if (typeof value !== 'number') {
|
function isBigQueryTimeoutError(error: unknown): boolean {
|
||||||
return undefined;
|
if (!error || typeof error !== 'object') {
|
||||||
|
return false;
|
||||||
}
|
}
|
||||||
return Number.isInteger(value) && value > 0 ? value : undefined;
|
const topMessage = (error as { message?: unknown }).message;
|
||||||
|
if (typeof topMessage === 'string' && /timed out|timeout/i.test(topMessage)) {
|
||||||
|
return true;
|
||||||
|
}
|
||||||
|
const errors = (error as { errors?: unknown }).errors;
|
||||||
|
return (
|
||||||
|
Array.isArray(errors) &&
|
||||||
|
errors.some((entry) => {
|
||||||
|
const reason = (entry as { reason?: unknown })?.reason;
|
||||||
|
const message = (entry as { message?: unknown })?.message;
|
||||||
|
return reason === 'timeout' || (typeof message === 'string' && /timed out|timeout/i.test(message));
|
||||||
|
})
|
||||||
|
);
|
||||||
}
|
}
|
||||||
|
|
||||||
function tableKind(metadataType: string | undefined): KtxSchemaTable['kind'] {
|
function tableKind(metadataType: string | undefined): KtxSchemaTable['kind'] {
|
||||||
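The timeout check above inspects both the top-level message and the nested `errors` array. The sketch below reproduces that detection and pairs it with a hypothetical `mapQueryError` helper that converts a detected timeout into a deadline error. The real `queryDeadlineExceededError` lives in a module not shown in this diff, so its message format and shape here are assumptions.

```typescript
// Same detection logic as in the diff, restated as a standalone function.
function isTimeoutError(error: unknown): boolean {
  if (!error || typeof error !== 'object') {
    return false;
  }
  const topMessage = (error as { message?: unknown }).message;
  if (typeof topMessage === 'string' && /timed out|timeout/i.test(topMessage)) {
    return true;
  }
  const errors = (error as { errors?: unknown }).errors;
  return (
    Array.isArray(errors) &&
    errors.some((entry) => {
      const reason = (entry as { reason?: unknown })?.reason;
      const message = (entry as { message?: unknown })?.message;
      return reason === 'timeout' || (typeof message === 'string' && /timed out|timeout/i.test(message));
    })
  );
}

// Hypothetical stand-in for queryDeadlineExceededError; name and message are assumed.
function deadlineExceeded(deadlineMs: number, cause: unknown): Error {
  const error = new Error(`query exceeded the ${deadlineMs} ms deadline`);
  error.name = 'QueryDeadlineExceeded';
  return Object.assign(error, { cause });
}

// Timeouts become deadline errors; every other error passes through unchanged.
function mapQueryError(error: unknown, deadlineMs: number): unknown {
  return isTimeoutError(error) ? deadlineExceeded(deadlineMs, error) : error;
}
```

Keeping the original error as `cause` preserves the driver-level details for debugging while callers match on a single deadline error.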
|
|
@ -267,7 +330,7 @@ export function bigQueryConnectionConfigFromConfig(input: {
|
||||||
if (!projectId) {
|
if (!projectId) {
|
||||||
throw new Error(`Native BigQuery connector requires credentials_json.project_id for connections.${input.connectionId}`);
|
throw new Error(`Native BigQuery connector requires credentials_json.project_id for connections.${input.connectionId}`);
|
||||||
}
|
}
|
||||||
const resolvedDatasetIds = datasetIds(input.connection, env);
|
const resolvedDatasetIds = resolveDatasetRefs(input.connection, env, projectId, input.connectionId);
|
||||||
const location = stringConfigValue(input.connection, 'location', env);
|
const location = stringConfigValue(input.connection, 'location', env);
|
||||||
return { projectId, credentials, datasetIds: resolvedDatasetIds, ...(location ? { location } : {}) };
|
return { projectId, credentials, datasetIds: resolvedDatasetIds, ...(location ? { location } : {}) };
|
||||||
}
|
}
|
||||||
|
|
@ -290,7 +353,7 @@ export class KtxBigQueryScanConnector implements KtxScanConnector {
|
||||||
private readonly clientFactory: KtxBigQueryClientFactory;
|
private readonly clientFactory: KtxBigQueryClientFactory;
|
||||||
private readonly now: () => Date;
|
private readonly now: () => Date;
|
||||||
private readonly maxBytesBilled?: number | string;
|
private readonly maxBytesBilled?: number | string;
|
||||||
private readonly queryTimeoutMs?: number;
|
private readonly deadlineMs: number;
|
||||||
private readonly dialect = getSqlDialectForDriver('bigquery');
|
private readonly dialect = getSqlDialectForDriver('bigquery');
|
||||||
private client: KtxBigQueryClient | null = null;
|
private client: KtxBigQueryClient | null = null;
|
||||||
|
|
||||||
|
|
@ -304,7 +367,7 @@ export class KtxBigQueryScanConnector implements KtxScanConnector {
|
||||||
this.clientFactory = options.clientFactory ?? new DefaultBigQueryClientFactory();
|
this.clientFactory = options.clientFactory ?? new DefaultBigQueryClientFactory();
|
||||||
this.now = options.now ?? (() => new Date());
|
this.now = options.now ?? (() => new Date());
|
||||||
this.maxBytesBilled = options.maxBytesBilled ?? bigQueryMaxBytesBilledFromConnection(options.connection);
|
this.maxBytesBilled = options.maxBytesBilled ?? bigQueryMaxBytesBilledFromConnection(options.connection);
|
||||||
this.queryTimeoutMs = options.queryTimeoutMs ?? bigQueryJobTimeoutMsFromConnection(options.connection);
|
this.deadlineMs = resolveQueryDeadlineMs(options.connection);
|
||||||
this.id = `bigquery:${options.connectionId}`;
|
this.id = `bigquery:${options.connectionId}`;
|
||||||
}
|
}
|
||||||
|
|
||||||
|
|
@ -312,8 +375,8 @@ export class KtxBigQueryScanConnector implements KtxScanConnector {
|
||||||
try {
|
try {
|
||||||
const client = this.getClient();
|
const client = this.getClient();
|
||||||
await client.getDatasets({ maxResults: 1 });
|
await client.getDatasets({ maxResults: 1 });
|
||||||
for (const datasetId of this.resolved.datasetIds) {
|
for (const ref of this.resolved.datasetIds) {
|
||||||
await client.dataset(datasetId).get();
|
await client.dataset(ref.dataset, ref.project).get();
|
||||||
}
|
}
|
||||||
return { success: true };
|
return { success: true };
|
||||||
} catch (error) {
|
} catch (error) {
|
||||||
|
|
@ -324,22 +387,23 @@ export class KtxBigQueryScanConnector implements KtxScanConnector {
|
||||||
async introspect(input: KtxScanInput, _ctx: KtxScanContext): Promise<KtxSchemaSnapshot> {
|
async introspect(input: KtxScanInput, _ctx: KtxScanContext): Promise<KtxSchemaSnapshot> {
|
||||||
this.assertConnection(input.connectionId);
|
this.assertConnection(input.connectionId);
|
||||||
const tables: KtxSchemaTable[] = [];
|
const tables: KtxSchemaTable[] = [];
|
||||||
const datasetIds = this.requireDatasetIdsForScan();
|
const datasetRefs = this.requireDatasetIdsForScan();
|
||||||
const snapshotWarnings: KtxScanWarning[] = [];
|
const snapshotWarnings: KtxScanWarning[] = [];
|
||||||
for (const datasetId of datasetIds) {
|
for (const ref of datasetRefs) {
|
||||||
const scopedNames = input.tableScope
|
const scopedNames = input.tableScope
|
||||||
? scopedTableNames(input.tableScope, { catalog: this.resolved.projectId, db: datasetId })
|
? scopedTableNames(input.tableScope, { catalog: ref.project, db: ref.dataset })
|
||||||
: null;
|
: null;
|
||||||
tables.push(...(await this.introspectDataset(datasetId, scopedNames, snapshotWarnings)));
|
tables.push(...(await this.introspectDataset(ref, scopedNames, snapshotWarnings)));
|
||||||
}
|
}
|
||||||
|
const datasetLabels = datasetRefs.map((ref) => this.qualifiedDatasetLabel(ref));
|
||||||
return {
|
return {
|
||||||
connectionId: this.connectionId,
|
connectionId: this.connectionId,
|
||||||
driver: 'bigquery',
|
driver: 'bigquery',
|
||||||
extractedAt: this.now().toISOString(),
|
extractedAt: this.now().toISOString(),
|
||||||
scope: { catalogs: [this.resolved.projectId], datasets: datasetIds },
|
scope: { catalogs: [...new Set(datasetRefs.map((ref) => ref.project))], datasets: datasetLabels },
|
||||||
metadata: {
|
metadata: {
|
||||||
project_id: this.resolved.projectId,
|
project_id: this.resolved.projectId,
|
||||||
datasets: datasetIds,
|
datasets: datasetLabels,
|
||||||
table_count: tables.length,
|
table_count: tables.length,
|
||||||
total_columns: tables.reduce((sum, table) => sum + table.columns.length, 0),
|
total_columns: tables.reduce((sum, table) => sum + table.columns.length, 0),
|
||||||
},
|
},
|
||||||
|
|
@ -400,11 +464,14 @@ export class KtxBigQueryScanConnector implements KtxScanConnector {
|
||||||
return { values: valueRows.filter((row) => row.val !== null).map((row) => String(row.val)), cardinality };
|
return { values: valueRows.filter((row) => row.val !== null).map((row) => String(row.val)), cardinality };
|
||||||
}
|
}
|
||||||
|
|
||||||
async getTableRowCount(tableName: string, datasetId = this.resolved.datasetIds[0]): Promise<number> {
|
async getTableRowCount(
|
||||||
if (!datasetId) {
|
tableName: string,
|
||||||
|
ref: BigQueryDatasetRef | undefined = this.resolved.datasetIds[0],
|
||||||
|
): Promise<number> {
|
||||||
|
if (!ref) {
|
||||||
return 0;
|
return 0;
|
||||||
}
|
}
|
||||||
const tables = await this.introspectDataset(datasetId, null, []);
|
const tables = await this.introspectDataset(ref, null, []);
|
||||||
return tables.find((table) => table.name === tableName)?.estimatedRows ?? 0;
|
return tables.find((table) => table.name === tableName)?.estimatedRows ?? 0;
|
||||||
}
|
}
|
||||||
|
|
||||||
|
|
@ -422,12 +489,28 @@ export class KtxBigQueryScanConnector implements KtxScanConnector {
|
||||||
}
|
}
|
||||||
|
|
||||||
async listTables(datasetIds?: string[]): Promise<KtxTableListEntry[]> {
|
async listTables(datasetIds?: string[]): Promise<KtxTableListEntry[]> {
|
||||||
const projectId = normalizeBigQueryProjectId(this.resolved.projectId, 'table discovery');
|
|
||||||
const region = normalizeBigQueryRegion(this.resolved.location ?? 'US', 'table discovery');
|
const region = normalizeBigQueryRegion(this.resolved.location ?? 'US', 'table discovery');
|
||||||
|
if (!datasetIds || datasetIds.length === 0) {
|
||||||
|
return this.listTablesInProject(this.resolved.projectId, region);
|
||||||
|
}
|
||||||
|
const datasetsByProject = new Map<string, string[]>();
|
||||||
|
for (const entry of datasetIds) {
|
||||||
|
const ref = parseBigQueryDatasetEntry(entry.trim(), this.resolved.projectId, this.connectionId);
|
||||||
|
datasetsByProject.set(ref.project, [...(datasetsByProject.get(ref.project) ?? []), ref.dataset]);
|
||||||
|
}
|
||||||
|
const entries: KtxTableListEntry[] = [];
|
||||||
|
for (const [project, datasets] of datasetsByProject) {
|
||||||
|
entries.push(...(await this.listTablesInProject(project, region, datasets)));
|
||||||
|
}
|
||||||
|
return entries;
|
||||||
|
}
|
||||||
|
|
||||||
|
private async listTablesInProject(project: string, region: string, datasets?: string[]): Promise<KtxTableListEntry[]> {
|
||||||
|
const projectId = normalizeBigQueryProjectId(project, 'table discovery');
|
||||||
const params: Record<string, unknown> = {};
|
const params: Record<string, unknown> = {};
|
||||||
const filter = datasetIds && datasetIds.length > 0 ? 'AND table_schema IN UNNEST(@dataset_ids)' : '';
|
const filter = datasets && datasets.length > 0 ? 'AND table_schema IN UNNEST(@dataset_ids)' : '';
|
||||||
if (datasetIds && datasetIds.length > 0) {
|
if (datasets && datasets.length > 0) {
|
||||||
params.dataset_ids = datasetIds;
|
params.dataset_ids = datasets;
|
||||||
}
|
}
|
||||||
const rows = await this.queryRaw<{ table_schema: string; table_name: string; table_type: string }>(
|
const rows = await this.queryRaw<{ table_schema: string; table_name: string; table_type: string }>(
|
||||||
`
|
`
|
||||||
|
|
@ -442,7 +525,7 @@ export class KtxBigQueryScanConnector implements KtxScanConnector {
|
||||||
params,
|
params,
|
||||||
);
|
);
|
||||||
return rows.map((row) => ({
|
return rows.map((row) => ({
|
||||||
catalog: this.resolved.projectId,
|
catalog: project,
|
||||||
schema: row.table_schema,
|
schema: row.table_schema,
|
||||||
name: row.table_name,
|
name: row.table_name,
|
||||||
kind:
|
kind:
|
||||||
|
|
@ -466,34 +549,48 @@ export class KtxBigQueryScanConnector implements KtxScanConnector {
|
||||||
return this.client;
|
return this.client;
|
||||||
}
|
}
|
||||||
|
|
||||||
private requireDatasetIdsForScan(): string[] {
|
private requireDatasetIdsForScan(): BigQueryDatasetRef[] {
|
||||||
if (this.resolved.datasetIds.length === 0) {
|
if (this.resolved.datasetIds.length === 0) {
|
||||||
throw new Error(`Native BigQuery scan requires connections.${this.connectionId}.dataset_ids or dataset_id`);
|
throw new Error(`Native BigQuery scan requires connections.${this.connectionId}.dataset_ids or dataset_id`);
|
||||||
}
|
}
|
||||||
return this.resolved.datasetIds;
|
return this.resolved.datasetIds;
|
||||||
}
|
}
|
||||||
|
|
||||||
|
// Bare in the billing project, qualified `project.dataset` otherwise, so the
|
||||||
|
// snapshot's scope/metadata stay unambiguous when two projects host the same
|
||||||
|
// dataset name. The dotless form is the unchanged single-project label.
|
||||||
|
private qualifiedDatasetLabel(ref: BigQueryDatasetRef): string {
|
||||||
|
return ref.project === this.resolved.projectId ? ref.dataset : `${ref.project}.${ref.dataset}`;
|
||||||
|
}
|
||||||
|
|
||||||
private async query(sql: string, params?: Record<string, unknown>): Promise<KtxQueryResult> {
|
private async query(sql: string, params?: Record<string, unknown>): Promise<KtxQueryResult> {
|
||||||
const [job] = await this.getClient().createQueryJob({
|
try {
|
||||||
query: sql,
|
const [job] = await this.getClient().createQueryJob({
|
||||||
...(this.resolved.location ? { location: this.resolved.location } : {}),
|
query: sql,
|
||||||
...(params && Object.keys(params).length > 0 ? { params } : {}),
|
...(this.resolved.location ? { location: this.resolved.location } : {}),
|
||||||
...(this.maxBytesBilled ? { maximumBytesBilled: String(this.maxBytesBilled) } : {}),
|
...(params && Object.keys(params).length > 0 ? { params } : {}),
|
||||||
...(this.queryTimeoutMs ? { jobTimeoutMs: this.queryTimeoutMs } : {}),
|
...(this.maxBytesBilled ? { maximumBytesBilled: String(this.maxBytesBilled) } : {}),
|
||||||
});
|
jobTimeoutMs: this.deadlineMs,
|
||||||
const [rows, , response] = await job.getQueryResults();
|
});
|
||||||
let headers = response?.schema?.fields?.map((field) => field.name || '') ?? [];
|
const [rows, , response] = await job.getQueryResults();
|
||||||
const headerTypes = response?.schema?.fields?.map((field) => String(field.type || 'STRING')) ?? [];
|
let headers = response?.schema?.fields?.map((field) => field.name || '') ?? [];
|
||||||
if (headers.length === 0 && rows.length > 0) {
|
const headerTypes = response?.schema?.fields?.map((field) => String(field.type || 'STRING')) ?? [];
|
||||||
headers = Object.keys(rows[0]!);
|
if (headers.length === 0 && rows.length > 0) {
|
||||||
|
headers = Object.keys(rows[0]!);
|
||||||
|
}
|
||||||
|
return {
|
||||||
|
headers,
|
||||||
|
headerTypes: headerTypes.length > 0 ? headerTypes : undefined,
|
||||||
|
rows: rows.map((row) => headers.map((header) => normalizeValue(row[header]))),
|
||||||
|
totalRows: rows.length,
|
||||||
|
rowCount: rows.length,
|
||||||
|
};
|
||||||
|
} catch (error) {
|
||||||
|
if (isBigQueryTimeoutError(error)) {
|
||||||
|
throw queryDeadlineExceededError(this.deadlineMs, { cause: error });
|
||||||
|
}
|
||||||
|
throw error;
|
||||||
}
|
}
|
||||||
return {
|
|
||||||
headers,
|
|
||||||
headerTypes: headerTypes.length > 0 ? headerTypes : undefined,
|
|
||||||
rows: rows.map((row) => headers.map((header) => normalizeValue(row[header]))),
|
|
||||||
totalRows: rows.length,
|
|
||||||
rowCount: rows.length,
|
|
||||||
};
|
|
||||||
}
|
}
|
||||||
|
|
||||||
private async queryRaw<T extends Record<string, unknown>>(sql: string, params?: Record<string, unknown>): Promise<T[]> {
|
private async queryRaw<T extends Record<string, unknown>>(sql: string, params?: Record<string, unknown>): Promise<T[]> {
|
||||||
|
|
@ -507,18 +604,18 @@ export class KtxBigQueryScanConnector implements KtxScanConnector {
|
||||||
}
|
}
|
||||||
|
|
||||||
private async introspectDataset(
|
private async introspectDataset(
|
||||||
datasetId: string,
|
ref: BigQueryDatasetRef,
|
||||||
scopedNames: readonly string[] | null,
|
scopedNames: readonly string[] | null,
|
||||||
snapshotWarnings: KtxScanWarning[],
|
snapshotWarnings: KtxScanWarning[],
|
||||||
): Promise<KtxSchemaTable[]> {
|
): Promise<KtxSchemaTable[]> {
|
||||||
if (scopedNames && scopedNames.length === 0) return [];
|
if (scopedNames && scopedNames.length === 0) return [];
|
||||||
const dataset = this.getClient().dataset(datasetId);
|
const dataset = this.getClient().dataset(ref.dataset, ref.project);
|
||||||
const [tableRefs] = await dataset.getTables();
|
const [tableRefs] = await dataset.getTables();
|
||||||
const scopeSet = scopedNames ? new Set(scopedNames) : null;
|
const scopeSet = scopedNames ? new Set(scopedNames) : null;
|
||||||
const filteredTableRefs = scopeSet ? tableRefs.filter((tableRef) => scopeSet.has(tableRef.id ?? '')) : tableRefs;
|
const filteredTableRefs = scopeSet ? tableRefs.filter((tableRef) => scopeSet.has(tableRef.id ?? '')) : tableRefs;
|
||||||
const primaryKeysResult = await tryConstraintQuery(
|
const primaryKeysResult = await tryConstraintQuery(
|
||||||
{ schema: datasetId, kind: 'primary_key', isDeniedError },
|
{ schema: ref.dataset, kind: 'primary_key', isDeniedError },
|
||||||
() => this.primaryKeys(datasetId),
|
() => this.primaryKeys(ref),
|
||||||
);
|
);
|
||||||
const primaryKeys = primaryKeysResult.ok ? primaryKeysResult.value : new Map<string, Set<string>>();
|
const primaryKeys = primaryKeysResult.ok ? primaryKeysResult.value : new Map<string, Set<string>>();
|
||||||
if (!primaryKeysResult.ok) {
|
if (!primaryKeysResult.ok) {
|
||||||
|
|
@ -527,41 +624,51 @@ export class KtxBigQueryScanConnector implements KtxScanConnector {
|
||||||
const tables: KtxSchemaTable[] = [];
|
const tables: KtxSchemaTable[] = [];
|
||||||
for (const tableRef of filteredTableRefs) {
|
for (const tableRef of filteredTableRefs) {
|
||||||
const tableName = tableRef.id || '';
|
const tableName = tableRef.id || '';
|
||||||
const [table] = await tableRef.get();
|
const outcome = await tryIntrospectObject<KtxSchemaTable>(
|
||||||
const fields = table.metadata.schema?.fields ?? [];
|
{ object: tableName, catalog: ref.project, db: ref.dataset },
|
||||||
tables.push({
|
async () => {
|
||||||
catalog: this.resolved.projectId,
|
const [table] = await tableRef.get();
|
||||||
db: datasetId,
|
const fields = table.metadata.schema?.fields ?? [];
|
||||||
name: tableName,
|
return {
|
||||||
kind: tableKind(table.metadata.type),
|
catalog: ref.project,
|
||||||
comment: table.metadata.description || null,
|
db: ref.dataset,
|
||||||
estimatedRows: firstNumber(table.metadata.numRows) ?? 0,
|
name: tableName,
|
||||||
columns: fields.map((field) => this.toSchemaColumn(tableName, field, primaryKeys)),
|
kind: tableKind(table.metadata.type),
|
||||||
foreignKeys: [],
|
comment: table.metadata.description || null,
|
||||||
});
|
estimatedRows: firstNumber(table.metadata.numRows) ?? 0,
|
||||||
|
columns: fields.map((field) => this.toSchemaColumn(tableName, field, primaryKeys)),
|
||||||
|
foreignKeys: [],
|
||||||
|
};
|
||||||
|
},
|
||||||
|
);
|
||||||
|
if (outcome.ok) {
|
||||||
|
tables.push(outcome.table);
|
||||||
|
} else {
|
||||||
|
snapshotWarnings.push(outcome.warning);
|
||||||
|
}
|
||||||
}
|
}
|
||||||
return tables;
|
return tables;
|
||||||
}
|
}
|
||||||
|
|
||||||
private async primaryKeys(datasetId: string): Promise<Map<string, Set<string>>> {
|
private async primaryKeys(ref: BigQueryDatasetRef): Promise<Map<string, Set<string>>> {
|
||||||
const rows = await this.queryRaw<{ table_name: string; column_name: string }>(
|
const rows = await this.queryRaw<{ table_name: string; column_name: string }>(
|
||||||
'SELECT tc.table_name, kcu.column_name ' +
|
'SELECT tc.table_name, kcu.column_name ' +
|
||||||
'FROM `' +
|
'FROM `' +
|
||||||
this.resolved.projectId +
|
ref.project +
|
||||||
'.' +
|
'.' +
|
||||||
datasetId +
|
ref.dataset +
|
||||||
'.INFORMATION_SCHEMA.TABLE_CONSTRAINTS` tc ' +
|
'.INFORMATION_SCHEMA.TABLE_CONSTRAINTS` tc ' +
|
||||||
'JOIN `' +
|
'JOIN `' +
|
||||||
this.resolved.projectId +
|
ref.project +
|
||||||
'.' +
|
'.' +
|
||||||
datasetId +
|
ref.dataset +
|
||||||
'.INFORMATION_SCHEMA.KEY_COLUMN_USAGE` kcu ' +
|
'.INFORMATION_SCHEMA.KEY_COLUMN_USAGE` kcu ' +
|
||||||
'ON tc.constraint_name = kcu.constraint_name ' +
|
'ON tc.constraint_name = kcu.constraint_name ' +
|
||||||
'AND tc.table_schema = kcu.table_schema ' +
|
'AND tc.table_schema = kcu.table_schema ' +
|
||||||
'AND tc.table_name = kcu.table_name ' +
|
'AND tc.table_name = kcu.table_name ' +
|
||||||
"WHERE tc.constraint_type = 'PRIMARY KEY' " +
|
"WHERE tc.constraint_type = 'PRIMARY KEY' " +
|
||||||
"AND tc.table_schema = '" +
|
"AND tc.table_schema = '" +
|
||||||
datasetId +
|
ref.dataset +
|
||||||
"' " +
|
"' " +
|
||||||
"AND NOT REGEXP_CONTAINS(kcu.column_name, r'^(stacksync_record_id|sync_primary_key)_') " +
|
"AND NOT REGEXP_CONTAINS(kcu.column_name, r'^(stacksync_record_id|sync_primary_key)_') " +
|
||||||
'ORDER BY tc.table_name, kcu.ordinal_position',
|
'ORDER BY tc.table_name, kcu.ordinal_position',
|
||||||
|
|
|
||||||
|
|
@ -1,5 +1,6 @@
|
||||||
import { createClient } from '@clickhouse/client';
|
import { createClient } from '@clickhouse/client';
|
||||||
import { getSqlDialectForDriver } from '../../context/connections/dialects.js';
|
import { getSqlDialectForDriver } from '../../context/connections/dialects.js';
|
||||||
|
import { resolveQueryDeadlineMs, queryDeadlineExceededError } from '../../context/connections/query-deadline.js';
|
||||||
import { assertReadOnlySql, limitSqlForExecution } from '../../context/connections/read-only-sql.js';
|
import { assertReadOnlySql, limitSqlForExecution } from '../../context/connections/read-only-sql.js';
|
||||||
import { connectorTestFailure, createKtxConnectorCapabilities, type KtxConnectorTestResult, type KtxColumnSampleInput, type KtxColumnSampleResult, type KtxColumnStatsInput, type KtxColumnStatsResult, type KtxQueryResult, type KtxReadOnlyQueryInput, type KtxScanConnector, type KtxScanContext, type KtxScanInput, type KtxSchemaColumn, type KtxSchemaSnapshot, type KtxSchemaTable, type KtxTableRef, type KtxTableSampleInput, type KtxTableListEntry, type KtxTableSampleResult } from '../../context/scan/types.js';
|
import { connectorTestFailure, createKtxConnectorCapabilities, type KtxConnectorTestResult, type KtxColumnSampleInput, type KtxColumnSampleResult, type KtxColumnStatsInput, type KtxColumnStatsResult, type KtxQueryResult, type KtxReadOnlyQueryInput, type KtxScanConnector, type KtxScanContext, type KtxScanInput, type KtxSchemaColumn, type KtxSchemaSnapshot, type KtxSchemaTable, type KtxTableRef, type KtxTableSampleInput, type KtxTableListEntry, type KtxTableSampleResult } from '../../context/scan/types.js';
|
||||||
import { scopedTableNames } from '../../context/scan/table-ref.js';
|
import { scopedTableNames } from '../../context/scan/table-ref.js';
|
||||||
|
|
@ -144,6 +145,21 @@ function maybeNumber(value: unknown): number | undefined {
|
||||||
return typeof value === 'number' && Number.isFinite(value) ? value : undefined;
|
return typeof value === 'number' && Number.isFinite(value) ? value : undefined;
|
||||||
}
|
}
|
||||||
|
|
||||||
|
// ClickHouse error code 159 = TIMEOUT_EXCEEDED, raised when max_execution_time
|
||||||
|
// is hit. The client surfaces it via a numeric/string `code` or a "Code: 159"
|
||||||
|
// message prefix depending on transport.
|
||||||
|
function isClickHouseTimeoutError(error: unknown): boolean {
|
||||||
|
if (!error || typeof error !== 'object') {
|
||||||
|
return false;
|
||||||
|
}
|
||||||
|
const code = (error as { code?: unknown }).code;
|
||||||
|
if (code === 159 || code === '159') {
|
||||||
|
return true;
|
||||||
|
}
|
||||||
|
const message = (error as { message?: unknown }).message;
|
||||||
|
return typeof message === 'string' && (/\bCode:\s*159\b/.test(message) || message.includes('TIMEOUT_EXCEEDED'));
|
||||||
|
}
|
||||||
|
|
||||||
function parseClickHouseUrl(url: string): Partial<KtxClickHouseConnectionConfig> {
|
function parseClickHouseUrl(url: string): Partial<KtxClickHouseConnectionConfig> {
|
||||||
const parsed = new URL(url);
|
const parsed = new URL(url);
|
||||||
return {
|
return {
|
||||||
|
|
@ -284,6 +300,7 @@ export class KtxClickHouseScanConnector implements KtxScanConnector {
|
||||||
private readonly clientFactory: KtxClickHouseClientFactory;
|
private readonly clientFactory: KtxClickHouseClientFactory;
|
||||||
private readonly endpointResolver?: KtxClickHouseEndpointResolver;
|
private readonly endpointResolver?: KtxClickHouseEndpointResolver;
|
||||||
private readonly now: () => Date;
|
private readonly now: () => Date;
|
||||||
|
private readonly deadlineMs: number;
|
||||||
private readonly dialect = getSqlDialectForDriver('clickhouse');
|
private readonly dialect = getSqlDialectForDriver('clickhouse');
|
||||||
private client: KtxClickHouseClient | null = null;
|
private client: KtxClickHouseClient | null = null;
|
||||||
private resolvedEndpoint: KtxClickHouseResolvedEndpoint | null = null;
|
private resolvedEndpoint: KtxClickHouseResolvedEndpoint | null = null;
|
||||||
|
|
@ -299,6 +316,7 @@ export class KtxClickHouseScanConnector implements KtxScanConnector {
|
||||||
this.clientFactory = options.clientFactory ?? new DefaultClickHouseClientFactory();
|
this.clientFactory = options.clientFactory ?? new DefaultClickHouseClientFactory();
|
||||||
this.endpointResolver = options.endpointResolver;
|
this.endpointResolver = options.endpointResolver;
|
||||||
this.now = options.now ?? (() => new Date());
|
this.now = options.now ?? (() => new Date());
|
||||||
|
this.deadlineMs = resolveQueryDeadlineMs(this.connection);
|
||||||
this.id = `clickhouse:${options.connectionId}`;
|
this.id = `clickhouse:${options.connectionId}`;
|
||||||
}
|
}
|
||||||
|
|
||||||
|
|
@ -584,9 +602,13 @@ export class KtxClickHouseScanConnector implements KtxScanConnector {
|
||||||
username: config.username,
|
username: config.username,
|
||||||
password: config.password ?? '',
|
password: config.password ?? '',
|
||||||
database: config.database,
|
database: config.database,
|
||||||
request_timeout: 30_000,
|
// The server aborts at max_execution_time (seconds); request_timeout must
|
||||||
|
// outlast it so the HTTP client receives the code-159 error instead of
|
||||||
|
// giving up first and leaving the query running.
|
||||||
|
request_timeout: this.deadlineMs + 5_000,
|
||||||
clickhouse_settings: {
|
clickhouse_settings: {
|
||||||
output_format_json_quote_64bit_integers: 1,
|
output_format_json_quote_64bit_integers: 1,
|
||||||
|
max_execution_time: Math.ceil(this.deadlineMs / 1000),
|
||||||
},
|
},
|
||||||
...(isProxied && config.ssl
|
...(isProxied && config.ssl
|
||||||
? {
|
? {
|
||||||
|
|
@ -613,19 +635,26 @@ export class KtxClickHouseScanConnector implements KtxScanConnector {
|
||||||
|
|
||||||
private async query(sql: string, params?: Record<string, unknown>): Promise<Omit<KtxQueryResult, 'rowCount'>> {
|
private async query(sql: string, params?: Record<string, unknown>): Promise<Omit<KtxQueryResult, 'rowCount'>> {
|
||||||
const client = await this.clientForQuery();
|
const client = await this.clientForQuery();
|
||||||
const resultSet = await client.query({
|
try {
|
||||||
query: assertReadOnlySql(sql),
|
const resultSet = await client.query({
|
||||||
format: 'JSONCompact',
|
query: assertReadOnlySql(sql),
|
||||||
...(params ? { query_params: params } : {}),
|
format: 'JSONCompact',
|
||||||
});
|
...(params ? { query_params: params } : {}),
|
||||||
const response = (await resultSet.json()) as ClickHouseCompactResponse;
|
});
|
||||||
const meta = response.meta ?? [];
|
const response = (await resultSet.json()) as ClickHouseCompactResponse;
|
||||||
return {
|
const meta = response.meta ?? [];
|
||||||
headers: meta.map((field) => field.name),
|
return {
|
||||||
headerTypes: meta.map((field) => field.type),
|
headers: meta.map((field) => field.name),
|
||||||
rows: response.data ?? [],
|
headerTypes: meta.map((field) => field.type),
|
||||||
totalRows: response.rows ?? response.data?.length ?? 0,
|
rows: response.data ?? [],
|
||||||
};
|
totalRows: response.rows ?? response.data?.length ?? 0,
|
||||||
|
};
|
||||||
|
} catch (error) {
|
||||||
|
if (isClickHouseTimeoutError(error)) {
|
||||||
|
throw queryDeadlineExceededError(this.deadlineMs, { cause: error });
|
||||||
|
}
|
||||||
|
throw error;
|
||||||
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
private assertConnection(connectionId: string): void {
|
private assertConnection(connectionId: string): void {
|
||||||
|
|
|
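Review note on the ClickHouse hunk above: the two timeouts use different units (`max_execution_time` in seconds, `request_timeout` in milliseconds), so the rounding is worth spelling out. The sketch below is not part of the change. `clickHouseTimeouts` and `REQUEST_GRACE_MS` are hypothetical names used only to illustrate the arithmetic in the diff.

```typescript
// Hypothetical helper (not in the PR): derives both ClickHouse timeouts from
// a single deadline, mirroring the expressions used in the hunk above.
interface ClickHouseTimeouts {
  maxExecutionTimeSeconds: number; // server-side max_execution_time, in seconds
  requestTimeoutMs: number; // client-side request_timeout, in milliseconds
}

// Same 5 second margin the diff adds on top of the deadline.
const REQUEST_GRACE_MS = 5_000;

function clickHouseTimeouts(deadlineMs: number): ClickHouseTimeouts {
  return {
    // Round up so a sub-second remainder never shortens the server budget.
    maxExecutionTimeSeconds: Math.ceil(deadlineMs / 1000),
    // The HTTP client waits longer than the server so it can still receive
    // the code-159 error rather than abandoning the request first.
    requestTimeoutMs: deadlineMs + REQUEST_GRACE_MS,
  };
}

// A 1.5 s deadline rounds up to a 2 s server limit; the client waits 6.5 s.
console.log(clickHouseTimeouts(1_500));
```

Rounding up adds less than one second to the server limit, so the 5 second grace period keeps `request_timeout` above it for any positive deadline.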
||||||
|
|
@ -1,5 +1,6 @@
|
||||||
import mysql, { type FieldPacket, type Pool, type RowDataPacket } from 'mysql2/promise';
|
import mysql, { type FieldPacket, type Pool, type RowDataPacket } from 'mysql2/promise';
|
||||||
import { getSqlDialectForDriver } from '../../context/connections/dialects.js';
|
import { getSqlDialectForDriver } from '../../context/connections/dialects.js';
|
||||||
|
import { resolveQueryDeadlineMs, queryDeadlineExceededError } from '../../context/connections/query-deadline.js';
|
||||||
import { resolveStringReference } from '../shared/string-reference.js';
|
import { resolveStringReference } from '../shared/string-reference.js';
|
||||||
import { assertReadOnlySql, limitSqlForExecution } from '../../context/connections/read-only-sql.js';
|
import { assertReadOnlySql, limitSqlForExecution } from '../../context/connections/read-only-sql.js';
|
||||||
import {
|
import {
|
||||||
|
|
@ -282,6 +283,11 @@ function isDeniedError(error: unknown): boolean {
|
||||||
);
|
);
|
||||||
}
|
}
|
||||||
|
|
||||||
|
// errno 3024 = ER_QUERY_TIMEOUT, raised when max_execution_time is exceeded.
|
||||||
|
function isMysqlTimeoutError(error: unknown): boolean {
|
||||||
|
return Boolean(error) && typeof error === 'object' && (error as { errno?: unknown }).errno === 3024;
|
||||||
|
}
|
||||||
|
|
||||||
function pushConstraintWarnings(
|
function pushConstraintWarnings(
|
||||||
warnings: KtxScanWarning[],
|
warnings: KtxScanWarning[],
|
||||||
schemas: readonly string[],
|
schemas: readonly string[],
|
||||||
|
|
@ -391,6 +397,7 @@ export class KtxMysqlScanConnector implements KtxScanConnector {
|
||||||
private readonly poolFactory: KtxMysqlPoolFactory;
|
private readonly poolFactory: KtxMysqlPoolFactory;
|
||||||
private readonly endpointResolver?: KtxMysqlEndpointResolver;
|
private readonly endpointResolver?: KtxMysqlEndpointResolver;
|
||||||
private readonly now: () => Date;
|
private readonly now: () => Date;
|
||||||
|
private readonly deadlineMs: number;
|
||||||
private readonly dialect = getSqlDialectForDriver('mysql');
|
private readonly dialect = getSqlDialectForDriver('mysql');
|
||||||
private pool: KtxMysqlPool | null = null;
|
private pool: KtxMysqlPool | null = null;
|
||||||
private resolvedEndpoint: KtxMysqlResolvedEndpoint | null = null;
|
private resolvedEndpoint: KtxMysqlResolvedEndpoint | null = null;
|
||||||
|
|
@ -406,6 +413,7 @@ export class KtxMysqlScanConnector implements KtxScanConnector {
|
||||||
this.poolFactory = options.poolFactory ?? new DefaultMysqlPoolFactory();
|
this.poolFactory = options.poolFactory ?? new DefaultMysqlPoolFactory();
|
||||||
this.endpointResolver = options.endpointResolver;
|
this.endpointResolver = options.endpointResolver;
|
||||||
this.now = options.now ?? (() => new Date());
|
this.now = options.now ?? (() => new Date());
|
||||||
|
this.deadlineMs = resolveQueryDeadlineMs(this.connection);
|
||||||
this.id = `mysql:${options.connectionId}`;
|
this.id = `mysql:${options.connectionId}`;
|
||||||
}
|
}
|
||||||
|
|
||||||
|
|
@ -763,6 +771,9 @@ export class KtxMysqlScanConnector implements KtxScanConnector {
|
||||||
const pool = await this.poolForQuery();
|
const pool = await this.poolForQuery();
|
||||||
const connection = await pool.getConnection();
|
const connection = await pool.getConnection();
|
||||||
try {
|
try {
|
||||||
|
// max_execution_time (ms) bounds read-only SELECTs server-side; our path
|
||||||
|
// only runs SELECT/WITH, so the session setting always applies.
|
||||||
|
await connection.query('SET SESSION max_execution_time = ?', [this.deadlineMs]);
|
||||||
const [rows, fields] = await connection.query(assertReadOnlySql(sql), queryParams(params));
|
const [rows, fields] = await connection.query(assertReadOnlySql(sql), queryParams(params));
|
||||||
const headers = fields.map((field) => field.name);
|
const headers = fields.map((field) => field.name);
|
||||||
const headerTypes = fields.map((field) => String(field.type ?? 'unknown'));
|
const headerTypes = fields.map((field) => String(field.type ?? 'unknown'));
|
||||||
|
|
@ -772,6 +783,11 @@ export class KtxMysqlScanConnector implements KtxScanConnector {
|
||||||
rows: rows.map((row) => headers.map((header) => row[header])),
|
rows: rows.map((row) => headers.map((header) => row[header])),
|
||||||
totalRows: rows.length,
|
totalRows: rows.length,
|
||||||
};
|
};
|
||||||
|
} catch (error) {
|
||||||
|
if (isMysqlTimeoutError(error)) {
|
||||||
|
throw queryDeadlineExceededError(this.deadlineMs, { cause: error });
|
||||||
|
}
|
||||||
|
throw error;
|
||||||
} finally {
|
} finally {
|
||||||
connection.release();
|
connection.release();
|
||||||
}
|
}
|
||||||
|
|
|
||||||
|
|
@ -1,5 +1,6 @@
|
||||||
import { resolveStringReference } from '../shared/string-reference.js';
|
import { resolveStringReference } from '../shared/string-reference.js';
|
||||||
import { getSqlDialectForDriver } from '../../context/connections/dialects.js';
|
import { getSqlDialectForDriver } from '../../context/connections/dialects.js';
|
||||||
|
import { resolveQueryDeadlineMs, queryDeadlineExceededError } from '../../context/connections/query-deadline.js';
|
||||||
import { assertReadOnlySql, limitSqlForExecution } from '../../context/connections/read-only-sql.js';
|
import { assertReadOnlySql, limitSqlForExecution } from '../../context/connections/read-only-sql.js';
|
||||||
import { tryConstraintQuery } from '../../context/scan/constraint-discovery.js';
|
import { tryConstraintQuery } from '../../context/scan/constraint-discovery.js';
|
||||||
import { scopedTableNames } from '../../context/scan/table-ref.js';
|
import { scopedTableNames } from '../../context/scan/table-ref.js';
|
||||||
|
|
@ -260,6 +261,11 @@ function isDeniedError(error: unknown): boolean {
|
||||||
return code === '42501' || code === '42P01';
|
return code === '42501' || code === '42P01';
|
||||||
}
|
}
|
||||||
|
|
||||||
|
// 57014 = query_canceled, which is how statement_timeout surfaces.
|
||||||
|
function isPostgresTimeoutError(error: unknown): boolean {
|
||||||
|
return Boolean(error) && typeof error === 'object' && (error as { code?: unknown }).code === '57014';
|
||||||
|
}
|
||||||
|
|
||||||
function queryRows(result: KtxPostgresQueryResult): unknown[][] {
|
function queryRows(result: KtxPostgresQueryResult): unknown[][] {
|
||||||
const headers = (result.fields ?? []).map((field) => field.name);
|
const headers = (result.fields ?? []).map((field) => field.name);
|
||||||
return result.rows.map((row) => headers.map((header) => row[header]));
|
return result.rows.map((row) => headers.map((header) => row[header]));
|
||||||
|
|
@ -384,9 +390,13 @@ export function postgresPoolConfigFromConfig(input: {
|
||||||
: { host, port: numberValue(merged.port) ?? 5432, database, user, password }),
|
: { host, port: numberValue(merged.port) ?? 5432, database, user, password }),
|
||||||
};
|
};
|
||||||
const searchPathSchemas = searchPathSchemasFromConnection(merged);
|
const searchPathSchemas = searchPathSchemasFromConnection(merged);
|
||||||
|
// statement_timeout (ms) bounds every query on connections from this pool, so
|
||||||
|
// the server itself aborts a runaway query and frees the connection cleanly.
|
||||||
|
const serverOptions = [`-c statement_timeout=${resolveQueryDeadlineMs(merged)}`];
|
||||||
if (searchPathSchemas.length > 0) {
|
if (searchPathSchemas.length > 0) {
|
||||||
config.options = `-c search_path=${searchPathSchemas.join(',')}`;
|
serverOptions.unshift(`-c search_path=${searchPathSchemas.join(',')}`);
|
||||||
}
|
}
|
||||||
|
config.options = serverOptions.join(' ');
|
||||||
if (merged.ssl && sslmode !== 'prefer' && sslmode !== 'disable') {
|
if (merged.ssl && sslmode !== 'prefer' && sslmode !== 'disable') {
|
||||||
config.ssl = { rejectUnauthorized: merged.rejectUnauthorized ?? true };
|
config.ssl = { rejectUnauthorized: merged.rejectUnauthorized ?? true };
|
||||||
}
|
}
|
||||||
|
|
@ -412,6 +422,7 @@ export class KtxPostgresScanConnector implements KtxScanConnector {
|
||||||
private readonly poolFactory: KtxPostgresPoolFactory;
|
private readonly poolFactory: KtxPostgresPoolFactory;
|
||||||
private readonly endpointResolver?: KtxPostgresEndpointResolver;
|
private readonly endpointResolver?: KtxPostgresEndpointResolver;
|
||||||
private readonly now: () => Date;
|
private readonly now: () => Date;
|
||||||
|
private readonly deadlineMs: number;
|
||||||
private readonly dialect = getSqlDialectForDriver('postgres');
|
private readonly dialect = getSqlDialectForDriver('postgres');
|
||||||
private pool: KtxPostgresPool | null = null;
|
private pool: KtxPostgresPool | null = null;
|
||||||
private lastIdlePoolError: Error | null = null;
|
private lastIdlePoolError: Error | null = null;
|
||||||
|
|
@ -428,6 +439,7 @@ export class KtxPostgresScanConnector implements KtxScanConnector {
|
||||||
this.poolFactory = options.poolFactory ?? new DefaultPostgresPoolFactory();
|
this.poolFactory = options.poolFactory ?? new DefaultPostgresPoolFactory();
|
||||||
this.endpointResolver = options.endpointResolver;
|
this.endpointResolver = options.endpointResolver;
|
||||||
this.now = options.now ?? (() => new Date());
|
this.now = options.now ?? (() => new Date());
|
||||||
|
this.deadlineMs = resolveQueryDeadlineMs(this.connection);
|
||||||
this.id = `postgres:${options.connectionId}`;
|
this.id = `postgres:${options.connectionId}`;
|
||||||
}
|
}
|
||||||
|
|
||||||
|
|
@ -819,6 +831,11 @@ export class KtxPostgresScanConnector implements KtxScanConnector {
|
||||||
totalRows: result.rows.length,
|
totalRows: result.rows.length,
|
||||||
rowCount: result.rows.length,
|
rowCount: result.rows.length,
|
||||||
};
|
};
|
||||||
|
} catch (error) {
|
||||||
|
if (isPostgresTimeoutError(error)) {
|
||||||
|
throw queryDeadlineExceededError(this.deadlineMs, { cause: error });
|
||||||
|
}
|
||||||
|
throw error;
|
||||||
} finally {
|
} finally {
|
||||||
client.release();
|
client.release();
|
||||||
}
|
}
|
||||||
|
|
|
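Review note on the Postgres hunk above: `config.options` is now always set, and `search_path` is prepended only when schemas are configured. A minimal sketch of that composition follows. `postgresServerOptions` is a hypothetical standalone function written for illustration; the PR builds the string inline in `postgresPoolConfigFromConfig`.

```typescript
// Hypothetical extraction (not in the PR) of the option-string logic shown in
// the hunk above, so the resulting startup options can be inspected directly.
function postgresServerOptions(searchPathSchemas: readonly string[], deadlineMs: number): string {
  // statement_timeout is interpreted by the server in milliseconds.
  const serverOptions = [`-c statement_timeout=${deadlineMs}`];
  if (searchPathSchemas.length > 0) {
    // unshift keeps search_path first, matching the order in the diff.
    serverOptions.unshift(`-c search_path=${searchPathSchemas.join(',')}`);
  }
  return serverOptions.join(' ');
}

console.log(postgresServerOptions(['public', 'analytics'], 30_000));
console.log(postgresServerOptions([], 30_000));
```

With schemas configured the string carries both settings; without them it carries only `statement_timeout`, which is the behavioural change from the old code path where `options` was left unset.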
||||||
|
|
@ -1,5 +1,6 @@
|
||||||
import { createPrivateKey } from 'node:crypto';
|
import { createPrivateKey } from 'node:crypto';
|
||||||
import { getSqlDialectForDriver } from '../../context/connections/dialects.js';
|
import { getSqlDialectForDriver } from '../../context/connections/dialects.js';
|
||||||
|
import { resolveQueryDeadlineMs, queryDeadlineExceededError } from '../../context/connections/query-deadline.js';
|
||||||
import { resolveStringReference } from '../shared/string-reference.js';
|
import { resolveStringReference } from '../shared/string-reference.js';
|
||||||
import { assertReadOnlySql, limitSqlForExecution } from '../../context/connections/read-only-sql.js';
|
import { assertReadOnlySql, limitSqlForExecution } from '../../context/connections/read-only-sql.js';
|
||||||
import { tryConstraintQuery } from '../../context/scan/constraint-discovery.js';
|
import { tryConstraintQuery } from '../../context/scan/constraint-discovery.js';
|
||||||
|
|
@ -60,6 +61,7 @@ export interface KtxSnowflakeResolvedConnectionConfig {
|
||||||
passphrase?: string;
|
passphrase?: string;
|
||||||
role?: string;
|
role?: string;
|
||||||
maxConnections: number;
|
maxConnections: number;
|
||||||
|
deadlineMs: number;
|
||||||
}
|
}
|
||||||
|
|
||||||
export interface KtxSnowflakeRawColumnMetadata {
|
export interface KtxSnowflakeRawColumnMetadata {
|
||||||
|
|
@ -181,6 +183,22 @@ function isDeniedError(error: unknown): boolean {
|
||||||
return false;
|
return false;
|
||||||
}
|
}
|
||||||
|
|
||||||
|
// Snowflake cancels with code 604 and a "reached its statement ... timeout"
|
||||||
|
// message once STATEMENT_TIMEOUT_IN_SECONDS elapses.
|
||||||
|
function isSnowflakeTimeoutError(error: unknown): boolean {
|
||||||
|
if (!error || typeof error !== 'object') {
|
||||||
|
return false;
|
||||||
|
}
|
||||||
|
const code = (error as { code?: unknown }).code;
|
||||||
|
const message = (error as { message?: unknown }).message;
|
||||||
|
return (
|
||||||
|
code === 604 ||
|
||||||
|
code === '604' ||
|
||||||
|
code === '000604' ||
|
||||||
|
(typeof message === 'string' && /reached its (statement|warehouse) .*timeout/i.test(message))
|
||||||
|
);
|
||||||
|
}
|
||||||
|
|
||||||
function normalizeSnowflakeValue(value: unknown, columnType?: string): unknown {
|
function normalizeSnowflakeValue(value: unknown, columnType?: string): unknown {
|
||||||
if (columnType && DATE_TYPES.some((type) => columnType.toUpperCase().includes(type))) {
|
if (columnType && DATE_TYPES.some((type) => columnType.toUpperCase().includes(type))) {
|
||||||
if (typeof value === 'number') {
|
if (typeof value === 'number') {
|
||||||
|
|
@ -282,6 +300,7 @@ export function snowflakeConnectionConfigFromConfig(input: {
|
||||||
connectionId: input.connectionId,
|
connectionId: input.connectionId,
|
||||||
defaultValue: 4,
|
defaultValue: 4,
|
||||||
}),
|
}),
|
||||||
|
deadlineMs: resolveQueryDeadlineMs(input.connection),
|
||||||
};
|
};
|
||||||
const role = stringConfigValue(input.connection, 'role', env);
|
const role = stringConfigValue(input.connection, 'role', env);
|
||||||
if (role) {
|
if (role) {
|
||||||
|
|
@ -339,13 +358,23 @@ class SnowflakeSdkDriver implements KtxSnowflakeDriver {
|
||||||
|
|
||||||
async query(sql: string, params?: unknown): Promise<KtxQueryResult> {
|
async query(sql: string, params?: unknown): Promise<KtxQueryResult> {
|
||||||
const binds = Array.isArray(params) ? toSnowflakeBinds(params) : undefined;
|
const binds = Array.isArray(params) ? toSnowflakeBinds(params) : undefined;
|
||||||
|
const statementTimeoutSeconds = Math.ceil(this.resolved.deadlineMs / 1000);
|
||||||
try {
|
try {
|
||||||
const pool = await this.getPool();
|
const pool = await this.getPool();
|
||||||
const result = await pool.use(async (connection: snowflake.Connection) =>
|
const result = await pool.use(async (connection: snowflake.Connection) => {
|
||||||
this.executeSnowflakeQuery(connection, sql, binds),
|
// Bound the statement server-side; Snowflake cancels and frees the
|
||||||
);
|
// warehouse slot when STATEMENT_TIMEOUT_IN_SECONDS is reached.
|
||||||
|
await this.executeSnowflakeQuery(
|
||||||
|
connection,
|
||||||
|
`ALTER SESSION SET STATEMENT_TIMEOUT_IN_SECONDS = ${statementTimeoutSeconds}`,
|
||||||
|
);
|
||||||
|
return this.executeSnowflakeQuery(connection, sql, binds);
|
||||||
|
});
|
||||||
return { ...result, totalRows: result.rows.length, rowCount: result.rows.length };
|
return { ...result, totalRows: result.rows.length, rowCount: result.rows.length };
|
||||||
} catch (error) {
|
} catch (error) {
|
||||||
|
if (isSnowflakeTimeoutError(error)) {
|
||||||
|
throw queryDeadlineExceededError(this.resolved.deadlineMs, { cause: error });
|
||||||
|
}
|
||||||
const message = error instanceof Error ? error.message : String(error);
|
const message = error instanceof Error ? error.message : String(error);
|
||||||
if (/timeout/i.test(message) && /pool|acquire/i.test(message)) {
|
if (/timeout/i.test(message) && /pool|acquire/i.test(message)) {
|
||||||
throw new Error(
|
throw new Error(
|
||||||
|
|
|
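Review note on the Snowflake hunk above: the timeout check runs before the existing pool/acquire check, so it matters that the two do not overlap. The sketch below restates the predicate from the diff under a hypothetical name, `looksLikeSnowflakeTimeout`, and runs it against sample inputs. The sample message strings are assumptions made for illustration, not captured driver output.

```typescript
// Hypothetical restatement (not in the PR) of the predicate in the hunk above.
function looksLikeSnowflakeTimeout(error: unknown): boolean {
  if (!error || typeof error !== 'object') {
    return false;
  }
  const { code, message } = error as { code?: unknown; message?: unknown };
  return (
    code === 604 ||
    code === '604' ||
    code === '000604' ||
    (typeof message === 'string' && /reached its (statement|warehouse) .*timeout/i.test(message))
  );
}

// Assumed message shapes, for illustration only.
const statementTimeout = {
  message: 'Statement reached its statement or warehouse timeout of 2 second(s) and was canceled.',
};
const poolTimeout = new Error('Pool acquire timeout');

console.log(looksLikeSnowflakeTimeout({ code: 604 }));
console.log(looksLikeSnowflakeTimeout(statementTimeout));
console.log(looksLikeSnowflakeTimeout(poolTimeout));
```

Under these assumed inputs, a pool-acquire failure does not satisfy the predicate, so it would still fall through to the existing pool/acquire handling later in the `catch` block.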
||||||
|
|
@ -3,19 +3,44 @@ import { existsSync, readFileSync, statSync } from 'node:fs';
|
||||||
import { homedir } from 'node:os';
|
import { homedir } from 'node:os';
|
||||||
import { isAbsolute, resolve } from 'node:path';
|
import { isAbsolute, resolve } from 'node:path';
|
||||||
import { fileURLToPath } from 'node:url';
|
import { fileURLToPath } from 'node:url';
|
||||||
|
import { fork, type ChildProcess } from 'node:child_process';
|
||||||
import { getSqlDialectForDriver } from '../../context/connections/dialects.js';
|
import { getSqlDialectForDriver } from '../../context/connections/dialects.js';
|
||||||
|
import { resolveQueryDeadlineMs, queryDeadlineExceededError } from '../../context/connections/query-deadline.js';
|
||||||
import { assertReadOnlySql, limitSqlForExecution } from '../../context/connections/read-only-sql.js';
|
import { assertReadOnlySql, limitSqlForExecution } from '../../context/connections/read-only-sql.js';
|
||||||
import { normalizeQueryRows } from '../../context/connections/query-executor.js';
|
import { normalizeQueryRows } from '../../context/connections/query-executor.js';
|
||||||
import { connectorTestFailure, createKtxConnectorCapabilities, type KtxConnectorTestResult, type KtxColumnSampleInput, type KtxColumnSampleResult, type KtxColumnStatsInput, type KtxColumnStatsResult, type KtxQueryResult, type KtxReadOnlyQueryInput, type KtxScanConnector, type KtxScanContext, type KtxScanInput, type KtxSchemaForeignKey, type KtxSchemaSnapshot, type KtxSchemaTable, type KtxTableListEntry, type KtxTableRef, type KtxTableSampleInput, type KtxTableSampleResult } from '../../context/scan/types.js';
|
import { connectorTestFailure, createKtxConnectorCapabilities, type KtxConnectorTestResult, type KtxColumnSampleInput, type KtxColumnSampleResult, type KtxColumnStatsInput, type KtxColumnStatsResult, type KtxQueryResult, type KtxReadOnlyQueryInput, type KtxScanConnector, type KtxScanContext, type KtxScanInput, type KtxScanWarning, type KtxSchemaForeignKey, type KtxSchemaSnapshot, type KtxSchemaTable, type KtxTableListEntry, type KtxTableRef, type KtxTableSampleInput, type KtxTableSampleResult } from '../../context/scan/types.js';
|
||||||
import { scopedTableNames } from '../../context/scan/table-ref.js';
|
import { scopedTableNames } from '../../context/scan/table-ref.js';
|
||||||
|
import { tryIntrospectObject } from '../../context/scan/object-introspection.js';
|
||||||
|
|
||||||
export interface KtxSqliteConnectionConfig {
|
export interface KtxSqliteConnectionConfig {
|
||||||
driver?: string;
|
driver?: string;
|
||||||
path?: string;
|
path?: string;
|
||||||
url?: string;
|
url?: string;
|
||||||
|
query_timeout_ms?: number;
|
||||||
[key: string]: unknown;
|
[key: string]: unknown;
|
||||||
}
|
}
|
||||||
|
|
||||||
|
// In dist, connector.js and read-query-child.js are siblings; under vitest the
|
||||||
|
// compiled .js is absent and Node strips types from the .ts when forking it.
|
||||||
|
const readQueryChildUrl = existsSync(fileURLToPath(new URL('./read-query-child.js', import.meta.url)))
|
||||||
|
? new URL('./read-query-child.js', import.meta.url)
|
||||||
|
: new URL('./read-query-child.ts', import.meta.url);
|
||||||
|
|
||||||
|
/** @internal */
|
||||||
|
export function forkReadQueryChild(): ChildProcess {
|
||||||
|
// Empty execArgv so the child is a clean Node process (no inherited vitest /
|
||||||
|
// inspector flags); advanced serialization preserves BigInt/Buffer in rows.
|
||||||
|
return fork(readQueryChildUrl, {
|
||||||
|
execArgv: [],
|
||||||
|
serialization: 'advanced',
|
||||||
|
stdio: ['ignore', 'ignore', 'inherit', 'ipc'],
|
||||||
|
});
|
||||||
|
}
|
||||||
|
|
||||||
|
type ReadQueryChildMessage =
|
||||||
|
| { ok: true; headers: string[]; rows: unknown[]; totalRows: number }
|
||||||
|
| { ok: false; message: string };
|
||||||
|
|
||||||
/** @internal */
|
/** @internal */
|
||||||
export interface SqliteDatabasePathInput {
|
export interface SqliteDatabasePathInput {
|
||||||
connectionId: string;
|
connectionId: string;
|
||||||
|
|
@@ -25,6 +50,8 @@ export interface SqliteDatabasePathInput {
|
||||||
|
|
||||||
export interface KtxSqliteScanConnectorOptions extends SqliteDatabasePathInput {
|
export interface KtxSqliteScanConnectorOptions extends SqliteDatabasePathInput {
|
||||||
now?: () => Date;
|
now?: () => Date;
|
||||||
|
/** @internal Test seam: spawn the read-query child so tests can observe its lifecycle. */
|
||||||
|
spawnReadQueryChild?: () => ChildProcess;
|
||||||
}
|
}
|
||||||
|
|
||||||
export interface KtxSqliteReadOnlyQueryInput extends KtxReadOnlyQueryInput {
|
export interface KtxSqliteReadOnlyQueryInput extends KtxReadOnlyQueryInput {
|
||||||
|
|
@@ -133,6 +160,8 @@ export class KtxSqliteScanConnector implements KtxScanConnector {
|
||||||
private readonly connectionId: string;
|
private readonly connectionId: string;
|
||||||
private readonly dbPath: string;
|
private readonly dbPath: string;
|
||||||
private readonly now: () => Date;
|
private readonly now: () => Date;
|
||||||
|
private readonly deadlineMs: number;
|
||||||
|
private readonly spawnReadQueryChild: () => ChildProcess;
|
||||||
private readonly dialect = getSqlDialectForDriver('sqlite');
|
private readonly dialect = getSqlDialectForDriver('sqlite');
|
||||||
private db: Database.Database | null = null;
|
private db: Database.Database | null = null;
|
||||||
|
|
||||||
|
|
@@ -140,6 +169,8 @@ export class KtxSqliteScanConnector implements KtxScanConnector {
|
||||||
this.connectionId = options.connectionId;
|
this.connectionId = options.connectionId;
|
||||||
this.dbPath = sqliteDatabasePathFromConfig(options);
|
this.dbPath = sqliteDatabasePathFromConfig(options);
|
||||||
this.now = options.now ?? (() => new Date());
|
this.now = options.now ?? (() => new Date());
|
||||||
|
this.deadlineMs = resolveQueryDeadlineMs(options.connection);
|
||||||
|
this.spawnReadQueryChild = options.spawnReadQueryChild ?? forkReadQueryChild;
|
||||||
this.id = `sqlite:${options.connectionId}`;
|
this.id = `sqlite:${options.connectionId}`;
|
||||||
}
|
}
|
||||||
|
|
||||||
|
|
@@ -158,17 +189,27 @@ export class KtxSqliteScanConnector implements KtxScanConnector {
|
||||||
async introspect(input: KtxScanInput, _ctx: KtxScanContext): Promise<KtxSchemaSnapshot> {
|
async introspect(input: KtxScanInput, _ctx: KtxScanContext): Promise<KtxSchemaSnapshot> {
|
||||||
this.assertConnection(input.connectionId);
|
this.assertConnection(input.connectionId);
|
||||||
const database = this.database();
|
const database = this.database();
|
||||||
const scopedNames = input.tableScope ? scopedTableNames(input.tableScope, { catalog: null, db: null }) : null;
|
const allObjects = database
|
||||||
const scopeClause = scopedNames ? `AND name IN (${scopedNames.map(() => '?').join(', ')})` : '';
|
.prepare(
|
||||||
const rawTables =
|
`SELECT name, type FROM sqlite_master WHERE type IN ('table', 'view') AND name NOT LIKE 'sqlite_%' ORDER BY name`,
|
||||||
scopedNames && scopedNames.length === 0
|
)
|
||||||
? []
|
.all() as SqliteMasterRow[];
|
||||||
: (database
|
const scopedNames = input.tableScope
|
||||||
.prepare(
|
? new Set(scopedTableNames(input.tableScope, { catalog: null, db: null }))
|
||||||
`SELECT name, type FROM sqlite_master WHERE type IN ('table', 'view') AND name NOT LIKE 'sqlite_%' ${scopeClause} ORDER BY name`,
|
: null;
|
||||||
)
|
const selectedObjects = scopedNames ? allObjects.filter((object) => scopedNames.has(object.name)) : allObjects;
|
||||||
.all(...(scopedNames ?? [])) as SqliteMasterRow[]);
|
|
||||||
const tables = rawTables.map((table) => this.readTable(database, table));
|
const tables: KtxSchemaTable[] = [];
|
||||||
|
const warnings: KtxScanWarning[] = [];
|
||||||
|
for (const object of selectedObjects) {
|
||||||
|
const outcome = await tryIntrospectObject({ object: object.name }, () => this.readTable(database, object));
|
||||||
|
if (outcome.ok) {
|
||||||
|
tables.push(outcome.table);
|
||||||
|
} else {
|
||||||
|
warnings.push(outcome.warning);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
const fileStats = existsSync(this.dbPath) ? statSync(this.dbPath) : null;
|
const fileStats = existsSync(this.dbPath) ? statSync(this.dbPath) : null;
|
||||||
return {
|
return {
|
||||||
connectionId: this.connectionId,
|
connectionId: this.connectionId,
|
||||||
|
|
@@ -180,8 +221,12 @@ export class KtxSqliteScanConnector implements KtxScanConnector {
|
||||||
file_size: fileStats ? fileStats.size : 0,
|
file_size: fileStats ? fileStats.size : 0,
|
||||||
table_count: tables.length,
|
table_count: tables.length,
|
||||||
total_columns: tables.reduce((sum, table) => sum + table.columns.length, 0),
|
total_columns: tables.reduce((sum, table) => sum + table.columns.length, 0),
|
||||||
|
// Carries the full object inventory so a zero-match enabled_tables scope
|
||||||
|
// can report which objects were actually available.
|
||||||
|
...(scopedNames ? { discovered_object_names: allObjects.map((object) => object.name) } : {}),
|
||||||
},
|
},
|
||||||
tables,
|
tables,
|
||||||
|
...(warnings.length > 0 ? { warnings } : {}),
|
||||||
};
|
};
|
||||||
}
|
}
|
||||||
|
|
||||||
|
|
@@ -229,12 +274,81 @@ export class KtxSqliteScanConnector implements KtxScanConnector {
|
||||||
return null;
|
return null;
|
||||||
}
|
}
|
||||||
|
|
||||||
async executeReadOnly(input: KtxSqliteReadOnlyQueryInput, _ctx: KtxScanContext): Promise<KtxQueryResult> {
|
async executeReadOnly(input: KtxSqliteReadOnlyQueryInput, ctx: KtxScanContext): Promise<KtxQueryResult> {
|
||||||
this.assertConnection(input.connectionId);
|
this.assertConnection(input.connectionId);
|
||||||
const result = this.query(limitSqlForExecution(input.sql, input.maxRows), input.params);
|
// Validate and row-limit on the main thread so invalid SQL fails instantly
|
||||||
|
// without spawning a process and read-only enforcement stays at the boundary.
|
||||||
|
const sql = limitSqlForExecution(input.sql, input.maxRows);
|
||||||
|
const result = await this.runReadQueryOffProcess(sql, input.params, ctx.signal);
|
||||||
return { ...result, rowCount: result.rows.length };
|
return { ...result, rowCount: result.rows.length };
|
||||||
}
|
}
|
||||||
|
|
||||||
|
// The LLM-SQL path runs off the event loop in a short-lived child process so a
|
||||||
|
// pathological scan cannot freeze the MCP server, and the deadline is enforced
|
||||||
|
// by SIGKILL-ing that process. A synchronous better-sqlite3 scan never yields,
|
||||||
|
// so a worker-thread terminate cannot interrupt it — only the OS reclaiming the
|
||||||
|
// whole process frees the CPU. One short-lived process per call; killed on
|
||||||
|
// completion, deadline, or external abort.
|
||||||
|
private runReadQueryOffProcess(
|
||||||
|
sql: string,
|
||||||
|
params: Record<string, unknown> | unknown[] | undefined,
|
||||||
|
signal: AbortSignal | undefined,
|
||||||
|
): Promise<Omit<KtxQueryResult, 'rowCount'>> {
|
||||||
|
const deadlineMs = this.deadlineMs;
|
||||||
|
const dbPath = this.dbPath;
|
||||||
|
return new Promise((resolvePromise, rejectPromise) => {
|
||||||
|
const child = this.spawnReadQueryChild();
|
||||||
|
let settled = false;
|
||||||
|
const onDeadline = () => settle(() => rejectPromise(queryDeadlineExceededError(deadlineMs)));
|
||||||
|
const timer = setTimeout(onDeadline, deadlineMs);
|
||||||
|
function settle(finish: () => void): void {
|
||||||
|
if (settled) {
|
||||||
|
return;
|
||||||
|
}
|
||||||
|
settled = true;
|
||||||
|
clearTimeout(timer);
|
||||||
|
signal?.removeEventListener('abort', onDeadline);
|
||||||
|
if (child.exitCode === null && child.signalCode === null) {
|
||||||
|
child.kill('SIGKILL');
|
||||||
|
}
|
||||||
|
finish();
|
||||||
|
}
|
||||||
|
child.on('message', (message: ReadQueryChildMessage) => {
|
||||||
|
if (message.ok) {
|
||||||
|
settle(() =>
|
||||||
|
resolvePromise({
|
||||||
|
headers: message.headers,
|
||||||
|
rows: normalizeQueryRows(message.rows),
|
||||||
|
totalRows: message.totalRows,
|
||||||
|
}),
|
||||||
|
);
|
||||||
|
} else {
|
||||||
|
settle(() => rejectPromise(new Error(message.message)));
|
||||||
|
}
|
||||||
|
});
|
||||||
|
child.on('error', (error) => settle(() => rejectPromise(error)));
|
||||||
|
child.on('exit', (code, processSignal) => {
|
||||||
|
if (!settled) {
|
||||||
|
settle(() =>
|
||||||
|
rejectPromise(
|
||||||
|
new Error(`SQLite read process exited before returning a result (code ${code}, signal ${processSignal}).`),
|
||||||
|
),
|
||||||
|
);
|
||||||
|
}
|
||||||
|
});
|
||||||
|
if (signal?.aborted) {
|
||||||
|
onDeadline();
|
||||||
|
return;
|
||||||
|
}
|
||||||
|
signal?.addEventListener('abort', onDeadline, { once: true });
|
||||||
|
try {
|
||||||
|
child.send({ dbPath, sql, params });
|
||||||
|
} catch (error) {
|
||||||
|
settle(() => rejectPromise(error instanceof Error ? error : new Error(String(error))));
|
||||||
|
}
|
||||||
|
});
|
||||||
|
}
|
||||||
|
|
||||||
async getColumnDistinctValues(
|
async getColumnDistinctValues(
|
||||||
table: KtxTableRef,
|
table: KtxTableRef,
|
||||||
columnName: string,
|
columnName: string,
|
||||||
|
|
@@ -310,16 +424,7 @@ export class KtxSqliteScanConnector implements KtxScanConnector {
|
||||||
const foreignKeys = database
|
const foreignKeys = database
|
||||||
.prepare(`PRAGMA foreign_key_list(${this.dialect.quoteIdentifier(table.name)})`)
|
.prepare(`PRAGMA foreign_key_list(${this.dialect.quoteIdentifier(table.name)})`)
|
||||||
.all() as SqliteForeignKeyRow[];
|
.all() as SqliteForeignKeyRow[];
|
||||||
const estimatedRows =
|
const estimatedRows = table.type === 'table' ? this.readRowCount(database, table.name) : null;
|
||||||
table.type === 'table'
|
|
||||||
? Number(
|
|
||||||
(
|
|
||||||
database
|
|
||||||
.prepare(`SELECT COUNT(*) AS count FROM ${this.dialect.quoteIdentifier(table.name)}`)
|
|
||||||
.get() as { count: unknown }
|
|
||||||
).count,
|
|
||||||
)
|
|
||||||
: null;
|
|
||||||
return {
|
return {
|
||||||
catalog: null,
|
catalog: null,
|
||||||
db: null,
|
db: null,
|
||||||
|
|
@@ -340,6 +445,19 @@ export class KtxSqliteScanConnector implements KtxScanConnector {
|
||||||
};
|
};
|
||||||
}
|
}
|
||||||
|
|
||||||
|
// A row-count read is profiling, not structure: a failure here leaves the
|
||||||
|
// object's structure intact rather than skipping the whole object.
|
||||||
|
private readRowCount(database: Database.Database, name: string): number | null {
|
||||||
|
try {
|
||||||
|
const row = database.prepare(`SELECT COUNT(*) AS count FROM ${this.dialect.quoteIdentifier(name)}`).get() as {
|
||||||
|
count: unknown;
|
||||||
|
};
|
||||||
|
return Number(row.count);
|
||||||
|
} catch {
|
||||||
|
return null;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
private mapForeignKeys(rows: SqliteForeignKeyRow[]): KtxSchemaForeignKey[] {
|
private mapForeignKeys(rows: SqliteForeignKeyRow[]): KtxSchemaForeignKey[] {
|
||||||
return rows
|
return rows
|
||||||
.sort((a, b) => a.id - b.id || a.seq - b.seq)
|
.sort((a, b) => a.id - b.id || a.seq - b.seq)
|
||||||
|
|
|
||||||
40
packages/cli/src/connectors/sqlite/read-query-child.ts
Normal file
|
|
@@ -0,0 +1,40 @@
|
||||||
|
import Database from 'better-sqlite3';
|
||||||
|
|
||||||
|
// Runs on a forked child process (no bundler, no test transform), so it imports
|
||||||
|
// only better-sqlite3 and node builtins. The SQL is already read-only-validated
|
||||||
|
// and row-limited by the parent; this process just executes it and posts the
|
||||||
|
// structured-cloneable raw rows back over IPC. Its only cancellation mechanism
|
||||||
|
// is the parent sending SIGKILL: a synchronous better-sqlite3 scan never yields,
|
||||||
|
// so neither a worker-thread terminate nor any in-process timer can interrupt
|
||||||
|
// it — only the OS reclaiming the whole process can.
|
||||||
|
|
||||||
|
interface ReadQueryRequest {
|
||||||
|
dbPath: string;
|
||||||
|
sql: string;
|
||||||
|
params?: Record<string, unknown> | unknown[];
|
||||||
|
}
|
||||||
|
|
||||||
|
type ReadQueryResponse =
|
||||||
|
| { ok: true; headers: string[]; rows: unknown[]; totalRows: number }
|
||||||
|
| { ok: false; message: string };
|
||||||
|
|
||||||
|
process.once('message', (request: ReadQueryRequest) => {
|
||||||
|
let db: Database.Database | undefined;
|
||||||
|
let response: ReadQueryResponse;
|
||||||
|
try {
|
||||||
|
db = new Database(request.dbPath, { readonly: true, fileMustExist: true });
|
||||||
|
const statement = db.prepare(request.sql);
|
||||||
|
const rows = (request.params ? statement.all(request.params) : statement.all()) as unknown[];
|
||||||
|
response = {
|
||||||
|
ok: true,
|
||||||
|
headers: statement.columns().map((column) => column.name),
|
||||||
|
rows,
|
||||||
|
totalRows: rows.length,
|
||||||
|
};
|
||||||
|
} catch (error) {
|
||||||
|
response = { ok: false, message: error instanceof Error ? error.message : String(error) };
|
||||||
|
} finally {
|
||||||
|
db?.close();
|
||||||
|
}
|
||||||
|
process.send?.(response, () => process.exit(0));
|
||||||
|
});
|
||||||
|
|
@@ -1,5 +1,6 @@
|
||||||
import { assertReadOnlySql, hoistLeadingCte, stripTrailingSqlNoise } from '../../context/connections/read-only-sql.js';
|
import { assertReadOnlySql, hoistLeadingCte, stripTrailingSqlNoise } from '../../context/connections/read-only-sql.js';
|
||||||
import { getSqlDialectForDriver } from '../../context/connections/dialects.js';
|
import { getSqlDialectForDriver } from '../../context/connections/dialects.js';
|
||||||
|
import { resolveQueryDeadlineMs, queryDeadlineExceededError } from '../../context/connections/query-deadline.js';
|
||||||
import { tryConstraintQuery } from '../../context/scan/constraint-discovery.js';
|
import { tryConstraintQuery } from '../../context/scan/constraint-discovery.js';
|
||||||
import { scopedTableNames } from '../../context/scan/table-ref.js';
|
import { scopedTableNames } from '../../context/scan/table-ref.js';
|
||||||
import {
|
import {
|
||||||
|
|
@@ -50,6 +51,8 @@ export interface KtxSqlServerPoolConfig {
|
||||||
database: string;
|
database: string;
|
||||||
user: string;
|
user: string;
|
||||||
password?: string;
|
password?: string;
|
||||||
|
// ms; on expiry mssql sends a TDS attention that cancels the query server-side.
|
||||||
|
requestTimeout: number;
|
||||||
options: { encrypt: true; trustServerCertificate: boolean };
|
options: { encrypt: true; trustServerCertificate: boolean };
|
||||||
pool: { max: number; min: number; idleTimeoutMillis: number };
|
pool: { max: number; min: number; idleTimeoutMillis: number };
|
||||||
}
|
}
|
||||||
|
|
@@ -269,6 +272,11 @@ function isDeniedError(error: unknown): boolean {
|
||||||
return number === 229 || number === 230 || number === 297;
|
return number === 229 || number === 230 || number === 297;
|
||||||
}
|
}
|
||||||
|
|
||||||
|
// mssql raises a RequestError with code 'ETIMEOUT' once requestTimeout elapses.
|
||||||
|
function isSqlServerTimeoutError(error: unknown): boolean {
|
||||||
|
return Boolean(error) && typeof error === 'object' && (error as { code?: unknown }).code === 'ETIMEOUT';
|
||||||
|
}
|
||||||
|
|
||||||
function limitSqlForSqlServerExecution(sqlText: string, maxRows: number | undefined): string {
|
function limitSqlForSqlServerExecution(sqlText: string, maxRows: number | undefined): string {
|
||||||
const trimmed = stripTrailingSqlNoise(assertReadOnlySql(sqlText));
|
const trimmed = stripTrailingSqlNoise(assertReadOnlySql(sqlText));
|
||||||
if (!maxRows) {
|
if (!maxRows) {
|
||||||
|
|
@@ -328,6 +336,7 @@ export function sqlServerConnectionPoolConfigFromConfig(input: {
|
||||||
database,
|
database,
|
||||||
user,
|
user,
|
||||||
password: stringConfigValue(merged, 'password', env),
|
password: stringConfigValue(merged, 'password', env),
|
||||||
|
requestTimeout: resolveQueryDeadlineMs(merged),
|
||||||
options: { encrypt: true, trustServerCertificate: merged.trustServerCertificate ?? true },
|
options: { encrypt: true, trustServerCertificate: merged.trustServerCertificate ?? true },
|
||||||
pool: { max: maxConnections, min: 0, idleTimeoutMillis: 30000 },
|
pool: { max: maxConnections, min: 0, idleTimeoutMillis: 30000 },
|
||||||
};
|
};
|
||||||
|
|
@@ -353,6 +362,7 @@ export class KtxSqlServerScanConnector implements KtxScanConnector {
|
||||||
private readonly poolFactory: KtxSqlServerPoolFactory;
|
private readonly poolFactory: KtxSqlServerPoolFactory;
|
||||||
private readonly endpointResolver?: KtxSqlServerEndpointResolver;
|
private readonly endpointResolver?: KtxSqlServerEndpointResolver;
|
||||||
private readonly now: () => Date;
|
private readonly now: () => Date;
|
||||||
|
private readonly deadlineMs: number;
|
||||||
private readonly dialect = getSqlDialectForDriver('sqlserver');
|
private readonly dialect = getSqlDialectForDriver('sqlserver');
|
||||||
private pool: KtxSqlServerPool | null = null;
|
private pool: KtxSqlServerPool | null = null;
|
||||||
private resolvedEndpoint: KtxSqlServerResolvedEndpoint | null = null;
|
private resolvedEndpoint: KtxSqlServerResolvedEndpoint | null = null;
|
||||||
|
|
@@ -370,6 +380,7 @@ export class KtxSqlServerScanConnector implements KtxScanConnector {
|
||||||
this.poolFactory = options.poolFactory ?? new DefaultSqlServerPoolFactory();
|
this.poolFactory = options.poolFactory ?? new DefaultSqlServerPoolFactory();
|
||||||
this.endpointResolver = options.endpointResolver;
|
this.endpointResolver = options.endpointResolver;
|
||||||
this.now = options.now ?? (() => new Date());
|
this.now = options.now ?? (() => new Date());
|
||||||
|
this.deadlineMs = resolveQueryDeadlineMs(this.connection);
|
||||||
this.id = `sqlserver:${options.connectionId}`;
|
this.id = `sqlserver:${options.connectionId}`;
|
||||||
}
|
}
|
||||||
|
|
||||||
|
|
@@ -804,7 +815,15 @@ export class KtxSqlServerScanConnector implements KtxScanConnector {
|
||||||
request.input(key, value);
|
request.input(key, value);
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
const result = await request.query(assertReadOnlySql(query));
|
let result: KtxSqlServerQueryResult;
|
||||||
|
try {
|
||||||
|
result = await request.query(assertReadOnlySql(query));
|
||||||
|
} catch (error) {
|
||||||
|
if (isSqlServerTimeoutError(error)) {
|
||||||
|
throw queryDeadlineExceededError(this.deadlineMs, { cause: error });
|
||||||
|
}
|
||||||
|
throw error;
|
||||||
|
}
|
||||||
const recordset = result.recordset ?? [];
|
const recordset = result.recordset ?? [];
|
||||||
const columnMetadata = recordset.columns ?? {};
|
const columnMetadata = recordset.columns ?? {};
|
||||||
const metadataHeaders = Object.keys(columnMetadata);
|
const metadataHeaders = Object.keys(columnMetadata);
|
||||||
|
|
|
||||||
|
|
@@ -98,6 +98,7 @@ export interface ContextBuildArgs {
|
||||||
queryHistory?: Extract<KtxPublicIngestArgs, { command: 'run' }>['queryHistory'];
|
queryHistory?: Extract<KtxPublicIngestArgs, { command: 'run' }>['queryHistory'];
|
||||||
queryHistoryWindowDays?: number;
|
queryHistoryWindowDays?: number;
|
||||||
scanMode?: Extract<KtxPublicIngestArgs, { command: 'run' }>['scanMode'];
|
scanMode?: Extract<KtxPublicIngestArgs, { command: 'run' }>['scanMode'];
|
||||||
|
stages?: Extract<KtxPublicIngestArgs, { command: 'run' }>['stages'];
|
||||||
detectRelationships?: boolean;
|
detectRelationships?: boolean;
|
||||||
cliVersion?: string;
|
cliVersion?: string;
|
||||||
runtimeInstallPolicy?: KtxManagedPythonInstallPolicy;
|
runtimeInstallPolicy?: KtxManagedPythonInstallPolicy;
|
||||||
|
|
@@ -990,6 +991,7 @@ export async function runContextBuild(
|
||||||
...(args.queryHistory ? { queryHistory: args.queryHistory } : {}),
|
...(args.queryHistory ? { queryHistory: args.queryHistory } : {}),
|
||||||
...(args.queryHistoryWindowDays !== undefined ? { queryHistoryWindowDays: args.queryHistoryWindowDays } : {}),
|
...(args.queryHistoryWindowDays !== undefined ? { queryHistoryWindowDays: args.queryHistoryWindowDays } : {}),
|
||||||
...(args.scanMode ? { scanMode: args.scanMode } : {}),
|
...(args.scanMode ? { scanMode: args.scanMode } : {}),
|
||||||
|
...(args.stages ? { stages: args.stages } : {}),
|
||||||
...(args.detectRelationships !== undefined ? { detectRelationships: args.detectRelationships } : {}),
|
...(args.detectRelationships !== undefined ? { detectRelationships: args.detectRelationships } : {}),
|
||||||
...(args.cliVersion ? { cliVersion: args.cliVersion } : {}),
|
...(args.cliVersion ? { cliVersion: args.cliVersion } : {}),
|
||||||
...(args.runtimeInstallPolicy ? { runtimeInstallPolicy: args.runtimeInstallPolicy } : {}),
|
...(args.runtimeInstallPolicy ? { runtimeInstallPolicy: args.runtimeInstallPolicy } : {}),
|
||||||
|
|
|
||||||
|
|
@@ -1,4 +1,5 @@
|
||||||
const BIGQUERY_PROJECT_ID_PATTERN = /^[A-Za-z0-9_-]+$/;
|
const BIGQUERY_PROJECT_ID_PATTERN = /^[A-Za-z0-9_-]+$/;
|
||||||
|
const BIGQUERY_DATASET_ID_PATTERN = /^[A-Za-z0-9_]+$/;
|
||||||
const BIGQUERY_REGION_PATTERN = /^[a-z0-9-]+$/;
|
const BIGQUERY_REGION_PATTERN = /^[a-z0-9-]+$/;
|
||||||
|
|
||||||
export function normalizeBigQueryProjectId(value: string, context = 'historic-SQL ingest'): string {
|
export function normalizeBigQueryProjectId(value: string, context = 'historic-SQL ingest'): string {
|
||||||
|
|
@@ -8,6 +9,13 @@ export function normalizeBigQueryProjectId(value: string, context = 'historic-SQ
|
||||||
return value;
|
return value;
|
||||||
}
|
}
|
||||||
|
|
||||||
|
export function normalizeBigQueryDatasetId(value: string, context = 'historic-SQL ingest'): string {
|
||||||
|
if (!BIGQUERY_DATASET_ID_PATTERN.test(value)) {
|
||||||
|
throw new Error(`Invalid BigQuery dataset id for ${context}: ${value}`);
|
||||||
|
}
|
||||||
|
return value;
|
||||||
|
}
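
A small sketch of what the new dataset-id check accepts and rejects. The two patterns are copied from the diff; `checkDatasetId` is a hypothetical local stand-in for `normalizeBigQueryDatasetId` so the snippet runs on its own, and the sample ids are invented for illustration.

```typescript
// Patterns copied from the diff: dataset ids allow letters, digits and
// underscores only, while project ids additionally allow hyphens.
const DATASET_ID = /^[A-Za-z0-9_]+$/;
const PROJECT_ID = /^[A-Za-z0-9_-]+$/;

// Hypothetical stand-in for normalizeBigQueryDatasetId: validates and returns
// the value unchanged (no trimming or case folding is applied).
function checkDatasetId(value: string, context = 'historic-SQL ingest'): string {
  if (!DATASET_ID.test(value)) {
    throw new Error(`Invalid BigQuery dataset id for ${context}: ${value}`);
  }
  return value;
}

const accepted = checkDatasetId('analytics_2024');

// A hyphen is valid in a project id but not in a dataset id.
const hyphenOkForProject = PROJECT_ID.test('my-project');
let hyphenRejectedForDataset = false;
try {
  checkDatasetId('my-dataset');
} catch {
  hyphenRejectedForDataset = true;
}
```

Because the function validates rather than rewrites, a value with surrounding whitespace would be rejected, unlike the region normalizer, which trims and lowercases first.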
|
||||||
|
|
||||||
export function normalizeBigQueryRegion(value: string, context = 'historic-SQL ingest'): string {
|
export function normalizeBigQueryRegion(value: string, context = 'historic-SQL ingest'): string {
|
||||||
const normalized = value.trim().toLowerCase().replace(/^region-/, '');
|
const normalized = value.trim().toLowerCase().replace(/^region-/, '');
|
||||||
if (!BIGQUERY_REGION_PATTERN.test(normalized)) {
|
if (!BIGQUERY_REGION_PATTERN.test(normalized)) {
|
||||||
|
|
|
||||||
|
|
@@ -0,0 +1,24 @@
|
||||||
|
import type { KtxProjectConnectionConfig } from '../project/config.js';
|
||||||
|
|
||||||
|
function listConfiguredConnectionIds(connections: Record<string, KtxProjectConnectionConfig>): string[] {
|
||||||
|
return Object.keys(connections).sort();
|
||||||
|
}
|
||||||
|
|
||||||
|
/**
|
||||||
|
* Validate a connection id supplied as an explicit command/tool argument against
|
||||||
|
* the canonical `ktx.yaml` connections map. Returns the id when configured;
|
||||||
|
* otherwise throws an error that lists the configured ids so the caller can fix
|
||||||
|
* the typo. Use for explicit arguments only — persisted page frontmatter that
|
||||||
|
* references a since-removed connection must warn, not fail.
|
||||||
|
*/
|
||||||
|
export function assertConfiguredConnectionId(
|
||||||
|
connections: Record<string, KtxProjectConnectionConfig>,
|
||||||
|
connectionId: string,
|
||||||
|
): string {
|
||||||
|
if (Object.hasOwn(connections, connectionId)) {
|
||||||
|
return connectionId;
|
||||||
|
}
|
||||||
|
const ids = listConfiguredConnectionIds(connections);
|
||||||
|
const configured = ids.length > 0 ? ids.join(', ') : '(none configured)';
|
||||||
|
throw new Error(`Unknown connection "${connectionId}". Configured connections: ${configured}.`);
|
||||||
|
}
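
A usage sketch for the validator above, assuming a simplified connections map. `assertKnownConnection` is a hypothetical local restatement (it uses `Object.prototype.hasOwnProperty.call` in place of `Object.hasOwn` so it compiles on older targets), and the connection ids are invented.

```typescript
// Hypothetical restatement of assertConfiguredConnectionId over a loosely
// typed map so the snippet is self-contained.
function assertKnownConnection(connections: Record<string, unknown>, connectionId: string): string {
  // Own-property check: inherited keys such as "toString" must not pass.
  if (Object.prototype.hasOwnProperty.call(connections, connectionId)) {
    return connectionId;
  }
  const ids = Object.keys(connections).sort();
  const configured = ids.length > 0 ? ids.join(', ') : '(none configured)';
  throw new Error(`Unknown connection "${connectionId}". Configured connections: ${configured}.`);
}

// Invented example config with two connections.
const connections = { warehouse: { driver: 'sqlite' }, analytics: { driver: 'sqlserver' } };

const ok = assertKnownConnection(connections, 'warehouse');

let typoMessage = '';
try {
  assertKnownConnection(connections, 'warehose');
} catch (error) {
  typoMessage = error instanceof Error ? error.message : String(error);
}
```

The sorted id list in the error is what lets a caller spot a typo quickly; an `in` check would wrongly accept inherited names like `toString`.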
|
||||||
45
packages/cli/src/context/connections/query-deadline.ts
Normal file
|
|
@@ -0,0 +1,45 @@
|
||||||
|
import { KtxQueryError } from '../../errors.js';
|
||||||
|
|
||||||
|
/**
|
||||||
|
* Canonical default bound on read-query execution time. Generous headroom over
|
||||||
|
* any indexed aggregate or normal profiling probe; a pathological nested-loop
|
||||||
|
* scan blows past it immediately. Overridable per-connection via
|
||||||
|
* `query_timeout_ms`. Production reads it through {@link resolveQueryDeadlineMs};
|
||||||
|
* exported for the resolver's own unit tests.
|
||||||
|
* @internal
|
||||||
|
*/
|
||||||
|
export const DEFAULT_QUERY_TIMEOUT_MS = 30_000;
|
||||||
|
|
||||||
|
interface QueryTimeoutConnectionConfig {
|
||||||
|
query_timeout_ms?: unknown;
|
||||||
|
[key: string]: unknown;
|
||||||
|
}
|
||||||
|
|
||||||
|
/**
|
||||||
|
* Single source of truth for the read-query deadline: the per-connection
|
||||||
|
* `query_timeout_ms` override (milliseconds) when present, else the default.
|
||||||
|
* Every connector resolves through here so the default and override precedence
|
||||||
|
* live in exactly one place. A malformed override (zero, negative, non-integer,
|
||||||
|
* non-number) is a config error — surfaced here even though `ktx.yaml`
|
||||||
|
* validation also rejects it, so programmatically-built connectors cannot
|
||||||
|
* silently run unbounded.
|
||||||
|
*/
|
||||||
|
export function resolveQueryDeadlineMs(connection: QueryTimeoutConnectionConfig | undefined): number {
|
||||||
|
const raw = connection?.query_timeout_ms;
|
||||||
|
if (raw === undefined || raw === null) {
|
||||||
|
return DEFAULT_QUERY_TIMEOUT_MS;
|
||||||
|
}
|
||||||
|
if (typeof raw !== 'number' || !Number.isInteger(raw) || raw <= 0) {
|
||||||
|
throw new Error(`query_timeout_ms must be a positive integer in milliseconds, received ${JSON.stringify(raw)}.`);
|
||||||
|
}
|
||||||
|
return raw;
|
||||||
|
}
|
||||||
|
|
||||||
|
/**
|
||||||
|
* The canonical, driver-independent timeout error an agent sees regardless of
|
||||||
|
* which connector enforced the deadline. Reads in whole seconds. Remote
|
||||||
|
* connectors pass the driver's own timeout error as `cause`.
|
||||||
|
*/
|
||||||
|
export function queryDeadlineExceededError(deadlineMs: number, options?: ErrorOptions): KtxQueryError {
|
||||||
|
return new KtxQueryError(`query exceeded ${Math.round(deadlineMs / 1000)}s`, options);
|
||||||
|
}
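
A sketch of how the resolver and the canonical message behave for a few inputs. `resolveDeadline` and `deadlineMessage` are hypothetical local restatements of `resolveQueryDeadlineMs` and of the message built by `queryDeadlineExceededError` (which wraps it in a `KtxQueryError`); the sample overrides are invented.

```typescript
// Mirrors DEFAULT_QUERY_TIMEOUT_MS from the diff.
const DEFAULT_MS = 30_000;

// Hypothetical restatement of resolveQueryDeadlineMs.
function resolveDeadline(connection?: { query_timeout_ms?: unknown }): number {
  const raw = connection?.query_timeout_ms;
  if (raw === undefined || raw === null) {
    return DEFAULT_MS; // no override: fall back to the default
  }
  if (typeof raw !== 'number' || !Number.isInteger(raw) || raw <= 0) {
    throw new Error(`query_timeout_ms must be a positive integer in milliseconds, received ${JSON.stringify(raw)}.`);
  }
  return raw;
}

// Hypothetical restatement of the message text only.
function deadlineMessage(deadlineMs: number): string {
  return `query exceeded ${Math.round(deadlineMs / 1000)}s`;
}

const fallback = resolveDeadline(undefined);
const override = resolveDeadline({ query_timeout_ms: 5_000 });

// A string override is malformed even if it looks numeric.
let stringRejected = false;
try {
  resolveDeadline({ query_timeout_ms: '5000' });
} catch {
  stringRejected = true;
}

const defaultMessage = deadlineMessage(fallback);
// Sub-second deadlines round to whole seconds, so 400 ms reads as "0s".
const subSecondMessage = deadlineMessage(400);
```

One thing the sketch surfaces: since the message is rendered in whole seconds, a valid override below 500 ms would be reported as `query exceeded 0s`.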
|
||||||
|
|
@@ -3,8 +3,9 @@ import { request as httpRequest } from 'node:http';
|
||||||
import { request as httpsRequest } from 'node:https';
|
import { request as httpsRequest } from 'node:https';
|
||||||
import { URL } from 'node:url';
|
import { URL } from 'node:url';
|
||||||
import type { KtxProjectConnectionConfig } from '../../../project/config.js';
|
import type { KtxProjectConnectionConfig } from '../../../project/config.js';
|
||||||
|
import { isKtxScanWarningCode } from '../../../scan/local-structural-artifacts.js';
|
||||||
import { tableRefFromKey } from '../../../scan/table-ref.js';
|
import { tableRefFromKey } from '../../../scan/table-ref.js';
|
||||||
import type { KtxSchemaColumn, KtxSchemaForeignKey, KtxSchemaSnapshot, KtxSchemaTable } from '../../../scan/types.js';
|
import type { KtxScanWarning, KtxSchemaColumn, KtxSchemaForeignKey, KtxSchemaSnapshot, KtxSchemaTable } from '../../../scan/types.js';
|
||||||
import { inferKtxDimensionType, normalizeKtxNativeType } from '../../../scan/type-normalization.js';
|
import { inferKtxDimensionType, normalizeKtxNativeType } from '../../../scan/type-normalization.js';
|
||||||
import type { LiveDatabaseIntrospectionOptions, LiveDatabaseIntrospectionPort } from './types.js';
|
import type { LiveDatabaseIntrospectionOptions, LiveDatabaseIntrospectionPort } from './types.js';
|
||||||
|
|
||||||
|
|
@@ -206,10 +207,32 @@ function mapTable(raw: Record<string, unknown>): KtxSchemaTable {
|
||||||
};
|
};
|
||||||
}
|
}
|
||||||
|
|
||||||
|
function mapWarning(raw: Record<string, unknown>): KtxScanWarning | null {
|
||||||
|
const code = optionalString(raw.code);
|
||||||
|
// Drop codes Node cannot render, keeping the daemon and Node warning catalogs
|
||||||
|
// in parity rather than surfacing an unknown code downstream.
|
||||||
|
if (!code || !isKtxScanWarningCode(code)) return null;
|
||||||
|
const table = optionalString(raw.table);
|
||||||
|
const column = optionalString(raw.column);
|
||||||
|
return {
|
||||||
|
code,
|
||||||
|
message: requiredString(raw.message, 'warnings[].message'),
|
||||||
|
recoverable: raw.recoverable !== false,
|
||||||
|
...(table ? { table } : {}),
|
||||||
|
...(column ? { column } : {}),
|
||||||
|
...(raw.metadata && typeof raw.metadata === 'object' && !Array.isArray(raw.metadata)
|
||||||
|
? { metadata: recordValue(raw.metadata) }
|
||||||
|
: {}),
|
||||||
|
};
|
||||||
|
}
|
||||||
|
|
||||||
function mapDaemonSnapshot(
|
function mapDaemonSnapshot(
|
||||||
raw: Record<string, unknown>,
|
raw: Record<string, unknown>,
|
||||||
input: { connectionId: string; extractedAt: string; schemas: string[] },
|
input: { connectionId: string; extractedAt: string; schemas: string[] },
|
||||||
): KtxSchemaSnapshot {
|
): KtxSchemaSnapshot {
|
||||||
|
const warnings = recordArray(raw.warnings)
|
||||||
|
.map(mapWarning)
|
||||||
|
.filter((warning): warning is KtxScanWarning => warning !== null);
|
||||||
return {
|
return {
|
||||||
connectionId: requiredString(raw.connection_id, 'connection_id') || input.connectionId,
|
connectionId: requiredString(raw.connection_id, 'connection_id') || input.connectionId,
|
||||||
driver: 'postgres',
|
driver: 'postgres',
|
||||||
|
|
@@ -217,6 +240,7 @@ function mapDaemonSnapshot(
|
||||||
scope: { schemas: input.schemas },
|
scope: { schemas: input.schemas },
|
||||||
metadata: recordValue(raw.metadata),
|
metadata: recordValue(raw.metadata),
|
||||||
tables: recordArray(raw.tables).map(mapTable),
|
tables: recordArray(raw.tables).map(mapTable),
|
||||||
|
...(warnings.length > 0 ? { warnings } : {}),
|
||||||
};
|
};
|
||||||
}
|
}
|
||||||
|
|
||||||
|
|
|
||||||
|
|
@@ -0,0 +1,48 @@
|
||||||
|
import { readFile } from 'node:fs/promises';
|
||||||
|
import { join } from 'node:path';
|
||||||
|
import type { SourceFetchReport } from '../../types.js';
|
||||||
|
import { LIVE_DATABASE_WARNINGS_FILE } from './stage.js';
|
||||||
|
|
||||||
|
const OBJECT_SKIP_CODE = 'object_introspection_failed';
|
||||||
|
|
||||||
|
interface RawWarning {
|
||||||
|
code?: unknown;
|
||||||
|
message?: unknown;
|
||||||
|
table?: unknown;
|
||||||
|
}
|
||||||
|
|
||||||
|
/**
|
||||||
|
* Derives the fetch report from the staged `warnings.json`: objects that failed
|
||||||
|
* introspection become `skipped` entries so the run report, ingest summary, and
|
||||||
|
* `ktx status` can surface them. Returns null when nothing was skipped, keeping
|
||||||
|
* clean ingests free of an empty report.
|
||||||
|
*/
|
||||||
|
export async function readLiveDatabaseFetchReport(stagedDir: string): Promise<SourceFetchReport | null> {
|
||||||
|
let parsed: unknown;
|
||||||
|
try {
|
||||||
|
parsed = JSON.parse(await readFile(join(stagedDir, LIVE_DATABASE_WARNINGS_FILE), 'utf8'));
|
||||||
|
} catch {
|
||||||
|
return null;
|
||||||
|
}
|
||||||
|
const warnings =
|
||||||
|
parsed && typeof parsed === 'object' && Array.isArray((parsed as { warnings?: unknown }).warnings)
|
||||||
|
? ((parsed as { warnings: RawWarning[] }).warnings)
|
||||||
|
: [];
|
||||||
|
|
||||||
|
const skipped = warnings
|
||||||
|
.filter((warning) => warning.code === OBJECT_SKIP_CODE)
|
||||||
|
.map((warning) => ({
|
||||||
|
rawPath: '',
|
||||||
|
entityType: 'database_object',
|
||||||
|
entityId: typeof warning.table === 'string' ? warning.table : null,
|
||||||
|
severity: 'warning' as const,
|
||||||
|
statusCode: null,
|
||||||
|
message: typeof warning.message === 'string' ? warning.message : 'introspection failed',
|
||||||
|
retryRecommended: false,
|
||||||
|
}));
|
||||||
|
|
||||||
|
if (skipped.length === 0) {
|
||||||
|
return null;
|
||||||
|
}
|
||||||
|
return { status: 'partial', retryRecommended: false, skipped, warnings: [] };
|
||||||
|
}
|
||||||
|
|
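The derivation in `readLiveDatabaseFetchReport` can be read as a pure filter-and-normalise step once the file has been parsed. The sketch below isolates that step under stated assumptions: `deriveSkipped` and `SkippedEntry` are hypothetical names for illustration, and the entry shape is trimmed down from the `SourceFetchReport` entries shown in the diff.

```typescript
// Hypothetical, trimmed-down shapes; the real SourceFetchReport lives in types.js.
interface RawWarning {
  code?: unknown;
  message?: unknown;
  table?: unknown;
}

interface SkippedEntry {
  entityId: string | null;
  message: string;
}

const OBJECT_SKIP_CODE = 'object_introspection_failed';

// Pure core of the report derivation: keep only object-skip warnings and
// normalise their untyped fields. Returns null when nothing was skipped,
// mirroring the "no empty report for clean ingests" rule.
function deriveSkipped(parsed: unknown): SkippedEntry[] | null {
  const warnings: RawWarning[] =
    parsed && typeof parsed === 'object' && Array.isArray((parsed as { warnings?: unknown }).warnings)
      ? (parsed as { warnings: RawWarning[] }).warnings
      : [];
  const skipped = warnings
    .filter((warning) => warning.code === OBJECT_SKIP_CODE)
    .map((warning) => ({
      entityId: typeof warning.table === 'string' ? warning.table : null,
      message: typeof warning.message === 'string' ? warning.message : 'introspection failed',
    }));
  return skipped.length > 0 ? skipped : null;
}
```

Keeping the parsing of `warnings.json` separate from this step would let the filter be tested without touching the filesystem; whether the project splits it this way is not shown in the diff.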
@ -1,5 +1,7 @@
|
||||||
import type { ChunkResult, DiffSet, FetchContext, SourceAdapter } from '../../types.js';
|
import type { ChunkResult, DiffSet, FetchContext, SourceAdapter, SourceFetchReport } from '../../types.js';
|
||||||
import { chunkLiveDatabaseStagedDir } from './chunk.js';
|
import { chunkLiveDatabaseStagedDir } from './chunk.js';
|
||||||
|
import { readLiveDatabaseFetchReport } from './fetch-report.js';
|
||||||
|
import { assertLiveDatabaseScanOutcome } from './scan-outcome.js';
|
||||||
import { detectLiveDatabaseStagedDir, writeLiveDatabaseSnapshot } from './stage.js';
|
import { detectLiveDatabaseStagedDir, writeLiveDatabaseSnapshot } from './stage.js';
|
||||||
import type { LiveDatabaseSourceAdapterDeps } from './types.js';
|
import type { LiveDatabaseSourceAdapterDeps } from './types.js';
|
||||||
|
|
||||||
|
|
@ -13,14 +15,20 @@ export class LiveDatabaseSourceAdapter implements SourceAdapter {
|
||||||
return detectLiveDatabaseStagedDir(stagedDir);
|
return detectLiveDatabaseStagedDir(stagedDir);
|
||||||
}
|
}
|
||||||
|
|
||||||
|
readFetchReport(stagedDir: string): Promise<SourceFetchReport | null> {
|
||||||
|
return readLiveDatabaseFetchReport(stagedDir);
|
||||||
|
}
|
||||||
|
|
||||||
async fetch(_pullConfig: unknown, stagedDir: string, ctx: FetchContext): Promise<void> {
|
async fetch(_pullConfig: unknown, stagedDir: string, ctx: FetchContext): Promise<void> {
|
||||||
const tableScope = ctx.tableScope;
|
const tableScope = ctx.tableScope;
|
||||||
const snapshot = await this.deps.introspection.extractSchema(ctx.connectionId, { tableScope });
|
const snapshot = await this.deps.introspection.extractSchema(ctx.connectionId, { tableScope });
|
||||||
await writeLiveDatabaseSnapshot(stagedDir, {
|
const finalized = {
|
||||||
...snapshot,
|
...snapshot,
|
||||||
connectionId: ctx.connectionId,
|
connectionId: ctx.connectionId,
|
||||||
extractedAt: snapshot.extractedAt ?? (this.deps.now ?? (() => new Date()))().toISOString(),
|
extractedAt: snapshot.extractedAt ?? (this.deps.now ?? (() => new Date()))().toISOString(),
|
||||||
});
|
};
|
||||||
|
assertLiveDatabaseScanOutcome({ connectionId: ctx.connectionId, scope: tableScope, snapshot: finalized });
|
||||||
|
await writeLiveDatabaseSnapshot(stagedDir, finalized);
|
||||||
}
|
}
|
||||||
|
|
||||||
chunk(stagedDir: string, diffSet?: DiffSet): Promise<ChunkResult> {
|
chunk(stagedDir: string, diffSet?: DiffSet): Promise<ChunkResult> {
|
||||||
|
|
|
||||||
|
|
@ -162,7 +162,8 @@ function getShardKey(connectionType: string, catalog: string | null, db: string
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
function buildTableRef(name: string, catalog: string | null, db: string | null): string {
|
/** @internal */
|
||||||
|
export function buildTableRef(name: string, catalog: string | null, db: string | null): string {
|
||||||
const parts: string[] = [];
|
const parts: string[] = [];
|
||||||
if (catalog) {
|
if (catalog) {
|
||||||
parts.push(catalog);
|
parts.push(catalog);
|
||||||
|
|
@ -273,7 +274,10 @@ export function buildLiveDatabaseManifestShards(
|
||||||
for (const table of input.tables) {
|
for (const table of input.tables) {
|
||||||
const shardKey = getShardKey(input.connectionType, table.catalog, table.db);
|
const shardKey = getShardKey(input.connectionType, table.catalog, table.db);
|
||||||
const shard = shards.get(shardKey) ?? { tables: {} };
|
const shard = shards.get(shardKey) ?? { tables: {} };
|
||||||
const existingDescriptions = input.existingDescriptions?.get(table.name);
|
// Existing descriptions/usage are keyed by the fully-qualified ref so two
|
||||||
|
// same-named tables in different schemas never share an entry.
|
||||||
|
const fullRef = buildTableRef(table.name, table.catalog, table.db);
|
||||||
|
const existingDescriptions = input.existingDescriptions?.get(fullRef);
|
||||||
|
|
||||||
const columns: LiveDatabaseManifestColumn[] = table.columns.map((column) => {
|
const columns: LiveDatabaseManifestColumn[] = table.columns.map((column) => {
|
||||||
const manifestColumn: LiveDatabaseManifestColumn = {
|
const manifestColumn: LiveDatabaseManifestColumn = {
|
||||||
|
|
@ -297,7 +301,7 @@ export function buildLiveDatabaseManifestShards(
|
||||||
});
|
});
|
||||||
|
|
||||||
const entry: LiveDatabaseManifestTableEntry = {
|
const entry: LiveDatabaseManifestTableEntry = {
|
||||||
table: buildTableRef(table.name, table.catalog, table.db),
|
table: fullRef,
|
||||||
columns,
|
columns,
|
||||||
};
|
};
|
||||||
|
|
||||||
|
|
@ -306,7 +310,7 @@ export function buildLiveDatabaseManifestShards(
|
||||||
entry.descriptions = tableDescriptions;
|
entry.descriptions = tableDescriptions;
|
||||||
}
|
}
|
||||||
|
|
||||||
const usage = mergeUsagePreservingExternal(input.existingUsage?.get(table.name), table.usage);
|
const usage = mergeUsagePreservingExternal(input.existingUsage?.get(fullRef), table.usage);
|
||||||
if (usage) {
|
if (usage) {
|
||||||
entry.usage = usage;
|
entry.usage = usage;
|
||||||
}
|
}
|
||||||
|
|
|
||||||
|
|
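The reason for keying `existingDescriptions` and `existingUsage` by the fully-qualified ref can be shown with two same-named tables. This is a minimal sketch, not the project's code: `qualifiedRef` is a hypothetical stand-in for `buildTableRef` (only its first lines are visible in the diff, so the dot-joined format is an assumption), and the table names are invented for illustration.

```typescript
// Hypothetical stand-in for buildTableRef; the dot-joined format is assumed.
function qualifiedRef(name: string, catalog: string | null, db: string | null): string {
  return [catalog, db, name].filter((part): part is string => Boolean(part)).join('.');
}

// Illustrative tables: same bare name, different schemas.
const tables = [
  { name: 'orders', catalog: null, db: 'sales', description: 'Live customer orders' },
  { name: 'orders', catalog: null, db: 'archive', description: 'Archived orders' },
];

// Keyed by bare name: the second entry silently overwrites the first.
const byName = new Map<string, string>();
// Keyed by fully-qualified ref: each table keeps its own entry.
const byRef = new Map<string, string>();

for (const table of tables) {
  byName.set(table.name, table.description);
  byRef.set(qualifiedRef(table.name, table.catalog, table.db), table.description);
}
```

With bare-name keys the map ends up with one entry; with qualified keys it keeps both, which is the collision the change avoids.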
@ -0,0 +1,55 @@
|
||||||
|
import { KtxExpectedError } from '../../../../errors.js';
|
||||||
|
import { tableRefFromKey, type KtxTableRefKey } from '../../../scan/table-ref.js';
|
||||||
|
import type { KtxSchemaSnapshot } from '../../../scan/types.js';
|
||||||
|
|
||||||
|
const OBJECT_SKIP_CODE = 'object_introspection_failed';
|
||||||
|
|
||||||
|
function formatScopeEntry(key: KtxTableRefKey): string {
|
||||||
|
const ref = tableRefFromKey(key);
|
||||||
|
return [ref.catalog, ref.db, ref.name].filter((part): part is string => Boolean(part)).join('.');
|
||||||
|
}
|
||||||
|
|
||||||
|
function discoveredObjectNames(snapshot: KtxSchemaSnapshot): string[] {
|
||||||
|
const raw = (snapshot.metadata as Record<string, unknown>).discovered_object_names;
|
||||||
|
return Array.isArray(raw) ? raw.filter((value): value is string => typeof value === 'string') : [];
|
||||||
|
}
|
||||||
|
|
||||||
|
/**
|
||||||
|
* Enforces the partial-vs-total outcome rules for a live-database snapshot,
|
||||||
|
* uniformly for every connector. Outcomes follow from object counts, not a
|
||||||
|
* mode: a connection with at least one ingested object succeeds (any broken
|
||||||
|
* objects ride along as warnings); a connection where every introspected object
|
||||||
|
* failed, or a non-empty enabled_tables scope that matched nothing, raises a
|
||||||
|
* clear connection error instead of staging an empty layer that would later
|
||||||
|
* surface as the generic "did not recognize" message. A legitimately empty
|
||||||
|
* database (no scope, no objects) succeeds with an empty layer.
|
||||||
|
*/
|
||||||
|
export function assertLiveDatabaseScanOutcome(input: {
|
||||||
|
connectionId: string;
|
||||||
|
scope: ReadonlySet<KtxTableRefKey> | undefined;
|
||||||
|
snapshot: KtxSchemaSnapshot;
|
||||||
|
}): void {
|
||||||
|
const { connectionId, scope, snapshot } = input;
|
||||||
|
if (snapshot.tables.length > 0) {
|
||||||
|
return;
|
||||||
|
}
|
||||||
|
|
||||||
|
const skipped = (snapshot.warnings ?? []).filter((warning) => warning.code === OBJECT_SKIP_CODE);
|
||||||
|
if (skipped.length > 0) {
|
||||||
|
const detail = skipped.map((warning) => `${warning.table ?? 'object'}: ${warning.message}`).join('; ');
|
||||||
|
throw new KtxExpectedError(
|
||||||
|
`Connection "${connectionId}" produced no semantic layer: all ${skipped.length} introspected ` +
|
||||||
|
`${skipped.length === 1 ? 'object' : 'objects'} failed (${detail}).`,
|
||||||
|
);
|
||||||
|
}
|
||||||
|
|
||||||
|
if (scope && scope.size > 0) {
|
||||||
|
const requested = [...scope].map(formatScopeEntry).sort();
|
||||||
|
const available = discoveredObjectNames(snapshot);
|
||||||
|
const availableClause = available.length > 0 ? ` Available objects: ${available.join(', ')}.` : '';
|
||||||
|
throw new KtxExpectedError(
|
||||||
|
`enabled_tables for connection "${connectionId}" matched no objects ` +
|
||||||
|
`(looked for: ${requested.join(', ')}).${availableClause}`,
|
||||||
|
);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
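The outcome rules documented on `assertLiveDatabaseScanOutcome` reduce to a small decision over three counts. The sketch below restates them as a pure classifier; `classifyScanOutcome` and the `ScanOutcome` labels are hypothetical names introduced here, and the real function throws `KtxExpectedError` rather than returning a label.

```typescript
// Hypothetical labels for the three cases the doc comment describes.
type ScanOutcome = 'ok' | 'all_objects_failed' | 'scope_matched_nothing';

function classifyScanOutcome(input: {
  tableCount: number; // objects successfully ingested
  skippedCount: number; // warnings with code object_introspection_failed
  scopeSize: number; // size of the enabled_tables scope, 0 when unscoped
}): ScanOutcome {
  // At least one ingested object: success, broken objects ride along as warnings.
  if (input.tableCount > 0) return 'ok';
  // Nothing ingested and something failed: every introspected object failed.
  if (input.skippedCount > 0) return 'all_objects_failed';
  // Nothing ingested, nothing failed, but a scope was requested: it matched nothing.
  if (input.scopeSize > 0) return 'scope_matched_nothing';
  // No scope and no objects: a legitimately empty database.
  return 'ok';
}
```

The order of the checks matters: the skipped-object check runs before the scope check, so a scoped scan whose every object failed reports the introspection failures rather than a scope mismatch.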
@ -136,13 +136,13 @@ export async function readLiveDatabaseTableFiles(stagedDir: string): Promise<Liv
|
||||||
}
|
}
|
||||||
|
|
||||||
export async function detectLiveDatabaseStagedDir(stagedDir: string): Promise<boolean> {
|
export async function detectLiveDatabaseStagedDir(stagedDir: string): Promise<boolean> {
|
||||||
|
// A valid live-database staging is identified by its connection.json marker.
|
||||||
|
// An empty table set is a legitimate outcome (an empty database), so the
|
||||||
|
// presence of table files is not required — the total-vs-partial decision is
|
||||||
|
// made earlier by assertLiveDatabaseScanOutcome, before staging.
|
||||||
try {
|
try {
|
||||||
const meta = JSON.parse(await readFile(join(stagedDir, LIVE_DATABASE_META_FILE), 'utf8')) as unknown;
|
const meta = JSON.parse(await readFile(join(stagedDir, LIVE_DATABASE_META_FILE), 'utf8')) as unknown;
|
||||||
if (!meta || typeof meta !== 'object' || Array.isArray(meta)) {
|
return Boolean(meta) && typeof meta === 'object' && !Array.isArray(meta);
|
||||||
return false;
|
|
||||||
}
|
|
||||||
const files = await readLiveDatabaseTableFiles(stagedDir);
|
|
||||||
return files.length > 0;
|
|
||||||
} catch {
|
} catch {
|
||||||
return false;
|
return false;
|
||||||
}
|
}
|
||||||
|
|
|
||||||
|
|
@ -3,7 +3,7 @@ import { z } from 'zod';
|
||||||
const metabaseSyncModeSchema = z.enum(['ALL', 'ONLY', 'EXCEPT']);
|
const metabaseSyncModeSchema = z.enum(['ALL', 'ONLY', 'EXCEPT']);
|
||||||
export type MetabaseSyncMode = z.infer<typeof metabaseSyncModeSchema>;
|
export type MetabaseSyncMode = z.infer<typeof metabaseSyncModeSchema>;
|
||||||
|
|
||||||
const metabaseLocalConnectionIdSchema = z.string().regex(/^[a-zA-Z0-9][a-zA-Z0-9_-]*$/);
|
const metabaseLocalConnectionIdSchema = z.string().regex(/^[a-zA-Z0-9_][a-zA-Z0-9_-]*$/);
|
||||||
|
|
||||||
/**
|
/**
|
||||||
* The lean config the adapter needs at `fetch()` time. Lives in the BullMQ payload's
|
* The lean config the adapter needs at `fetch()` time. Lives in the BullMQ payload's
|
||||||
|
|
|
||||||
|
|
@ -1081,6 +1081,7 @@ export class IngestBundleRunner {
|
||||||
skillsPrompt: input.skillsPrompt,
|
skillsPrompt: input.skillsPrompt,
|
||||||
syncId: input.syncId,
|
syncId: input.syncId,
|
||||||
sourceKey: input.job.sourceKey,
|
sourceKey: input.job.sourceKey,
|
||||||
|
connectionId: input.job.connectionId,
|
||||||
canonicalPins: input.canonicalPins,
|
canonicalPins: input.canonicalPins,
|
||||||
});
|
});
|
||||||
|
|
||||||
|
|
|
||||||
|
|
@ -478,11 +478,11 @@ function parseKnowledgeIndexPath(file: string): { scope: 'GLOBAL' | 'USER'; page
|
||||||
const segments = file.split('/');
|
const segments = file.split('/');
|
||||||
if (segments.length === 2 && segments[0] === 'global') {
|
if (segments.length === 2 && segments[0] === 'global') {
|
||||||
const pageKey = segments[1].replace(/\.md$/, '');
|
const pageKey = segments[1].replace(/\.md$/, '');
|
||||||
return /^[a-zA-Z0-9][a-zA-Z0-9_-]*$/.test(pageKey) ? { scope: 'GLOBAL', pageKey } : null;
|
return /^[a-zA-Z0-9_][a-zA-Z0-9_-]*$/.test(pageKey) ? { scope: 'GLOBAL', pageKey } : null;
|
||||||
}
|
}
|
||||||
if (segments.length === 3 && segments[0] === 'user') {
|
if (segments.length === 3 && segments[0] === 'user') {
|
||||||
const pageKey = segments[2].replace(/\.md$/, '');
|
const pageKey = segments[2].replace(/\.md$/, '');
|
||||||
return /^[a-zA-Z0-9][a-zA-Z0-9_-]*$/.test(pageKey) ? { scope: 'USER', pageKey } : null;
|
return /^[a-zA-Z0-9_][a-zA-Z0-9_-]*$/.test(pageKey) ? { scope: 'USER', pageKey } : null;
|
||||||
}
|
}
|
||||||
return null;
|
return null;
|
||||||
}
|
}
|
||||||
|
|
|
||||||
|
|
@ -104,7 +104,7 @@ class LocalIngestPhase implements IngestJobPhase {
|
||||||
}
|
}
|
||||||
|
|
||||||
function safeSegment(kind: string, value: string): string {
|
function safeSegment(kind: string, value: string): string {
|
||||||
if (!/^[a-zA-Z0-9][a-zA-Z0-9_-]*$/.test(value)) {
|
if (!/^[a-zA-Z0-9_][a-zA-Z0-9_-]*$/.test(value)) {
|
||||||
throw new Error(`Unsafe ${kind}: ${value}`);
|
throw new Error(`Unsafe ${kind}: ${value}`);
|
||||||
}
|
}
|
||||||
return value;
|
return value;
|
||||||
|
|
|
||||||
|
|
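The identifier pattern change repeated across these hunks only widens the first character class to admit an underscore. A small sketch comparing the two patterns, with `OLD_ID_PATTERN`, `NEW_ID_PATTERN`, and the extra `pattern` parameter being names and a signature introduced here for illustration:

```typescript
// Both patterns are copied from the diff; only the constant names are new.
const OLD_ID_PATTERN = /^[a-zA-Z0-9][a-zA-Z0-9_-]*$/;
const NEW_ID_PATTERN = /^[a-zA-Z0-9_][a-zA-Z0-9_-]*$/;

// Same behaviour as safeSegment in the diff, with the pattern made a
// parameter (an assumption) so both versions can be exercised.
function safeSegment(kind: string, value: string, pattern: RegExp = NEW_ID_PATTERN): string {
  if (!pattern.test(value)) {
    throw new Error(`Unsafe ${kind}: ${value}`);
  }
  return value;
}
```

Ids such as `_staging` pass only the new pattern. A leading hyphen, path separators, dots, and the empty string are still rejected by both, so the path-safety intent of the check is unchanged.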
@ -10,7 +10,7 @@ import type { MemoryFlowEventSink, MemoryFlowPlannedWorkUnit } from './memory-fl
|
||||||
import { buildSyncId } from './raw-sources-paths.js';
|
import { buildSyncId } from './raw-sources-paths.js';
|
||||||
import { SqliteLocalIngestStore } from './sqlite-local-ingest-store.js';
|
import { SqliteLocalIngestStore } from './sqlite-local-ingest-store.js';
|
||||||
import type { KtxTableRefKey } from '../scan/table-ref.js';
|
import type { KtxTableRefKey } from '../scan/table-ref.js';
|
||||||
import type { IngestTrigger, SourceAdapter, WorkUnit } from './types.js';
|
import type { IngestTrigger, SourceAdapter, SourceFetchReport, WorkUnit } from './types.js';
|
||||||
|
|
||||||
type LocalIngestStatus = 'running' | 'done' | 'error';
|
type LocalIngestStatus = 'running' | 'done' | 'error';
|
||||||
|
|
||||||
|
|
@ -46,6 +46,8 @@ export interface LocalIngestRunRecord {
|
||||||
workUnits: Array<Pick<WorkUnit, 'unitKey' | 'rawFiles' | 'peerFileIndex' | 'dependencyPaths'>>;
|
workUnits: Array<Pick<WorkUnit, 'unitKey' | 'rawFiles' | 'peerFileIndex' | 'dependencyPaths'>>;
|
||||||
evictionDeletedRawPaths: string[];
|
evictionDeletedRawPaths: string[];
|
||||||
errors: string[];
|
errors: string[];
|
||||||
|
/** Fetch-phase outcome (e.g. objects skipped during introspection). */
|
||||||
|
fetch?: SourceFetchReport;
|
||||||
}
|
}
|
||||||
|
|
||||||
export type LocalIngestReport = LocalIngestRunRecord & {
|
export type LocalIngestReport = LocalIngestRunRecord & {
|
||||||
|
|
@ -70,7 +72,7 @@ const LOCAL_AUTHOR = 'ktx';
|
||||||
const LOCAL_AUTHOR_EMAIL = 'ktx@example.com';
|
const LOCAL_AUTHOR_EMAIL = 'ktx@example.com';
|
||||||
|
|
||||||
function safeSegment(kind: string, value: string): string {
|
function safeSegment(kind: string, value: string): string {
|
||||||
if (!/^[a-zA-Z0-9][a-zA-Z0-9_-]*$/.test(value)) {
|
if (!/^[a-zA-Z0-9_][a-zA-Z0-9_-]*$/.test(value)) {
|
||||||
throw new Error(`Unsafe ${kind}: ${value}`);
|
throw new Error(`Unsafe ${kind}: ${value}`);
|
||||||
}
|
}
|
||||||
return value;
|
return value;
|
||||||
|
|
@ -291,6 +293,8 @@ async function runLocalStageOnlyIngestInner(options: RunLocalStageOnlyIngestOpti
|
||||||
throw new Error(`Adapter "${adapter.source}" did not recognize ${sourceDir ?? 'fetched source output'}`);
|
throw new Error(`Adapter "${adapter.source}" did not recognize ${sourceDir ?? 'fetched source output'}`);
|
||||||
}
|
}
|
||||||
|
|
||||||
|
const fetchReport = adapter.readFetchReport ? await adapter.readFetchReport(stagedDir) : null;
|
||||||
|
|
||||||
const relativeFiles = await walkFiles(stagedDir);
|
const relativeFiles = await walkFiles(stagedDir);
|
||||||
options.memoryFlow?.update({ sourceDir });
|
options.memoryFlow?.update({ sourceDir });
|
||||||
options.memoryFlow?.emit({
|
options.memoryFlow?.emit({
|
||||||
|
|
@ -405,6 +409,7 @@ async function runLocalStageOnlyIngestInner(options: RunLocalStageOnlyIngestOpti
|
||||||
})),
|
})),
|
||||||
evictionDeletedRawPaths: chunkResult.eviction?.deletedRawPaths ?? [],
|
evictionDeletedRawPaths: chunkResult.eviction?.deletedRawPaths ?? [],
|
||||||
errors: [],
|
errors: [],
|
||||||
|
...(fetchReport ? { fetch: fetchReport } : {}),
|
||||||
};
|
};
|
||||||
|
|
||||||
if (!options.dryRun) {
|
if (!options.dryRun) {
|
||||||
|
|
|
||||||
|
|
@ -26,14 +26,16 @@ export function buildWuSystemPrompt(params: {
|
||||||
skillsPrompt: string;
|
skillsPrompt: string;
|
||||||
syncId: string;
|
syncId: string;
|
||||||
sourceKey: string;
|
sourceKey: string;
|
||||||
|
connectionId?: string;
|
||||||
canonicalPins?: CanonicalPin[];
|
canonicalPins?: CanonicalPin[];
|
||||||
}): string {
|
}): string {
|
||||||
|
const connectionLine = params.connectionId ? `\nconnectionId: ${params.connectionId}` : '';
|
||||||
const parts = [
|
const parts = [
|
||||||
params.baseFraming.trimEnd(),
|
params.baseFraming.trimEnd(),
|
||||||
VERIFICATION_LEDGER_PROMPT,
|
VERIFICATION_LEDGER_PROMPT,
|
||||||
params.skillsPrompt.trimEnd(),
|
params.skillsPrompt.trimEnd(),
|
||||||
buildCanonicalPinsPromptBlock(params.canonicalPins ?? []),
|
buildCanonicalPinsPromptBlock(params.canonicalPins ?? []),
|
||||||
`\n<context>\nsyncId: ${params.syncId}\nsource: ${params.sourceKey}\n</context>`,
|
`\n<context>\nsyncId: ${params.syncId}\nsource: ${params.sourceKey}${connectionLine}\n</context>`,
|
||||||
];
|
];
|
||||||
return parts.filter(Boolean).join('\n');
|
return parts.filter(Boolean).join('\n');
|
||||||
}
|
}
|
||||||
|
|
|
||||||
|
|
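The effect of the optional `connectionId` on the prompt's context block can be checked in isolation. This sketch extracts only the `<context>` template from `buildWuSystemPrompt`; `buildContextBlock` is a hypothetical helper name, and the sample values in the usage note are invented.

```typescript
// Mirrors the template literal in the diff: the connectionId line is emitted
// only when a connection id is present, so unscoped prompts are unchanged.
function buildContextBlock(params: { syncId: string; sourceKey: string; connectionId?: string }): string {
  const connectionLine = params.connectionId ? `\nconnectionId: ${params.connectionId}` : '';
  return `\n<context>\nsyncId: ${params.syncId}\nsource: ${params.sourceKey}${connectionLine}\n</context>`;
}
```

Calling it with `{ syncId: 's1', sourceKey: 'live-database', connectionId: 'warehouse' }` adds a third line inside `<context>`; omitting `connectionId` yields the same block as before the change.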
@ -4,7 +4,7 @@ import { BaseTool, type ToolContext, type ToolOutput } from '../../../../context
|
||||||
|
|
||||||
const discoverDataInputSchema = z.object({
|
const discoverDataInputSchema = z.object({
|
||||||
query: z.string().optional(),
|
query: z.string().optional(),
|
||||||
connectionId: z.string().regex(/^[a-zA-Z0-9][a-zA-Z0-9_-]*$/).optional(),
|
connectionId: z.string().regex(/^[a-zA-Z0-9_][a-zA-Z0-9_-]*$/).optional(),
|
||||||
limit: z.number().int().positive().max(50).optional().default(10),
|
limit: z.number().int().positive().max(50).optional().default(10),
|
||||||
sourceName: z.string().optional(),
|
sourceName: z.string().optional(),
|
||||||
}).strict();
|
}).strict();
|
||||||
|
|
|
||||||
|
|
@ -14,7 +14,7 @@ const targetSchema = z.union([
|
||||||
]);
|
]);
|
||||||
|
|
||||||
const entityDetailsInputSchema = z.object({
|
const entityDetailsInputSchema = z.object({
|
||||||
connectionId: z.string().regex(/^[a-zA-Z0-9][a-zA-Z0-9_-]*$/),
|
connectionId: z.string().regex(/^[a-zA-Z0-9_][a-zA-Z0-9_-]*$/),
|
||||||
targets: z.array(targetSchema).min(1).max(50),
|
targets: z.array(targetSchema).min(1).max(50),
|
||||||
}).strict();
|
}).strict();
|
||||||
|
|
||||||
|
|
|
||||||
|
|
@ -6,7 +6,7 @@ import type { SqlAnalysisPort } from '../../../../context/sql-analysis/ports.js'
|
||||||
import { BaseTool, type ToolContext, type ToolOutput } from '../../../../context/tools/base-tool.js';
|
import { BaseTool, type ToolContext, type ToolOutput } from '../../../../context/tools/base-tool.js';
|
||||||
|
|
||||||
const sqlExecutionInputSchema = z.object({
|
const sqlExecutionInputSchema = z.object({
|
||||||
connectionId: z.string().regex(/^[a-zA-Z0-9][a-zA-Z0-9_-]*$/),
|
connectionId: z.string().regex(/^[a-zA-Z0-9_][a-zA-Z0-9_-]*$/),
|
||||||
sql: z.string().min(1),
|
sql: z.string().min(1),
|
||||||
rowLimit: z.number().int().positive().max(1000).optional().default(100),
|
rowLimit: z.number().int().positive().max(1000).optional().default(100),
|
||||||
}).strict();
|
}).strict();
|
||||||
|
|
|
||||||
|
|
@ -172,6 +172,12 @@ export class AiSdkKtxLlmRuntime implements KtxLlmRuntimePort {
|
||||||
this.logger = deps.logger ?? noopLogger;
|
this.logger = deps.logger ?? noopLogger;
|
||||||
}
|
}
|
||||||
|
|
||||||
|
// HTTP backend: abortSignal cancels the underlying fetch natively, so there is
|
||||||
|
// no SDK-owned child to tree-kill.
|
||||||
|
subprocessForkSpec(): null {
|
||||||
|
return null;
|
||||||
|
}
|
||||||
|
|
||||||
private async generateTextWithRateLimitRetry<T>(
|
private async generateTextWithRateLimitRetry<T>(
|
||||||
provider: RateLimitProvider,
|
provider: RateLimitProvider,
|
||||||
abortSignal: AbortSignal | undefined,
|
abortSignal: AbortSignal | undefined,
|
||||||
|
|
|
||||||
|
|
@ -6,6 +6,7 @@ import {
|
||||||
type SDKResultMessage,
|
type SDKResultMessage,
|
||||||
} from '@anthropic-ai/claude-agent-sdk';
|
} from '@anthropic-ai/claude-agent-sdk';
|
||||||
import { z } from 'zod';
|
import { z } from 'zod';
|
||||||
|
import type { KtxModelRole } from '../../llm/types.js';
|
||||||
import { createAbortError, isAbortError, throwIfAborted } from '../core/abort.js';
|
import { createAbortError, isAbortError, throwIfAborted } from '../core/abort.js';
|
||||||
import { createKtxClaudeCodeEnv } from './claude-code-env.js';
|
import { createKtxClaudeCodeEnv } from './claude-code-env.js';
|
||||||
import { resolveClaudeCodeModel } from './claude-code-models.js';
|
import { resolveClaudeCodeModel } from './claude-code-models.js';
|
||||||
|
|
@ -13,6 +14,7 @@ import type { RateLimitGovernor, RateLimitSignal } from './rate-limit-governor.j
|
||||||
import { createClaudeSdkTools, mcpToolIds } from './runtime-tools.js';
|
import { createClaudeSdkTools, mcpToolIds } from './runtime-tools.js';
|
||||||
import type {
|
import type {
|
||||||
KtxGenerateObjectInput,
|
KtxGenerateObjectInput,
|
||||||
|
KtxGenerateStructuredJsonInput,
|
||||||
KtxGenerateTextInput,
|
KtxGenerateTextInput,
|
||||||
KtxLlmRuntimePort,
|
KtxLlmRuntimePort,
|
||||||
KtxRuntimeToolSet,
|
KtxRuntimeToolSet,
|
||||||
|
|
@ -20,6 +22,7 @@ import type {
|
||||||
RunLoopParams,
|
RunLoopParams,
|
||||||
RunLoopResult,
|
RunLoopResult,
|
||||||
RunLoopStopReason,
|
RunLoopStopReason,
|
||||||
|
SubprocessRuntimeForkSpec,
|
||||||
} from './runtime-port.js';
|
} from './runtime-port.js';
|
||||||
|
|
||||||
type QueryResult = AsyncIterable<SDKMessage> & {
|
type QueryResult = AsyncIterable<SDKMessage> & {
|
||||||
|
|
@ -389,9 +392,15 @@ export class ClaudeCodeKtxLlmRuntime implements KtxLlmRuntimePort {
|
||||||
return result.result;
|
return result.result;
|
||||||
}
|
}
|
||||||
|
|
||||||
async generateObject<TOutput, TSchema extends z.ZodType<TOutput>>(
|
// Structured generation has no tools, so generateObject and
|
||||||
input: KtxGenerateObjectInput<TOutput, TSchema>,
|
// generateStructuredJson (the kill-boundary child path) share this one query.
|
||||||
): Promise<TOutput> {
|
private async runStructuredQuery(input: {
|
||||||
|
role: KtxModelRole;
|
||||||
|
prompt: string;
|
||||||
|
system?: string;
|
||||||
|
jsonSchema: Record<string, unknown>;
|
||||||
|
abortSignal?: AbortSignal;
|
||||||
|
}): Promise<SDKResultMessage> {
|
||||||
const options = {
|
const options = {
|
||||||
...baseOptions({
|
...baseOptions({
|
||||||
projectDir: this.deps.projectDir,
|
projectDir: this.deps.projectDir,
|
||||||
|
|
@ -403,19 +412,30 @@ export class ClaudeCodeKtxLlmRuntime implements KtxLlmRuntimePort {
|
||||||
// 5 leaves headroom without enabling unbounded loops; the json_schema
|
// 5 leaves headroom without enabling unbounded loops; the json_schema
|
||||||
// constraint still forces the final answer to be the schema.
|
// constraint still forces the final answer to be the schema.
|
||||||
maxTurns: 5,
|
maxTurns: 5,
|
||||||
tools: input.tools,
|
|
||||||
}),
|
}),
|
||||||
outputFormat: { type: 'json_schema' as const, schema: jsonSchema(input.schema as z.ZodType) },
|
outputFormat: { type: 'json_schema' as const, schema: input.jsonSchema },
|
||||||
};
|
};
|
||||||
const startedAt = Date.now();
|
return collectResultWithRateLimitRetry({
|
||||||
const result = await collectResultWithRateLimitRetry({
|
|
||||||
query: this.runQuery,
|
query: this.runQuery,
|
||||||
prompt: [input.system, input.prompt].filter(Boolean).join('\n\n'),
|
prompt: [input.system, input.prompt].filter(Boolean).join('\n\n'),
|
||||||
options,
|
options,
|
||||||
allowedToolIds: new Set([...mcpToolIds(input.tools ?? {}), STRUCTURED_OUTPUT_TOOL_NAME]),
|
allowedToolIds: new Set([STRUCTURED_OUTPUT_TOOL_NAME]),
|
||||||
expectedMcpServerNames: expectedMcpServerNames(input.tools),
|
expectedMcpServerNames: new Set(),
|
||||||
rateLimitGovernor: this.deps.rateLimitGovernor,
|
rateLimitGovernor: this.deps.rateLimitGovernor,
|
||||||
abortSignal: input.abortSignal,
|
...(input.abortSignal ? { abortSignal: input.abortSignal } : {}),
|
||||||
|
});
|
||||||
|
}
|
||||||
|
|
||||||
|
async generateObject<TOutput, TSchema extends z.ZodType<TOutput>>(
|
||||||
|
input: KtxGenerateObjectInput<TOutput, TSchema>,
|
||||||
|
): Promise<TOutput> {
|
||||||
|
const startedAt = Date.now();
|
||||||
|
const result = await this.runStructuredQuery({
|
||||||
|
role: input.role,
|
||||||
|
prompt: input.prompt,
|
||||||
|
...(input.system !== undefined ? { system: input.system } : {}),
|
||||||
|
jsonSchema: jsonSchema(input.schema as z.ZodType),
|
||||||
|
...(input.abortSignal ? { abortSignal: input.abortSignal } : {}),
|
||||||
});
|
});
|
||||||
input.onMetrics?.({ totalMs: Date.now() - startedAt, usage: claudeTokenUsage(result) });
|
input.onMetrics?.({ totalMs: Date.now() - startedAt, usage: claudeTokenUsage(result) });
|
||||||
const error = resultError(result);
|
const error = resultError(result);
|
||||||
|
|
@ -428,6 +448,28 @@ export class ClaudeCodeKtxLlmRuntime implements KtxLlmRuntimePort {
|
||||||
return (input.schema as z.ZodType<TOutput>).parse(result.structured_output);
|
return (input.schema as z.ZodType<TOutput>).parse(result.structured_output);
|
||||||
}
|
}
|
||||||
|
|
||||||
|
async generateStructuredJson(input: KtxGenerateStructuredJsonInput): Promise<unknown> {
|
||||||
|
const result = await this.runStructuredQuery({
|
||||||
|
role: input.role,
|
||||||
|
prompt: input.prompt,
|
||||||
|
...(input.system !== undefined ? { system: input.system } : {}),
|
||||||
|
jsonSchema: input.jsonSchema,
|
||||||
|
...(input.abortSignal ? { abortSignal: input.abortSignal } : {}),
|
||||||
|
});
|
||||||
|
const error = resultError(result);
|
||||||
|
if (error) {
|
||||||
|
throw error;
|
||||||
|
}
|
||||||
|
if (result.subtype !== 'success') {
|
||||||
|
throw new Error(`Claude Code query failed (${result.subtype})`);
|
||||||
|
}
|
||||||
|
return result.structured_output;
|
||||||
|
}
|
||||||
|
|
||||||
|
subprocessForkSpec(): SubprocessRuntimeForkSpec {
|
||||||
|
return { backend: 'claude-code', projectDir: this.deps.projectDir, modelSlots: this.deps.modelSlots };
|
||||||
|
}
|
||||||
|
|
||||||
async runAgentLoop(params: RunLoopParams): Promise<RunLoopResult> {
|
async runAgentLoop(params: RunLoopParams): Promise<RunLoopResult> {
|
||||||
const startedAt = Date.now();
|
const startedAt = Date.now();
|
||||||
try {
|
try {
|
||||||
|
|
|
||||||
|
|
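Both `generateObject` and `generateStructuredJson` forward optional fields with a conditional spread rather than assigning a possibly-undefined value. The sketch below shows what that idiom does to the resulting object; the motivation (for example a strict optional-property compiler setting) is an assumption, and `StructuredQueryInput` and `buildStructuredQueryInput` are hypothetical names.

```typescript
// Hypothetical, reduced input shape for illustration.
interface StructuredQueryInput {
  prompt: string;
  system?: string;
}

// The conditional spread leaves the key absent instead of present-but-undefined.
function buildStructuredQueryInput(prompt: string, system: string | undefined): StructuredQueryInput {
  return {
    prompt,
    ...(system !== undefined ? { system } : {}),
  };
}
```

The difference is observable with the `in` operator or `Object.keys`: an omitted key does not appear at all, whereas `{ system: undefined }` would still carry the key. Note that the `!== undefined` check keeps an empty-string `system`, unlike a truthiness check.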
@ -9,14 +9,17 @@ import { resolveCodexModel } from './codex-models.js';
|
||||||
import { buildCodexRuntimeConfig } from './codex-runtime-config.js';
|
import { buildCodexRuntimeConfig } from './codex-runtime-config.js';
|
||||||
import { CodexSdkCliRunner, type CodexSdkRunner } from './codex-sdk-runner.js';
|
import { CodexSdkCliRunner, type CodexSdkRunner } from './codex-sdk-runner.js';
|
||||||
import type { RateLimitGovernor } from './rate-limit-governor.js';
|
import type { RateLimitGovernor } from './rate-limit-governor.js';
|
||||||
|
import type { KtxModelRole } from '../../llm/types.js';
|
||||||
import type {
|
import type {
|
||||||
KtxGenerateObjectInput,
|
KtxGenerateObjectInput,
|
||||||
|
KtxGenerateStructuredJsonInput,
|
||||||
KtxGenerateTextInput,
|
KtxGenerateTextInput,
|
||||||
KtxLlmRuntimePort,
|
KtxLlmRuntimePort,
|
||||||
KtxRuntimeToolSet,
|
KtxRuntimeToolSet,
|
||||||
LlmTokenUsage,
|
LlmTokenUsage,
|
||||||
RunLoopParams,
|
RunLoopParams,
|
||||||
RunLoopResult,
|
RunLoopResult,
|
||||||
|
SubprocessRuntimeForkSpec,
|
||||||
} from './runtime-port.js';
|
} from './runtime-port.js';
|
||||||
|
|
||||||
export interface CodexKtxLlmRuntimeDeps {
|
export interface CodexKtxLlmRuntimeDeps {
|
||||||
|
|
@ -249,56 +252,78 @@ export class CodexKtxLlmRuntime implements KtxLlmRuntimePort {
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
|
// Structured generation has no tools, so it skips the MCP server that
|
||||||
|
// generateText/runAgentLoop need; generateObject and generateStructuredJson
|
||||||
|
// (the kill-boundary child path) share this one streaming implementation.
|
||||||
|
private async streamStructuredText(input: {
|
||||||
|
role: KtxModelRole;
|
||||||
|
prompt: string;
|
||||||
|
system?: string;
|
||||||
|
jsonSchema: Record<string, unknown>;
|
||||||
|
abortSignal?: AbortSignal;
|
||||||
|
}): Promise<{ text: string; summary: CodexExecEventSummary; startedAt: number }> {
|
||||||
|
const startedAt = Date.now();
|
||||||
|
const model = modelForRole(this.deps.modelSlots, input.role);
|
||||||
|
const config = buildCodexRuntimeConfig({ model });
|
||||||
|
const result = await this.runWithRateLimitRetry(
|
||||||
|
input.abortSignal,
|
||||||
|
async () => {
|
||||||
|
const collected = await collectEvents(
|
||||||
|
await this.runner.runStreamed({
|
||||||
|
projectDir: this.deps.projectDir,
|
||||||
|
model,
|
||||||
|
prompt: promptWithSystem(input.system, input.prompt),
|
||||||
|
configOverrides: config.configOverrides,
|
||||||
|
env: config.env,
|
||||||
|
outputSchema: input.jsonSchema,
|
||||||
|
...(input.abortSignal ? { signal: input.abortSignal } : {}),
|
||||||
|
}),
|
||||||
|
);
|
||||||
|
const summary = summarizeCodexExecEvents(collected.events, { startedAt });
|
||||||
|
return { collected, summary };
|
||||||
|
},
|
||||||
|
({ collected, summary }) => summaryError(summary, collected.streamError),
|
||||||
|
);
|
||||||
|
return {
|
||||||
|
text: assertSuccessfulText(result.summary, result.collected.streamError),
|
||||||
|
summary: result.summary,
|
||||||
|
startedAt,
|
||||||
|
};
|
||||||
|
}
|
||||||
|
|
||||||
async generateObject<TOutput, TSchema extends z.ZodType<TOutput>>(
|
async generateObject<TOutput, TSchema extends z.ZodType<TOutput>>(
|
||||||
input: KtxGenerateObjectInput<TOutput, TSchema>,
|
input: KtxGenerateObjectInput<TOutput, TSchema>,
|
||||||
): Promise<TOutput> {
|
): Promise<TOutput> {
|
||||||
const startedAt = Date.now();
|
const { text, summary, startedAt } = await this.streamStructuredText({
|
||||||
const model = modelForRole(this.deps.modelSlots, input.role);
|
role: input.role,
|
||||||
const mcp = await mcpForTools({
|
prompt: input.prompt,
|
||||||
projectDir: this.deps.projectDir,
|
...(input.system !== undefined ? { system: input.system } : {}),
|
||||||
toolSet: input.tools,
|
jsonSchema: z.toJSONSchema(input.schema, { target: 'draft-7' }) as Record<string, unknown>,
|
||||||
startMcpServer: this.deps.startMcpServer,
|
...(input.abortSignal ? { abortSignal: input.abortSignal } : {}),
|
||||||
|
});
|
||||||
|
input.onMetrics?.(metrics(summary, startedAt));
|
||||||
|
return parseStructuredOutput(input.schema, text);
|
||||||
|
}
|
||||||
|
|
||||||
|
async generateStructuredJson(input: KtxGenerateStructuredJsonInput): Promise<unknown> {
|
||||||
|
const { text } = await this.streamStructuredText({
|
||||||
|
role: input.role,
|
||||||
|
prompt: input.prompt,
|
||||||
|
...(input.system !== undefined ? { system: input.system } : {}),
|
||||||
|
jsonSchema: input.jsonSchema,
|
||||||
|
...(input.abortSignal ? { abortSignal: input.abortSignal } : {}),
|
||||||
});
|
});
|
||||||
try {
|
try {
|
||||||
const config = buildCodexRuntimeConfig({
|
return JSON.parse(text);
|
||||||
model,
|
} catch (error) {
|
||||||
...(mcp
|
throw new Error(`Codex structured output is not valid JSON: ${error instanceof Error ? error.message : String(error)}`);
|
||||||
? {
|
|
||||||
mcp: {
|
|
||||||
url: mcp.url,
|
|
||||||
bearerTokenEnvVar: mcp.bearerTokenEnvVar,
|
|
||||||
bearerToken: mcp.bearerToken,
|
|
||||||
toolNames: runtimeToolNames(input.tools),
|
|
||||||
},
|
|
||||||
}
|
|
||||||
: {}),
|
|
||||||
});
|
|
||||||
const result = await this.runWithRateLimitRetry(
|
|
||||||
input.abortSignal,
|
|
||||||
async () => {
|
|
||||||
const collected = await collectEvents(
|
|
||||||
await this.runner.runStreamed({
|
|
||||||
projectDir: this.deps.projectDir,
|
|
||||||
model,
|
|
||||||
prompt: promptWithSystem(input.system, input.prompt),
|
|
||||||
configOverrides: config.configOverrides,
|
|
||||||
env: config.env,
|
|
||||||
outputSchema: z.toJSONSchema(input.schema, { target: 'draft-7' }) as Record<string, unknown>,
|
|
||||||
...(input.abortSignal ? { signal: input.abortSignal } : {}),
|
|
||||||
}),
|
|
||||||
);
|
|
||||||
const summary = summarizeCodexExecEvents(collected.events, { startedAt });
|
|
||||||
return { collected, summary };
|
|
||||||
},
|
|
||||||
({ collected, summary }) => summaryError(summary, collected.streamError),
|
|
||||||
);
|
|
||||||
input.onMetrics?.(metrics(result.summary, startedAt));
|
|
||||||
return parseStructuredOutput(input.schema, assertSuccessfulText(result.summary, result.collected.streamError));
|
|
||||||
} finally {
|
|
||||||
await mcp?.close();
|
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
|
subprocessForkSpec(): SubprocessRuntimeForkSpec {
|
||||||
|
return { backend: 'codex', projectDir: this.deps.projectDir, modelSlots: this.deps.modelSlots };
|
||||||
|
}
|
||||||
|
|
||||||
async runAgentLoop(params: RunLoopParams): Promise<RunLoopResult> {
|
async runAgentLoop(params: RunLoopParams): Promise<RunLoopResult> {
|
||||||
const startedAt = Date.now();
|
const startedAt = Date.now();
|
||||||
const model = modelForRole(this.deps.modelSlots, params.modelRole);
|
const model = modelForRole(this.deps.modelSlots, params.modelRole);
|
||||||
|
|
|
||||||
|
|
@@ -72,12 +72,38 @@ export interface KtxGenerateObjectInput<TOutput, TSchema extends z.ZodType<TOutp
|
||||||
abortSignal?: AbortSignal;
|
abortSignal?: AbortSignal;
|
||||||
}
|
}
|
||||||
|
|
||||||
|
/** Structured generation keyed by a raw JSON Schema instead of a Zod schema, so
|
||||||
|
* the request can cross a process boundary; the caller validates the returned
|
||||||
|
* value against the real Zod schema. */
|
||||||
|
export interface KtxGenerateStructuredJsonInput {
|
||||||
|
role: KtxModelRole;
|
||||||
|
prompt: string;
|
||||||
|
system?: string;
|
||||||
|
jsonSchema: Record<string, unknown>;
|
||||||
|
abortSignal?: AbortSignal;
|
||||||
|
}
|
||||||
|
|
||||||
|
/** Serializable recipe to rebuild a subprocess-backed runtime inside a ktx-owned
|
||||||
|
* child the parent can tree-kill. Returned by {@link KtxLlmRuntimePort.subprocessForkSpec}. */
|
||||||
|
export interface SubprocessRuntimeForkSpec {
|
||||||
|
backend: 'codex' | 'claude-code';
|
||||||
|
projectDir: string;
|
||||||
|
modelSlots: { default: string } & Partial<Record<string, string>>;
|
||||||
|
}
|
||||||
|
|
||||||
export interface KtxLlmRuntimePort {
|
export interface KtxLlmRuntimePort {
|
||||||
generateText(input: KtxGenerateTextInput): Promise<string>;
|
generateText(input: KtxGenerateTextInput): Promise<string>;
|
||||||
generateObject<TOutput, TSchema extends z.ZodType<TOutput>>(
|
generateObject<TOutput, TSchema extends z.ZodType<TOutput>>(
|
||||||
input: KtxGenerateObjectInput<TOutput, TSchema>,
|
input: KtxGenerateObjectInput<TOutput, TSchema>,
|
||||||
): Promise<TOutput>;
|
): Promise<TOutput>;
|
||||||
runAgentLoop(params: RunLoopParams): Promise<RunLoopResult>;
|
runAgentLoop(params: RunLoopParams): Promise<RunLoopResult>;
|
||||||
|
/**
|
||||||
|
* Non-null when this runtime drives an SDK-owned child process that ktx cannot
|
||||||
|
* cancel by abort alone (codex/claude-code spawn a binary the SDK owns and only
|
||||||
|
* SIGTERM on abort). ktx routes such calls through a tree-killable boundary.
|
||||||
|
* Null for HTTP backends, whose native fetch abort already settles promptly.
|
||||||
|
*/
|
||||||
|
subprocessForkSpec(): SubprocessRuntimeForkSpec | null;
|
||||||
}
|
}
|
||||||
|
|
||||||
export interface AgentRunnerPort {
|
export interface AgentRunnerPort {
|
||||||
|
|
|
||||||
|
|
@@ -0,0 +1,39 @@
|
||||||
|
import { ClaudeCodeKtxLlmRuntime } from './claude-code-runtime.js';
|
||||||
|
import { CodexKtxLlmRuntime } from './codex-runtime.js';
|
||||||
|
import type { SubprocessRuntimeForkSpec } from './runtime-port.js';
|
||||||
|
import type { SubprocessGenerateObjectRequest, SubprocessGenerateObjectResponse } from './subprocess-generate-object.js';
|
||||||
|
|
||||||
|
// Forked by the parent as a process-group leader it can SIGKILL as a tree. Hosts
|
||||||
|
// one structured LLM call for a subprocess-backed runtime (codex/claude-code);
|
||||||
|
// the SDK spawns the model binary as this process's own child, so a parent
|
||||||
|
// tree-kill reaps the wedged model too. Credentials flow via inherited env — the
|
||||||
|
// runtimes re-derive their allowlisted env from process.env — never over IPC.
|
||||||
|
|
||||||
|
function buildRuntime(forkSpec: SubprocessRuntimeForkSpec): CodexKtxLlmRuntime | ClaudeCodeKtxLlmRuntime {
|
||||||
|
if (forkSpec.backend === 'codex') {
|
||||||
|
return new CodexKtxLlmRuntime({ projectDir: forkSpec.projectDir, modelSlots: forkSpec.modelSlots });
|
||||||
|
}
|
||||||
|
return new ClaudeCodeKtxLlmRuntime({ projectDir: forkSpec.projectDir, modelSlots: forkSpec.modelSlots });
|
||||||
|
}
|
||||||
|
|
||||||
|
// The parent owns this process's lifecycle. If the parent dies its IPC channel
|
||||||
|
// drops; exit rather than linger as an orphan holding a provider connection.
|
||||||
|
process.once('disconnect', () => process.exit(0));
|
||||||
|
|
||||||
|
process.once('message', (request: SubprocessGenerateObjectRequest) => {
|
||||||
|
void (async () => {
|
||||||
|
let response: SubprocessGenerateObjectResponse;
|
||||||
|
try {
|
||||||
|
const output = await buildRuntime(request.forkSpec).generateStructuredJson({
|
||||||
|
role: request.role,
|
||||||
|
prompt: request.prompt,
|
||||||
|
...(request.system !== undefined ? { system: request.system } : {}),
|
||||||
|
jsonSchema: request.jsonSchema,
|
||||||
|
});
|
||||||
|
response = { ok: true, output };
|
||||||
|
} catch (error) {
|
||||||
|
response = { ok: false, message: error instanceof Error ? error.message : String(error) };
|
||||||
|
}
|
||||||
|
process.send?.(response, () => process.exit(0));
|
||||||
|
})();
|
||||||
|
});
|
||||||
152
packages/cli/src/context/llm/subprocess-generate-object.ts
Normal file
|
|
@@ -0,0 +1,152 @@
|
||||||
|
import { fork, spawn, type ChildProcess } from 'node:child_process';
|
||||||
|
import { existsSync } from 'node:fs';
|
||||||
|
import { fileURLToPath } from 'node:url';
|
||||||
|
import type { z } from 'zod';
|
||||||
|
import type { KtxModelRole } from '../../llm/types.js';
|
||||||
|
import { createAbortError } from '../core/abort.js';
|
||||||
|
import type { SubprocessRuntimeForkSpec } from './runtime-port.js';
|
||||||
|
|
||||||
|
export interface SubprocessGenerateObjectRequest {
|
||||||
|
forkSpec: SubprocessRuntimeForkSpec;
|
||||||
|
role: KtxModelRole;
|
||||||
|
prompt: string;
|
||||||
|
system?: string;
|
||||||
|
jsonSchema: Record<string, unknown>;
|
||||||
|
}
|
||||||
|
|
||||||
|
export type SubprocessGenerateObjectResponse = { ok: true; output: unknown } | { ok: false; message: string };
|
||||||
|
|
||||||
|
// In dist, this file and the child are siblings; under vitest the compiled .js is
|
||||||
|
// absent and Node strips types from the .ts. The real child imports the codex /
|
||||||
|
// claude SDKs (which use constructor parameter properties), so it only runs as
|
||||||
|
// built .js — tests inject a fake child via the spawnChild seam.
|
||||||
|
function childUrl(): URL {
|
||||||
|
const builtChild = new URL('./subprocess-generate-object-child.js', import.meta.url);
|
||||||
|
return existsSync(fileURLToPath(builtChild)) ? builtChild : new URL('./subprocess-generate-object-child.ts', import.meta.url);
|
||||||
|
}
|
||||||
|
|
||||||
|
function forkSubprocessGenerateObjectChild(): ChildProcess {
|
||||||
|
// detached: the child becomes a process-group leader so the SDK's grandchild
|
||||||
|
// (the codex/claude binary) inherits its group and a negative-pid SIGKILL reaps
|
||||||
|
// the whole tree. Empty execArgv keeps it a clean Node process.
|
||||||
|
return fork(childUrl(), {
|
||||||
|
execArgv: [],
|
||||||
|
serialization: 'advanced',
|
||||||
|
detached: true,
|
||||||
|
stdio: ['ignore', 'ignore', 'inherit', 'ipc'],
|
||||||
|
});
|
||||||
|
}
|
||||||
|
|
||||||
|
/** A per-table enrichment subprocess that did not return before its deadline. */
|
||||||
|
export class KtxSubprocessDeadlineError extends Error {
|
||||||
|
constructor(public readonly deadlineMs: number) {
|
||||||
|
super(`enrichment subprocess exceeded ${Math.round(deadlineMs / 1000)}s`);
|
||||||
|
this.name = 'KtxSubprocessDeadlineError';
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
// SIGTERM is too gentle for a child wedged on a hung provider socket; the SDK
|
||||||
|
// grandchild ignores it and survives. Kill the whole tree: the detached process
|
||||||
|
// group on POSIX, the process tree via taskkill /T on Windows.
|
||||||
|
function killProcessTree(child: ChildProcess): void {
|
||||||
|
if (child.pid === undefined) {
|
||||||
|
return;
|
||||||
|
}
|
||||||
|
if (process.platform === 'win32') {
|
||||||
|
spawn('taskkill', ['/pid', String(child.pid), '/T', '/F'], { stdio: 'ignore' }).on('error', () => undefined);
|
||||||
|
return;
|
||||||
|
}
|
||||||
|
try {
|
||||||
|
process.kill(-child.pid, 'SIGKILL');
|
||||||
|
} catch {
|
||||||
|
try {
|
||||||
|
child.kill('SIGKILL');
|
||||||
|
} catch {
|
||||||
|
// Already exited.
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
export interface RunGenerateObjectInSubprocessInput<TOutput, TSchema extends z.ZodType<TOutput>> {
|
||||||
|
forkSpec: SubprocessRuntimeForkSpec;
|
||||||
|
role: KtxModelRole;
|
||||||
|
prompt: string;
|
||||||
|
system?: string;
|
||||||
|
schema: TSchema;
|
||||||
|
jsonSchema: Record<string, unknown>;
|
||||||
|
deadlineMs: number;
|
||||||
|
signal?: AbortSignal;
|
||||||
|
/** @internal Test seam: spawn the child so tests can observe its lifecycle. */
|
||||||
|
spawnChild?: () => ChildProcess;
|
||||||
|
}
|
||||||
|
|
||||||
|
/**
|
||||||
|
* Run one structured LLM call for a subprocess-backed runtime behind a boundary
|
||||||
|
* ktx can hard-kill. On the deadline or an external abort, the whole process
|
||||||
|
* group/tree is SIGKILLed (reaping the SDK's wedged model child) and the promise
|
||||||
|
* settles promptly; on success the raw output is validated against the Zod schema.
|
||||||
|
*/
|
||||||
|
export function runGenerateObjectInSubprocess<TOutput, TSchema extends z.ZodType<TOutput>>(
|
||||||
|
input: RunGenerateObjectInSubprocessInput<TOutput, TSchema>,
|
||||||
|
): Promise<TOutput> {
|
||||||
|
return new Promise<TOutput>((resolvePromise, rejectPromise) => {
|
||||||
|
const child = (input.spawnChild ?? forkSubprocessGenerateObjectChild)();
|
||||||
|
let settled = false;
|
||||||
|
const onDeadline = () => settle(() => rejectPromise(new KtxSubprocessDeadlineError(input.deadlineMs)));
|
||||||
|
const onAbort = () => settle(() => rejectPromise(createAbortError()));
|
||||||
|
const timer = setTimeout(onDeadline, input.deadlineMs);
|
||||||
|
function settle(finish: () => void): void {
|
||||||
|
if (settled) {
|
||||||
|
return;
|
||||||
|
}
|
||||||
|
settled = true;
|
||||||
|
clearTimeout(timer);
|
||||||
|
input.signal?.removeEventListener('abort', onAbort);
|
||||||
|
if (child.exitCode === null && child.signalCode === null) {
|
||||||
|
killProcessTree(child);
|
||||||
|
}
|
||||||
|
finish();
|
||||||
|
}
|
||||||
|
child.on('message', (message: SubprocessGenerateObjectResponse) => {
|
||||||
|
if (message.ok) {
|
||||||
|
let parsed: TOutput;
|
||||||
|
try {
|
||||||
|
parsed = input.schema.parse(message.output);
|
||||||
|
} catch (error) {
|
||||||
|
settle(() => rejectPromise(error instanceof Error ? error : new Error(String(error))));
|
||||||
|
return;
|
||||||
|
}
|
||||||
|
settle(() => resolvePromise(parsed));
|
||||||
|
} else {
|
||||||
|
settle(() => rejectPromise(new Error(message.message)));
|
||||||
|
}
|
||||||
|
});
|
||||||
|
child.on('error', (error) => settle(() => rejectPromise(error)));
|
||||||
|
child.on('exit', (code, processSignal) => {
|
||||||
|
if (!settled) {
|
||||||
|
settle(() =>
|
||||||
|
rejectPromise(
|
||||||
|
new Error(`enrichment subprocess exited before returning a result (code ${code}, signal ${processSignal}).`),
|
||||||
|
),
|
||||||
|
);
|
||||||
|
}
|
||||||
|
});
|
||||||
|
if (input.signal?.aborted) {
|
||||||
|
onAbort();
|
||||||
|
return;
|
||||||
|
}
|
||||||
|
input.signal?.addEventListener('abort', onAbort, { once: true });
|
||||||
|
try {
|
||||||
|
const request: SubprocessGenerateObjectRequest = {
|
||||||
|
forkSpec: input.forkSpec,
|
||||||
|
role: input.role,
|
||||||
|
prompt: input.prompt,
|
||||||
|
...(input.system !== undefined ? { system: input.system } : {}),
|
||||||
|
jsonSchema: input.jsonSchema,
|
||||||
|
};
|
||||||
|
child.send(request);
|
||||||
|
} catch (error) {
|
||||||
|
settle(() => rejectPromise(error instanceof Error ? error : new Error(String(error))));
|
||||||
|
}
|
||||||
|
});
|
||||||
|
}
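The `settle` guard in `runGenerateObjectInSubprocess` lets whichever event fires first (deadline, abort, message, error, or exit) decide the outcome, and turns every later event into a no-op. A minimal standalone sketch of that pattern follows; `createSettleOnce` is a hypothetical helper written for illustration, not a function from the ktx source, and the cleanup callback stands in for the timer clear, listener removal, and tree-kill done in the real code.

```typescript
// Hypothetical helper (not part of ktx): returns a settle function that runs
// cleanup and the first finish callback exactly once, ignoring later calls.
function createSettleOnce(cleanup: () => void): (finish: () => void) => boolean {
  let settled = false;
  return (finish: () => void): boolean => {
    if (settled) {
      return false; // A competing event already decided the outcome.
    }
    settled = true;
    cleanup(); // Stand-in for clearTimeout / removeEventListener / tree-kill.
    finish();
    return true;
  };
}

// Two competing outcomes: only the first one is recorded.
const settleLog: string[] = [];
const settleOnce = createSettleOnce(() => {
  settleLog.push('cleanup');
});
const firstWon = settleOnce(() => {
  settleLog.push('resolved');
});
const secondWon = settleOnce(() => {
  settleLog.push('deadline');
});
```

Running cleanup before `finish` mirrors the ordering in the diff, where the child is killed before the promise is resolved or rejected.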
|
||||||
|
|
@@ -11,6 +11,7 @@ import {
|
||||||
} from '../../telemetry/index.js';
|
} from '../../telemetry/index.js';
|
||||||
import { collectTelemetryRedactionSecrets } from '../../telemetry/redaction-secrets.js';
|
import { collectTelemetryRedactionSecrets } from '../../telemetry/redaction-secrets.js';
|
||||||
import { formatErrorDetail, scrubErrorClass } from '../../telemetry/scrubber.js';
|
import { formatErrorDetail, scrubErrorClass } from '../../telemetry/scrubber.js';
|
||||||
|
import { mcpSlowToolMs, serializeMcpError, type KtxMcpLogger } from './logger.js';
|
||||||
import type {
|
import type {
|
||||||
KtxMcpClientInfo,
|
KtxMcpClientInfo,
|
||||||
KtxMcpContextPorts,
|
KtxMcpContextPorts,
|
||||||
|
|
@@ -29,6 +30,7 @@ export interface RegisterKtxContextToolsDeps {
|
||||||
userContext: KtxMcpUserContext;
|
userContext: KtxMcpUserContext;
|
||||||
projectDir?: string;
|
projectDir?: string;
|
||||||
io?: KtxCliIo;
|
io?: KtxCliIo;
|
||||||
|
logger?: KtxMcpLogger;
|
||||||
getClientInfo?: () => KtxMcpClientInfo | undefined;
|
getClientInfo?: () => KtxMcpClientInfo | undefined;
|
||||||
}
|
}
|
||||||
|
|
||||||
|
|
@@ -50,6 +52,7 @@ const toolAnnotations = {
|
||||||
sl_read_source: { title: 'Semantic Layer Read Source', readOnlyHint: true, idempotentHint: true, openWorldHint: false },
|
sl_read_source: { title: 'Semantic Layer Read Source', readOnlyHint: true, idempotentHint: true, openWorldHint: false },
|
||||||
sl_query: { title: 'Semantic Layer Query', readOnlyHint: true, openWorldHint: false },
|
sl_query: { title: 'Semantic Layer Query', readOnlyHint: true, openWorldHint: false },
|
||||||
sql_execution: { title: 'SQL Execution', readOnlyHint: true, openWorldHint: false },
|
sql_execution: { title: 'SQL Execution', readOnlyHint: true, openWorldHint: false },
|
||||||
|
sql_dialect_notes: { title: 'SQL Dialect Notes', readOnlyHint: true, idempotentHint: true, openWorldHint: false },
|
||||||
memory_ingest: { title: 'Memory Ingest', destructiveHint: true, openWorldHint: false },
|
memory_ingest: { title: 'Memory Ingest', destructiveHint: true, openWorldHint: false },
|
||||||
memory_ingest_status: { title: 'Memory Ingest Status', readOnlyHint: true, openWorldHint: false },
|
memory_ingest_status: { title: 'Memory Ingest Status', readOnlyHint: true, openWorldHint: false },
|
||||||
} satisfies Record<string, ToolAnnotations>;
|
} satisfies Record<string, ToolAnnotations>;
|
||||||
|
|
@@ -60,7 +63,7 @@ const toolDescriptions = {
|
||||||
discover_data:
|
discover_data:
|
||||||
'Search across ktx wiki pages, semantic-layer sources, measures, dimensions, raw tables, and columns. Example: discover_data({ query: "monthly orders by customer", connectionId: "warehouse", kinds: ["sl_source", "table"] }).',
|
'Search across ktx wiki pages, semantic-layer sources, measures, dimensions, raw tables, and columns. Example: discover_data({ query: "monthly orders by customer", connectionId: "warehouse", kinds: ["sl_source", "table"] }).',
|
||||||
wiki_search:
|
wiki_search:
|
||||||
'Search ktx wiki pages for reusable business context. Example: wiki_search({ query: "revenue recognition", limit: 5 }).',
|
'Search ktx wiki pages for reusable business context. Pass connectionId to scope results to one warehouse (unscoped pages plus pages tagged with that connection) when a concept name collides across databases. Example: wiki_search({ query: "revenue recognition", connectionId: "warehouse", limit: 5 }).',
|
||||||
wiki_read: 'Read a ktx wiki page by key returned from wiki_search. Example: wiki_read({ key: "global/revenue" }).',
|
wiki_read: 'Read a ktx wiki page by key returned from wiki_search. Example: wiki_read({ key: "global/revenue" }).',
|
||||||
entity_details:
|
entity_details:
|
||||||
'Read table and column metadata from the latest live-database scan snapshot. Example: entity_details({ connectionId: "warehouse", entities: [{ table: { catalog: null, db: "public", name: "orders" }, columns: ["id"] }] }).',
|
'Read table and column metadata from the latest live-database scan snapshot. Example: entity_details({ connectionId: "warehouse", entities: [{ table: { catalog: null, db: "public", name: "orders" }, columns: ["id"] }] }).',
|
||||||
|
|
@@ -72,6 +75,8 @@ const toolDescriptions = {
|
||||||
'Execute a semantic-layer query and return headers, rows, and total row count, plus correctness notes (e.g. compile-only or fan-out) when relevant. The generated SQL and full query plan are omitted by default; request them with include: ["sql"] and/or include: ["plan"]. Example: sl_query({ connectionId: "warehouse", measures: ["orders.order_count"], dimensions: [{ field: "orders.created_at", granularity: "month" }], include: ["sql"] }).',
|
'Execute a semantic-layer query and return headers, rows, and total row count, plus correctness notes (e.g. compile-only or fan-out) when relevant. The generated SQL and full query plan are omitted by default; request them with include: ["sql"] and/or include: ["plan"]. Example: sl_query({ connectionId: "warehouse", measures: ["orders.order_count"], dimensions: [{ field: "orders.created_at", granularity: "month" }], include: ["sql"] }).',
|
||||||
sql_execution:
|
sql_execution:
|
||||||
'Execute one parser-validated read-only SQL query against a configured ktx connection. Example: sql_execution({ connectionId: "warehouse", sql: "select count(*) from public.orders", maxRows: 100 }).',
|
'Execute one parser-validated read-only SQL query against a configured ktx connection. Example: sql_execution({ connectionId: "warehouse", sql: "select count(*) from public.orders", maxRows: 100 }).',
|
||||||
|
sql_dialect_notes:
|
||||||
|
'Return the SQL syntax conventions for the dialect of a ktx connection: fully-qualified table-name form, identifier quoting and case-folding, date/time functions, top-N / window-filtering idiom, and JSON access. Call this before writing raw sql_execution SQL against a connection so the SQL matches that engine. Example: sql_dialect_notes({ connectionId: "warehouse" }).',
|
||||||
memory_ingest:
|
memory_ingest:
|
||||||
'Ingest free-form markdown knowledge into durable ktx memory. Use this for business rules, metric definitions, schema gotchas, recurring findings, or explicit user requests to remember something. Example: memory_ingest({ connectionId: "warehouse", content: "ARR is reported in cents in this warehouse." }).',
|
'Ingest free-form markdown knowledge into durable ktx memory. Use this for business rules, metric definitions, schema gotchas, recurring findings, or explicit user requests to remember something. Example: memory_ingest({ connectionId: "warehouse", content: "ARR is reported in cents in this warehouse." }).',
|
||||||
memory_ingest_status:
|
memory_ingest_status:
|
||||||
|
|
@@ -83,6 +88,11 @@ const connectionListSchema = z.object({});
|
||||||
const knowledgeSearchSchema = z.object({
|
const knowledgeSearchSchema = z.object({
|
||||||
query: z.string().min(1).describe('Natural-language wiki search query, e.g. "revenue recognition policy".'),
|
query: z.string().min(1).describe('Natural-language wiki search query, e.g. "revenue recognition policy".'),
|
||||||
limit: z.number().int().min(1).max(50).default(10).describe('Maximum wiki pages to return.'),
|
limit: z.number().int().min(1).max(50).default(10).describe('Maximum wiki pages to return.'),
|
||||||
|
connectionId: connectionIdSchema
|
||||||
|
.optional()
|
||||||
|
.describe(
|
||||||
|
'Scope results to one connection: returns unscoped pages plus pages tagged with this connection. Omit to search all pages.',
|
||||||
|
),
|
||||||
});
|
});
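The `connectionId` description above says a scoped search returns unscoped pages plus pages tagged with that connection, and that omitting it searches everything. A minimal sketch of that rule, assuming a hypothetical `WikiPageSketch` shape and `scopePages` helper (neither is taken from the ktx source; the real filter runs at the disk-load seam and may differ in detail):

```typescript
// Hypothetical page shape for illustration; the real ktx page type may differ.
interface WikiPageSketch {
  key: string;
  connections?: string[]; // Absent or empty means unscoped (applies to all).
}

// Returns unscoped pages plus pages tagged with connectionId.
// With no connectionId, every page is a candidate.
function scopePages(pages: WikiPageSketch[], connectionId?: string): WikiPageSketch[] {
  if (connectionId === undefined) {
    return pages;
  }
  return pages.filter((page) => {
    const tagged = page.connections ?? [];
    return tagged.length === 0 || tagged.indexOf(connectionId) !== -1;
  });
}

const samplePages: WikiPageSketch[] = [
  { key: 'global/revenue' },
  { key: 'warehouse/orders', connections: ['warehouse'] },
  { key: 'crm/accounts', connections: ['crm'] },
];
const scopedKeys = scopePages(samplePages, 'warehouse').map((page) => page.key);
const allKeys = scopePages(samplePages).map((page) => page.key);
```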
|
||||||
|
|
||||||
const knowledgeReadSchema = z.object({
|
const knowledgeReadSchema = z.object({
|
||||||
|
|
@@ -203,6 +213,10 @@ const sqlExecutionSchema = z.object({
|
||||||
maxRows: z.number().int().min(1).max(10_000).default(1000).optional().describe('Maximum rows to return.'),
|
maxRows: z.number().int().min(1).max(10_000).default(1000).optional().describe('Maximum rows to return.'),
|
||||||
});
|
});
|
||||||
|
|
||||||
|
const sqlDialectNotesSchema = z.object({
|
||||||
|
connectionId: connectionIdSchema.describe('Connection id whose engine dialect conventions to return.'),
|
||||||
|
});
|
||||||
|
|
||||||
const memoryIngestSchema = z.object({
|
const memoryIngestSchema = z.object({
|
||||||
content: z
|
content: z
|
||||||
.string()
|
.string()
|
||||||
|
|
@@ -405,6 +419,12 @@ const sqlExecutionOutputSchema = z.object({
|
||||||
rowCount: z.number(),
|
rowCount: z.number(),
|
||||||
});
|
});
|
||||||
|
|
||||||
|
const sqlDialectNotesOutputSchema = z.object({
|
||||||
|
connectionId: z.string(),
|
||||||
|
dialect: z.string(),
|
||||||
|
notes: z.string(),
|
||||||
|
});
|
||||||
|
|
||||||
const memoryIngestOutputSchema = z.object({
|
const memoryIngestOutputSchema = z.object({
|
||||||
runId: z.string(),
|
runId: z.string(),
|
||||||
});
|
});
|
||||||
|
|
@@ -566,6 +586,63 @@ function clientTelemetryFields(
|
||||||
};
|
};
|
||||||
}
|
}
|
||||||
|
|
||||||
|
function toolResultIsError(result: unknown): boolean {
|
||||||
|
return (
|
||||||
|
typeof result === 'object' && result !== null && 'isError' in result && (result as { isError?: unknown }).isError === true
|
||||||
|
);
|
||||||
|
}
|
||||||
|
|
||||||
|
/** Tool-agnostic size: byte length of the serialized text content the client reads. */
|
||||||
|
function toolResultSize(result: unknown): number {
|
||||||
|
if (typeof result !== 'object' || result === null || !('content' in result)) {
|
||||||
|
return 0;
|
||||||
|
}
|
||||||
|
const content = (result as { content?: unknown }).content;
|
||||||
|
if (!Array.isArray(content)) {
|
||||||
|
return 0;
|
||||||
|
}
|
||||||
|
let size = 0;
|
||||||
|
for (const item of content) {
|
||||||
|
if (item && typeof item === 'object' && (item as { type?: unknown }).type === 'text') {
|
||||||
|
const text = (item as { text?: unknown }).text;
|
||||||
|
if (typeof text === 'string') {
|
||||||
|
size += Buffer.byteLength(text, 'utf8');
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
return size;
|
||||||
|
}
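`toolResultSize` counts bytes rather than characters, which matters once the text contains non-ASCII characters. The sketch below illustrates the difference under one stated substitution: it uses `TextEncoder` instead of `Buffer.byteLength` so it runs without Node typings, and `textContentBytes` is a hypothetical name, not the function from the diff.

```typescript
// Sums the UTF-8 byte length of every text item; non-text items are skipped.
// Assumption: TextEncoder replaces Buffer.byteLength(text, 'utf8') here.
function textContentBytes(result: unknown): number {
  if (typeof result !== 'object' || result === null || !('content' in result)) {
    return 0;
  }
  const content = (result as { content?: unknown }).content;
  if (!Array.isArray(content)) {
    return 0;
  }
  const encoder = new TextEncoder();
  let size = 0;
  for (const item of content) {
    if (item && typeof item === 'object' && (item as { type?: unknown }).type === 'text') {
      const text = (item as { text?: unknown }).text;
      if (typeof text === 'string') {
        size += encoder.encode(text).length;
      }
    }
  }
  return size;
}

const sampleResult = {
  content: [
    { type: 'text', text: 'h\u00e9llo' }, // 5 characters, 6 UTF-8 bytes
    { type: 'image', data: 'ignored' }, // Not text, so not counted
    { type: 'text', text: 'ok' }, // 2 characters, 2 bytes
  ],
};
const sampleBytes = textContentBytes(sampleResult); // 8
const missingContentBytes = textContentBytes({ isError: true }); // 0
```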
|
||||||
|
|
||||||
|
function toolResultErrorText(result: unknown): string {
|
||||||
|
if (typeof result === 'object' && result !== null && 'content' in result) {
|
||||||
|
const content = (result as { content?: unknown }).content;
|
||||||
|
if (Array.isArray(content)) {
|
||||||
|
const text = content
|
||||||
|
.filter(
|
||||||
|
(item): item is { type: 'text'; text: string } =>
|
||||||
|
!!item &&
|
||||||
|
typeof item === 'object' &&
|
||||||
|
(item as { type?: unknown }).type === 'text' &&
|
||||||
|
typeof (item as { text?: unknown }).text === 'string',
|
||||||
|
)
|
||||||
|
.map((item) => item.text)
|
||||||
|
.join('\n');
|
||||||
|
if (text.length > 0) {
|
||||||
|
return text;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
return 'Tool returned an error result.';
|
||||||
|
}
|
||||||
|
|
||||||
|
interface InstrumentMcpServerDeps {
|
||||||
|
projectDir?: string;
|
||||||
|
io?: KtxCliIo;
|
||||||
|
logger?: KtxMcpLogger;
|
||||||
|
slowToolMs: number;
|
||||||
|
getClientInfo?: () => KtxMcpClientInfo | undefined;
|
||||||
|
}
|
||||||
|
|
||||||
// Tools registered via registerParsedTool catch their own errors and return an
|
// Tools registered via registerParsedTool catch their own errors and return an
|
||||||
// isError result, so the telemetry layer never sees the thrown Error. Recover
|
// isError result, so the telemetry layer never sees the thrown Error. Recover
|
||||||
// the failure message from the result's text content (the same string the agent
|
// the failure message from the result's text content (the same string the agent
|
||||||
|
|
@@ -588,68 +665,91 @@ function mcpErrorResultDetail(result: unknown): string | undefined {
|
||||||
return formatErrorDetail(text);
|
return formatErrorDetail(text);
|
||||||
}
|
}
|
||||||
|
|
||||||
function instrumentMcpServer(
|
function instrumentMcpServer(server: KtxMcpServerLike, deps: InstrumentMcpServerDeps): KtxMcpServerLike {
|
||||||
server: KtxMcpServerLike,
|
|
||||||
telemetry: { projectDir?: string; io?: KtxCliIo; getClientInfo?: () => KtxMcpClientInfo | undefined },
|
|
||||||
): KtxMcpServerLike {
|
|
||||||
return {
|
return {
|
||||||
registerTool(name, config, handler) {
|
registerTool(name, config, handler) {
|
||||||
server.registerTool(name, config, async (input, context) => {
|
server.registerTool(name, config, async (input, context) => {
|
||||||
|
const callId = randomUUID();
|
||||||
|
const callLogger = deps.logger?.child({
|
||||||
|
tool: name,
|
||||||
|
callId,
|
||||||
|
...(context?.sessionId ? { sessionId: context.sessionId } : {}),
|
||||||
|
});
|
||||||
const startedAt = performance.now();
|
const startedAt = performance.now();
|
||||||
|
// Synchronous, before the (possibly blocking) handler: a runaway query that never
|
||||||
|
// returns still leaves this start line — with its exact params — on disk.
|
||||||
|
callLogger?.info({ params: input }, 'tool.start');
|
||||||
try {
|
try {
|
||||||
const result = await handler(input, context);
|
const result = await handler(input, context);
|
||||||
if (telemetry.io && telemetry.projectDir && shouldEmitMcpTelemetry()) {
|
const durationMs = Math.max(0, performance.now() - startedAt);
|
||||||
const isError =
|
const isError = toolResultIsError(result);
|
||||||
typeof result === 'object' && result !== null && 'isError' in result && result.isError === true;
|
if (deps.io && deps.projectDir && shouldEmitMcpTelemetry()) {
|
||||||
const errorDetail = isError ? mcpErrorResultDetail(result) : undefined;
|
const errorDetail = isError ? mcpErrorResultDetail(result) : undefined;
|
||||||
await emitTelemetryEvent({
|
await emitTelemetryEvent({
|
||||||
name: 'mcp_request_completed',
|
name: 'mcp_request_completed',
|
||||||
projectDir: telemetry.projectDir,
|
projectDir: deps.projectDir,
|
||||||
io: telemetry.io,
|
io: deps.io,
|
||||||
fields: {
|
fields: {
|
||||||
toolName: name,
|
toolName: name,
|
||||||
outcome: isError ? 'error' : 'ok',
|
outcome: isError ? 'error' : 'ok',
|
||||||
durationMs: Math.max(0, performance.now() - startedAt),
|
durationMs,
|
||||||
sampleRate: mcpTelemetrySampleRate(),
|
sampleRate: mcpTelemetrySampleRate(),
|
||||||
...(errorDetail ? { errorDetail } : {}),
|
...(errorDetail ? { errorDetail } : {}),
|
||||||
...clientTelemetryFields(telemetry.getClientInfo),
|
...clientTelemetryFields(deps.getClientInfo),
|
||||||
},
|
},
|
||||||
});
|
});
|
||||||
}
|
}
|
||||||
|
if (callLogger) {
|
||||||
|
if (isError) {
|
||||||
|
callLogger.error(
|
||||||
|
{ durationMs, outcome: 'error', err: serializeMcpError(toolResultErrorText(result)) },
|
||||||
|
'tool.end',
|
||||||
|
);
|
||||||
|
} else {
|
||||||
|
const fields = { durationMs, outcome: 'ok' as const, resultSize: toolResultSize(result) };
|
||||||
|
if (durationMs > deps.slowToolMs) {
|
||||||
|
callLogger.warn(fields, 'tool.end');
|
||||||
|
} else {
|
||||||
|
callLogger.info(fields, 'tool.end');
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
return result;
|
return result;
|
||||||
} catch (error) {
|
} catch (error) {
|
||||||
if (telemetry.io) {
|
const durationMs = Math.max(0, performance.now() - startedAt);
|
||||||
|
if (deps.io) {
|
||||||
await reportException({
|
await reportException({
|
||||||
error,
|
error,
|
||||||
context: { source: `mcp:${name}`, handled: true, fatal: false },
|
context: { source: `mcp:${name}`, handled: true, fatal: false },
|
||||||
projectDir: telemetry.projectDir,
|
projectDir: deps.projectDir,
|
||||||
io: telemetry.io,
|
io: deps.io,
|
||||||
redactionSecrets: await collectTelemetryRedactionSecrets({
|
redactionSecrets: await collectTelemetryRedactionSecrets({
|
||||||
projectDir: telemetry.projectDir,
|
projectDir: deps.projectDir,
|
||||||
includeLlm: true,
|
includeLlm: true,
|
||||||
includeEmbeddings: true,
|
includeEmbeddings: true,
|
||||||
env: process.env,
|
env: process.env,
|
||||||
}),
|
}),
|
||||||
});
|
});
|
||||||
}
|
}
|
||||||
if (telemetry.io && telemetry.projectDir && shouldEmitMcpTelemetry()) {
|
if (deps.io && deps.projectDir && shouldEmitMcpTelemetry()) {
|
||||||
const errorClass = scrubErrorClass(error);
|
const errorClass = scrubErrorClass(error);
|
||||||
const errorDetail = formatErrorDetail(error);
|
const errorDetail = formatErrorDetail(error);
|
||||||
await emitTelemetryEvent({
|
await emitTelemetryEvent({
|
||||||
name: 'mcp_request_completed',
|
name: 'mcp_request_completed',
|
||||||
projectDir: telemetry.projectDir,
|
projectDir: deps.projectDir,
|
||||||
io: telemetry.io,
|
io: deps.io,
|
||||||
fields: {
|
fields: {
|
||||||
toolName: name,
|
toolName: name,
|
||||||
outcome: 'error',
|
outcome: 'error',
|
||||||
...(errorClass ? { errorClass } : {}),
|
...(errorClass ? { errorClass } : {}),
|
||||||
...(errorDetail ? { errorDetail } : {}),
|
...(errorDetail ? { errorDetail } : {}),
|
||||||
durationMs: Math.max(0, performance.now() - startedAt),
|
durationMs,
|
||||||
sampleRate: mcpTelemetrySampleRate(),
|
sampleRate: mcpTelemetrySampleRate(),
|
||||||
...clientTelemetryFields(telemetry.getClientInfo),
|
...clientTelemetryFields(deps.getClientInfo),
|
||||||
},
|
},
|
||||||
});
|
});
|
||||||
}
|
}
|
||||||
|
callLogger?.error({ durationMs, outcome: 'error', err: serializeMcpError(error) }, 'tool.end');
|
||||||
throw error;
|
throw error;
|
||||||
}
|
}
|
||||||
});
|
});
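The instrumented handler above logs `tool.end` at one of three levels: `error` for an error result or a thrown error, `warn` for a successful call slower than `slowToolMs`, and `info` otherwise. A minimal sketch of that decision as a pure function; `toolEndLevel` and `ToolEndLevel` are hypothetical names introduced here, not part of the ktx source.

```typescript
type ToolEndLevel = 'info' | 'warn' | 'error';

// Mirrors the branching in the diff: errors always log at error level;
// successful calls log at warn only when strictly slower than the threshold.
function toolEndLevel(isError: boolean, durationMs: number, slowToolMs: number): ToolEndLevel {
  if (isError) {
    return 'error';
  }
  return durationMs > slowToolMs ? 'warn' : 'info';
}

const fastLevel = toolEndLevel(false, 50, 1000);
const slowLevel = toolEndLevel(false, 1500, 1000);
const boundaryLevel = toolEndLevel(false, 1000, 1000);
const errorLevel = toolEndLevel(true, 10, 1000);
```

A call that takes exactly `slowToolMs` still logs at `info`, because the comparison in the diff is strict.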
|
||||||
|
|
@@ -663,6 +763,8 @@ export function registerKtxContextTools(deps: RegisterKtxContextToolsDeps): void
|
||||||
const server = instrumentMcpServer(deps.server, {
|
const server = instrumentMcpServer(deps.server, {
|
||||||
projectDir: deps.projectDir,
|
projectDir: deps.projectDir,
|
||||||
io: deps.io,
|
io: deps.io,
|
||||||
|
logger: deps.logger,
|
||||||
|
slowToolMs: mcpSlowToolMs(),
|
||||||
getClientInfo: deps.getClientInfo,
|
getClientInfo: deps.getClientInfo,
|
||||||
});
|
});
|
||||||
|
|
||||||
|
|
@@ -703,6 +805,7 @@ export function registerKtxContextTools(deps: RegisterKtxContextToolsDeps): void
|
||||||
userId: userContext.userId,
|
userId: userContext.userId,
|
||||||
query: input.query,
|
query: input.query,
|
||||||
limit: input.limit,
|
limit: input.limit,
|
||||||
|
...(input.connectionId !== undefined ? { connectionId: input.connectionId } : {}),
|
||||||
}),
|
}),
|
||||||
),
|
),
|
||||||
toolTelemetry,
|
toolTelemetry,
|
||||||
|
|
@@ -867,6 +970,24 @@ export function registerKtxContextTools(deps: RegisterKtxContextToolsDeps): void
|
||||||
);
|
);
|
||||||
}
|
}
|
||||||
|
|
||||||
|
if (ports.dialectNotes) {
|
||||||
|
const dialectNotes = ports.dialectNotes;
|
||||||
|
registerParsedTool(
|
||||||
|
server,
|
||||||
|
'sql_dialect_notes',
|
||||||
|
{
|
||||||
|
title: toolAnnotations.sql_dialect_notes.title!,
|
||||||
|
description: toolDescriptions.sql_dialect_notes,
|
||||||
|
inputSchema: sqlDialectNotesSchema.shape,
|
||||||
|
outputSchema: sqlDialectNotesOutputSchema,
|
||||||
|
annotations: toolAnnotations.sql_dialect_notes,
|
||||||
|
},
|
||||||
|
sqlDialectNotesSchema,
|
||||||
|
async (input) => jsonToolResult(await dialectNotes.read(input)),
|
||||||
|
toolTelemetry,
|
||||||
|
);
|
||||||
|
}
|
||||||
|
|
||||||
if (ports.memoryIngest) {
|
if (ports.memoryIngest) {
|
||||||
const memoryIngest = ports.memoryIngest;
|
const memoryIngest = ports.memoryIngest;
|
||||||
registerParsedTool(
|
registerParsedTool(
|
||||||
|
|
|
||||||
|
|
@ -1,5 +1,8 @@
|
||||||
import type { KtxSqlQueryExecutorPort } from '../../context/connections/query-executor.js';
|
import type { KtxSqlQueryExecutorPort } from '../../context/connections/query-executor.js';
|
||||||
import { KtxExpectedError, KtxQueryError, isNativeProgrammingFault } from '../../errors.js';
|
import { KtxExpectedError, KtxQueryError, isNativeProgrammingFault } from '../../errors.js';
|
||||||
|
import { isDatabaseDriver, normalizeConnectionDriver } from '../../connection-drivers.js';
|
||||||
|
import { sqlDialectNotes } from '../../context/sql-analysis/dialect-notes.js';
|
||||||
|
import type { KtxProjectConnectionConfig } from '../../context/project/config.js';
|
||||||
import { executeProjectReadOnlySql } from '../../context/connections/project-sql-executor.js';
|
import { executeProjectReadOnlySql } from '../../context/connections/project-sql-executor.js';
|
||||||
import { FEDERATED_CONNECTION_ID, federatedConnectionListing } from '../../context/connections/federation.js';
|
import { FEDERATED_CONNECTION_ID, federatedConnectionListing } from '../../context/connections/federation.js';
|
||||||
import { assertSqlQueryableConnection } from '../../context/connections/dialects.js';
|
import { assertSqlQueryableConnection } from '../../context/connections/dialects.js';
|
||||||
|
|
@ -20,6 +23,7 @@ import { compileLocalSlQuery } from '../../context/sl/local-query.js';
|
||||||
import { createKtxDictionarySearchService } from '../../context/sl/dictionary-search.js';
|
import { createKtxDictionarySearchService } from '../../context/sl/dictionary-search.js';
|
||||||
import { readLocalSlSource } from '../../context/sl/local-sl.js';
|
import { readLocalSlSource } from '../../context/sl/local-sl.js';
|
||||||
import { assertSafeConnectionId } from '../../context/sl/source-files.js';
|
import { assertSafeConnectionId } from '../../context/sl/source-files.js';
|
||||||
|
import { assertConfiguredConnectionId } from '../../context/connections/configured-connections.js';
|
||||||
import { readLocalKnowledgePage, searchLocalKnowledgePages } from '../wiki/local-knowledge.js';
|
import { readLocalKnowledgePage, searchLocalKnowledgePages } from '../wiki/local-knowledge.js';
|
||||||
import type { KtxMcpContextPorts, KtxMcpProgressCallback, KtxSqlExecutionResponse } from './types.js';
|
import type { KtxMcpContextPorts, KtxMcpProgressCallback, KtxSqlExecutionResponse } from './types.js';
|
||||||
|
|
||||||
|
|
@ -94,6 +98,24 @@ async function executeValidatedReadOnlySql(
|
||||||
return response;
|
return response;
|
||||||
}
|
}
|
||||||
|
|
||||||
|
/** @internal Resolves a connection's dialect SQL notes; throws KtxExpectedError for an unknown or non-SQL-warehouse connection. */
|
||||||
|
export function resolveDialectNotesForConnection(
|
||||||
|
connectionId: string,
|
||||||
|
connection: KtxProjectConnectionConfig | undefined,
|
||||||
|
): { connectionId: string; dialect: string; notes: string } {
|
||||||
|
if (!connection) {
|
||||||
|
throw new KtxExpectedError(`Connection "${connectionId}" is not configured in ktx.yaml`);
|
||||||
|
}
|
||||||
|
const driver = normalizeConnectionDriver(connection);
|
||||||
|
if (!isDatabaseDriver(driver)) {
|
||||||
|
throw new KtxExpectedError(
|
||||||
|
`Connection "${connectionId}" uses the "${driver}" context source, not a SQL warehouse; sql_dialect_notes applies only to SQL database connections.`,
|
||||||
|
);
|
||||||
|
}
|
||||||
|
const dialect = sqlAnalysisDialectForDriver(driver);
|
||||||
|
return { connectionId, dialect, notes: sqlDialectNotes(dialect) };
|
||||||
|
}
|
||||||
|
|
||||||
export function createLocalProjectMcpContextPorts(
|
export function createLocalProjectMcpContextPorts(
|
||||||
project: KtxLocalProject,
|
project: KtxLocalProject,
|
||||||
options: CreateLocalProjectMcpContextPortsOptions,
|
options: CreateLocalProjectMcpContextPortsOptions,
|
||||||
|
|
@ -121,11 +143,16 @@ export function createLocalProjectMcpContextPorts(
|
||||||
},
|
},
|
||||||
knowledge: {
|
knowledge: {
|
||||||
async search(input) {
|
async search(input) {
|
||||||
|
const connectionId =
|
||||||
|
input.connectionId === undefined
|
||||||
|
? undefined
|
||||||
|
: assertConfiguredConnectionId(project.config.connections, input.connectionId);
|
||||||
const results = await searchLocalKnowledgePages(project, {
|
const results = await searchLocalKnowledgePages(project, {
|
||||||
query: input.query,
|
query: input.query,
|
||||||
userId: input.userId,
|
userId: input.userId,
|
||||||
limit: input.limit,
|
limit: input.limit,
|
||||||
embeddingService,
|
embeddingService,
|
||||||
|
...(connectionId !== undefined ? { connectionId } : {}),
|
||||||
});
|
});
|
||||||
return {
|
return {
|
||||||
results: results.slice(0, input.limit).map((result) => ({
|
results: results.slice(0, input.limit).map((result) => ({
|
||||||
|
|
@ -196,6 +223,12 @@ export function createLocalProjectMcpContextPorts(
|
||||||
return createKtxDiscoverDataService(project, { userId: 'local', embeddingService }).search(input);
|
return createKtxDiscoverDataService(project, { userId: 'local', embeddingService }).search(input);
|
||||||
},
|
},
|
||||||
},
|
},
|
||||||
|
dialectNotes: {
|
||||||
|
async read(input) {
|
||||||
|
const connectionId = assertSafeConnectionId(input.connectionId);
|
||||||
|
return resolveDialectNotesForConnection(connectionId, project.config.connections[connectionId]);
|
||||||
|
},
|
||||||
|
},
|
||||||
};
|
};
|
||||||
|
|
||||||
if (options.sqlAnalysis && options.localScan?.createConnector) {
|
if (options.sqlAnalysis && options.localScan?.createConnector) {
|
||||||
|
|
|
||||||
58
packages/cli/src/context/mcp/logger.ts
Normal file
|
|
@ -0,0 +1,58 @@
|
||||||
|
import { Writable } from 'node:stream';
|
||||||
|
import pino, { type DestinationStream, type Logger } from 'pino';
|
||||||
|
import PinoPretty from 'pino-pretty';
|
||||||
|
import type { KtxCliIo } from '../../cli-runtime.js';
|
||||||
|
|
||||||
|
export type KtxMcpLogger = Logger;
|
||||||
|
|
||||||
|
const LOG_LEVELS = new Set(['trace', 'debug', 'info', 'warn', 'error', 'fatal', 'silent']);
|
||||||
|
|
||||||
|
const DEFAULT_LEVEL = 'info';
|
||||||
|
const DEFAULT_SLOW_TOOL_MS = 10_000;
|
||||||
|
|
||||||
|
/** @internal */
|
||||||
|
export function mcpLogLevel(env: NodeJS.ProcessEnv = process.env): string {
|
||||||
|
const raw = env.KTX_MCP_LOG_LEVEL?.trim().toLowerCase();
|
||||||
|
return raw && LOG_LEVELS.has(raw) ? raw : DEFAULT_LEVEL;
|
||||||
|
}
|
||||||
|
|
||||||
|
/** @internal */
|
||||||
|
export function mcpSlowToolMs(env: NodeJS.ProcessEnv = process.env): number {
|
||||||
|
const raw = Number(env.KTX_MCP_SLOW_TOOL_MS);
|
||||||
|
return Number.isFinite(raw) && raw >= 0 ? raw : DEFAULT_SLOW_TOOL_MS;
|
||||||
|
}
|
||||||
|
|
||||||
|
/**
|
||||||
|
* Serialize an error for a structured `err` field. Genuine `Error`s get pino's
|
||||||
|
* standard serializer (type + message + stack); everything else is reduced to a
|
||||||
|
* message — the in-band tool-error path has already lost the original stack.
|
||||||
|
*/
|
||||||
|
export function serializeMcpError(error: unknown): Record<string, unknown> {
|
||||||
|
if (error instanceof Error) {
|
||||||
|
return { ...pino.stdSerializers.err(error) };
|
||||||
|
}
|
||||||
|
return { message: typeof error === 'string' ? error : String(error) };
|
||||||
|
}
|
||||||
|
|
||||||
|
/**
|
||||||
|
* One synchronous pino logger per MCP server process, written to the `io.stderr`
|
||||||
|
* sink. stderr is the only universally-correct sink: the stdio transport reserves
|
||||||
|
* stdout for JSON-RPC, and the HTTP daemon redirects stderr into `.ktx/logs/mcp.log`.
|
||||||
|
* Synchronous writes are load-bearing — a `tool.start` line must reach the fd before
|
||||||
|
* a blocking handler runs, so a runaway query still leaves its start record on disk.
|
||||||
|
* Format follows the terminal, not a flag: pretty for a TTY, plain JSON otherwise.
|
||||||
|
*/
|
||||||
|
export function createMcpLogger(io: KtxCliIo, options: { isTTY?: boolean } = {}): KtxMcpLogger {
|
||||||
|
const level = mcpLogLevel();
|
||||||
|
const isTTY = options.isTTY ?? process.stderr.isTTY === true;
|
||||||
|
if (isTTY) {
|
||||||
|
const sink = new Writable({
|
||||||
|
write(chunk: Buffer | string, _encoding, callback) {
|
||||||
|
io.stderr.write(typeof chunk === 'string' ? chunk : chunk.toString('utf8'));
|
||||||
|
callback();
|
||||||
|
},
|
||||||
|
});
|
||||||
|
return pino({ level }, PinoPretty({ colorize: true, sync: true, destination: sink }));
|
||||||
|
}
|
||||||
|
return pino({ level }, io.stderr as DestinationStream);
|
||||||
|
}
|
||||||
|
|
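The two environment parsers in `logger.ts` above can be exercised in isolation. The following is a minimal sketch, not the project's code: the function names `parseLogLevel` and `parseSlowToolMs` and the `Env` type are hypothetical stand-ins, and the environment is passed in explicitly instead of defaulting to `process.env`. The constants and the validation rules are copied from the diff.

```typescript
// Hypothetical stand-alone copies of the env parsing shown in logger.ts.
// `Env`, `parseLogLevel`, and `parseSlowToolMs` are illustrative names.
type Env = Record<string, string | undefined>;

const LOG_LEVELS = new Set(['trace', 'debug', 'info', 'warn', 'error', 'fatal', 'silent']);
const DEFAULT_LEVEL = 'info';
const DEFAULT_SLOW_TOOL_MS = 10_000;

// Trim and lowercase the raw value; fall back to the default for anything
// that is missing or not a recognized level name.
function parseLogLevel(env: Env): string {
  const raw = env.KTX_MCP_LOG_LEVEL?.trim().toLowerCase();
  return raw && LOG_LEVELS.has(raw) ? raw : DEFAULT_LEVEL;
}

// Accept any finite, non-negative number; fall back to the default otherwise.
// Number(undefined) is NaN, so an unset variable takes the default.
function parseSlowToolMs(env: Env): number {
  const raw = Number(env.KTX_MCP_SLOW_TOOL_MS);
  return Number.isFinite(raw) && raw >= 0 ? raw : DEFAULT_SLOW_TOOL_MS;
}
```

One consequence of these rules worth knowing: `Number('')` evaluates to `0`, so a variable that is set but empty yields a threshold of `0` rather than the default.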
@ -11,6 +11,7 @@ export function createKtxMcpServer(deps: KtxMcpServerDeps): KtxMcpServerDeps['se
|
||||||
userContext: deps.userContext,
|
userContext: deps.userContext,
|
||||||
projectDir: deps.projectDir,
|
projectDir: deps.projectDir,
|
||||||
io: deps.io,
|
io: deps.io,
|
||||||
|
logger: deps.logger,
|
||||||
getClientInfo: deps.getClientInfo,
|
getClientInfo: deps.getClientInfo,
|
||||||
});
|
});
|
||||||
}
|
}
|
||||||
|
|
@ -31,6 +32,7 @@ export function createDefaultKtxMcpServer(
|
||||||
contextTools: deps.contextTools,
|
contextTools: deps.contextTools,
|
||||||
projectDir: deps.projectDir,
|
projectDir: deps.projectDir,
|
||||||
io: deps.io,
|
io: deps.io,
|
||||||
|
logger: deps.logger,
|
||||||
// The SDK populates the client identity after the initialize handshake, so
|
// The SDK populates the client identity after the initialize handshake, so
|
||||||
// read it lazily at emit time rather than at registration (undefined here).
|
// read it lazily at emit time rather than at registration (undefined here).
|
||||||
getClientInfo: () => server.server.getClientVersion(),
|
getClientInfo: () => server.server.getClientVersion(),
|
||||||
|
|
|
||||||
|
|
@ -1,5 +1,6 @@
|
||||||
import type { MemoryIngestService } from '../../context/memory/memory-runs.js';
|
import type { MemoryIngestService } from '../../context/memory/memory-runs.js';
|
||||||
import type { KtxCliIo } from '../../cli-runtime.js';
|
import type { KtxCliIo } from '../../cli-runtime.js';
|
||||||
|
import type { KtxMcpLogger } from './logger.js';
|
||||||
import type { KtxEntityDetailsInput, KtxEntityDetailsResponse } from '../scan/entity-details.js';
|
import type { KtxEntityDetailsInput, KtxEntityDetailsResponse } from '../scan/entity-details.js';
|
||||||
import type { KtxDiscoverDataInput, KtxDiscoverDataResponse } from '../../context/search/discover.js';
|
import type { KtxDiscoverDataInput, KtxDiscoverDataResponse } from '../../context/search/discover.js';
|
||||||
import type { KtxDictionarySearchInput, KtxDictionarySearchResponse } from '../../context/sl/dictionary-search.js';
|
import type { KtxDictionarySearchInput, KtxDictionarySearchResponse } from '../../context/sl/dictionary-search.js';
|
||||||
|
|
@ -28,6 +29,8 @@ interface KtxMcpProgressEvent {
|
||||||
export type KtxMcpProgressCallback = (event: KtxMcpProgressEvent) => void | Promise<void>;
|
export type KtxMcpProgressCallback = (event: KtxMcpProgressEvent) => void | Promise<void>;
|
||||||
|
|
||||||
export interface KtxMcpToolHandlerContext {
|
export interface KtxMcpToolHandlerContext {
|
||||||
|
/** Present for the HTTP StreamableHTTP transport (one per session); absent for stdio. */
|
||||||
|
sessionId?: string;
|
||||||
_meta?: { progressToken?: string | number; [key: string]: unknown };
|
_meta?: { progressToken?: string | number; [key: string]: unknown };
|
||||||
sendNotification?: (notification: {
|
sendNotification?: (notification: {
|
||||||
method: 'notifications/progress';
|
method: 'notifications/progress';
|
||||||
|
|
@ -113,7 +116,12 @@ interface KtxKnowledgePage {
|
||||||
|
|
||||||
/** @internal */
|
/** @internal */
|
||||||
export interface KtxKnowledgeMcpPort {
|
export interface KtxKnowledgeMcpPort {
|
||||||
search(input: { userId: string; query: string; limit: number }): Promise<KtxKnowledgeSearchResponse>;
|
search(input: {
|
||||||
|
userId: string;
|
||||||
|
query: string;
|
||||||
|
limit: number;
|
||||||
|
connectionId?: string;
|
||||||
|
}): Promise<KtxKnowledgeSearchResponse>;
|
||||||
read(input: { userId: string; key: string }): Promise<KtxKnowledgePage | null>;
|
read(input: { userId: string; key: string }): Promise<KtxKnowledgePage | null>;
|
||||||
}
|
}
|
||||||
|
|
||||||
|
|
@ -172,6 +180,11 @@ export interface KtxSqlExecutionMcpPort {
|
||||||
): Promise<KtxSqlExecutionResponse>;
|
): Promise<KtxSqlExecutionResponse>;
|
||||||
}
|
}
|
||||||
|
|
||||||
|
/** @internal */
|
||||||
|
export interface KtxDialectNotesMcpPort {
|
||||||
|
read(input: { connectionId: string }): Promise<{ connectionId: string; dialect: string; notes: string }>;
|
||||||
|
}
|
||||||
|
|
||||||
export interface KtxMcpContextPorts {
|
export interface KtxMcpContextPorts {
|
||||||
connections?: KtxConnectionsMcpPort;
|
connections?: KtxConnectionsMcpPort;
|
||||||
knowledge?: KtxKnowledgeMcpPort;
|
knowledge?: KtxKnowledgeMcpPort;
|
||||||
|
|
@ -180,6 +193,7 @@ export interface KtxMcpContextPorts {
|
||||||
dictionarySearch?: KtxDictionarySearchMcpPort;
|
dictionarySearch?: KtxDictionarySearchMcpPort;
|
||||||
discover?: KtxDiscoverDataMcpPort;
|
discover?: KtxDiscoverDataMcpPort;
|
||||||
sqlExecution?: KtxSqlExecutionMcpPort;
|
sqlExecution?: KtxSqlExecutionMcpPort;
|
||||||
|
dialectNotes?: KtxDialectNotesMcpPort;
|
||||||
memoryIngest?: MemoryIngestPort;
|
memoryIngest?: MemoryIngestPort;
|
||||||
}
|
}
|
||||||
|
|
||||||
|
|
@ -189,6 +203,8 @@ export interface KtxMcpServerDeps {
|
||||||
contextTools?: KtxMcpContextPorts;
|
contextTools?: KtxMcpContextPorts;
|
||||||
projectDir?: string;
|
projectDir?: string;
|
||||||
io?: KtxCliIo;
|
io?: KtxCliIo;
|
||||||
|
/** Shared per-process logger for tool-call observability; tool-call logging is off when absent. */
|
||||||
|
logger?: KtxMcpLogger;
|
||||||
/** Reads the connected client's identity once the initialize handshake completes. */
|
/** Reads the connected client's identity once the initialize handshake completes. */
|
||||||
getClientInfo?: () => KtxMcpClientInfo | undefined;
|
getClientInfo?: () => KtxMcpClientInfo | undefined;
|
||||||
}
|
}
|
||||||
|
|
|
||||||
|
|
@ -168,7 +168,7 @@ export class MemoryAgentService {
|
||||||
: '';
|
: '';
|
||||||
const prompt = [
|
const prompt = [
|
||||||
`# Wiki Index\n\n${wikiIndex}`,
|
`# Wiki Index\n\n${wikiIndex}`,
|
||||||
hasSL ? `\n# Semantic Layer Sources\n\n${slIndex}` : '',
|
hasSL ? `\n# Semantic Layer Sources (connectionId: ${input.connectionId})\n\n${slIndex}` : '',
|
||||||
'\n---\n',
|
'\n---\n',
|
||||||
assistantSection,
|
assistantSection,
|
||||||
`\n## User Message\n\n${input.userMessage.trim()}`,
|
`\n## User Message\n\n${input.userMessage.trim()}`,
|
||||||
|
|
|
||||||
|
|
@ -209,6 +209,11 @@ const scanRelationshipsSchema = z
|
||||||
.union([z.literal('all'), z.int().nonnegative()])
|
.union([z.literal('all'), z.int().nonnegative()])
|
||||||
.optional()
|
.optional()
|
||||||
.describe('Cap on validation queries per scan run. Use "all" for unlimited, an integer for a hard cap, or omit for the runtime default.'),
|
.describe('Cap on validation queries per scan run. Use "all" for unlimited, an integer for a hard cap, or omit for the runtime default.'),
|
||||||
|
detectionBudgetMs: z
|
||||||
|
.int()
|
||||||
|
.positive()
|
||||||
|
.default(600_000)
|
||||||
|
.describe('Wall-clock budget (ms) for the whole relationship-detection stage. Checked at table-profile, LLM-proposal, candidate-validation, and composite-probe boundaries; above the per-query deadline. On exhaustion the stage stops scheduling new work and returns the relationships found so far, marked partial. Raise it to trigger a fresher, fuller run.'),
|
||||||
})
|
})
|
||||||
.describe('Schema-scan relationship discovery and validation tunables.');
|
.describe('Schema-scan relationship discovery and validation tunables.');
|
||||||
|
|
||||||
|
|
|
||||||
|
|
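The `detectionBudgetMs` description above says the budget is checked at stage boundaries and that, on exhaustion, the stage stops scheduling new work and returns what it has, marked partial. A minimal sketch of that pattern follows. It is an illustration under assumptions, not the scan's implementation: `runWithBudget`, `BudgetedResult`, and the injectable `now` clock are hypothetical names introduced here.

```typescript
// Hypothetical sketch of a wall-clock budget checked between units of work.
interface BudgetedResult<T> {
  results: T[];
  partial: boolean; // true when the budget ran out before all items were processed
}

async function runWithBudget<I, T>(
  items: readonly I[],
  budgetMs: number,
  step: (item: I) => Promise<T>,
  now: () => number = Date.now, // injectable clock so the behavior is testable
): Promise<BudgetedResult<T>> {
  const deadline = now() + budgetMs;
  const results: T[] = [];
  for (const item of items) {
    // Boundary check: never start a new unit once the budget is spent.
    // A unit that is already running is allowed to finish.
    if (now() >= deadline) {
      return { results, partial: true };
    }
    results.push(await step(item));
  }
  return { results, partial: false };
}
```

Because the check happens only at boundaries, a single long-running unit can overshoot the budget; the diff's description notes that the stage budget sits above a separate per-query deadline, which bounds each unit.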
@ -30,7 +30,15 @@ function warehouseConnectionSchema<const Driver extends WarehouseDriver>(driver:
|
||||||
.array(z.string().min(1))
|
.array(z.string().min(1))
|
||||||
.optional()
|
.optional()
|
||||||
.describe(
|
.describe(
|
||||||
'Optional allowlist of fully-qualified table names ("schema.table") to ingest. When set, live-database ingest discards any table whose schema-qualified name is not in this list. Useful for smoke-testing ingest on a single table.',
|
'Optional allowlist of object names to ingest. Accepted forms: "catalog.db.name", "db.name" (schema-qualified), or bare "name". When set, live-database ingest restricts the scan to the listed objects and fails with a clear error if none match. For SQLite, "main.<name>" and the bare "<name>" are equivalent (SQLite exposes a single "main" schema). Useful for smoke-testing ingest on a single table.',
|
||||||
|
),
|
||||||
|
query_timeout_ms: z
|
||||||
|
.number()
|
||||||
|
.int()
|
||||||
|
.positive()
|
||||||
|
.optional()
|
||||||
|
.describe(
|
||||||
|
'Maximum execution time for a single read-only query, in milliseconds (default 30000). Enforced as a server-side statement timeout for remote engines and by SIGKILL-ing a forked query subprocess for in-process SQLite. A query exceeding it is cancelled and returns a "query exceeded Ns" error so the agent can revise.',
|
||||||
),
|
),
|
||||||
})
|
})
|
||||||
.describe(
|
.describe(
|
||||||
|
|
|
||||||
|
|
@ -37,7 +37,7 @@ export interface InitKtxProjectResult extends KtxLocalProject {
|
||||||
const TRACKED_SCAFFOLD_FILES: Array<{ path: string; content: string }> = [
|
const TRACKED_SCAFFOLD_FILES: Array<{ path: string; content: string }> = [
|
||||||
{
|
{
|
||||||
path: '.ktx/.gitignore',
|
path: '.ktx/.gitignore',
|
||||||
content: 'cache/\ndb.sqlite\ndb.sqlite-*\ningest-transcripts/\nsecrets/\nsetup/\nagents/\n',
|
content: 'cache/\ndb.sqlite\ndb.sqlite-*\ningest-transcripts/\nlogs/\nsecrets/\nsetup/\nagents/\n',
|
||||||
},
|
},
|
||||||
{ path: '.ktx/prompts/.gitkeep', content: '' },
|
{ path: '.ktx/prompts/.gitkeep', content: '' },
|
||||||
{ path: '.ktx/skills/.gitkeep', content: '' },
|
{ path: '.ktx/skills/.gitkeep', content: '' },
|
||||||
|
|
|
||||||
|
|
@ -24,6 +24,7 @@ const SETUP_GITIGNORE_ENTRIES = [
|
||||||
'db.sqlite',
|
'db.sqlite',
|
||||||
'db.sqlite-*',
|
'db.sqlite-*',
|
||||||
'ingest-transcripts/',
|
'ingest-transcripts/',
|
||||||
|
'logs/',
|
||||||
'secrets/',
|
'secrets/',
|
||||||
'setup/',
|
'setup/',
|
||||||
'agents/',
|
'agents/',
|
||||||
|
|
|
||||||
|
|
@ -1,5 +1,10 @@
|
||||||
import type { KtxLlmRuntimePort } from '../../context/llm/runtime-port.js';
|
import type { ChildProcess } from 'node:child_process';
|
||||||
import { z } from 'zod';
|
import { z } from 'zod';
|
||||||
|
import type { KtxLlmRuntimePort } from '../../context/llm/runtime-port.js';
|
||||||
|
import {
|
||||||
|
KtxSubprocessDeadlineError,
|
||||||
|
runGenerateObjectInSubprocess,
|
||||||
|
} from '../../context/llm/subprocess-generate-object.js';
|
||||||
import type {
|
import type {
|
||||||
KtxColumnSampleInput,
|
KtxColumnSampleInput,
|
||||||
KtxColumnSampleResult,
|
KtxColumnSampleResult,
|
||||||
|
|
@ -145,6 +150,8 @@ export interface KtxDescriptionGeneratorOptions {
|
||||||
logger?: KtxScanLoggerPort;
|
logger?: KtxScanLoggerPort;
|
||||||
onWarning?: (warning: KtxScanWarning) => void;
|
onWarning?: (warning: KtxScanWarning) => void;
|
||||||
settings: KtxDescriptionGenerationSettings;
|
settings: KtxDescriptionGenerationSettings;
|
||||||
|
/** @internal Test seam: spawn the kill-boundary child for subprocess backends. */
|
||||||
|
spawnSubprocessGenerateChild?: () => ChildProcess;
|
||||||
}
|
}
|
||||||
|
|
||||||
interface ColumnTaskResult {
|
interface ColumnTaskResult {
|
||||||
|
|
@ -510,12 +517,14 @@ export class KtxDescriptionGenerator {
|
||||||
private readonly logger?: KtxScanLoggerPort;
|
private readonly logger?: KtxScanLoggerPort;
|
||||||
private readonly onWarning?: (warning: KtxScanWarning) => void;
|
private readonly onWarning?: (warning: KtxScanWarning) => void;
|
||||||
private readonly settings: ResolvedKtxDescriptionGenerationSettings;
|
private readonly settings: ResolvedKtxDescriptionGenerationSettings;
|
||||||
|
private readonly spawnSubprocessGenerateChild?: () => ChildProcess;
|
||||||
|
|
||||||
constructor(options: KtxDescriptionGeneratorOptions) {
|
constructor(options: KtxDescriptionGeneratorOptions) {
|
||||||
this.llmRuntime = options.llmRuntime;
|
this.llmRuntime = options.llmRuntime;
|
||||||
this.cache = options.cache;
|
this.cache = options.cache;
|
||||||
this.logger = options.logger;
|
this.logger = options.logger;
|
||||||
this.onWarning = options.onWarning;
|
this.onWarning = options.onWarning;
|
||||||
|
this.spawnSubprocessGenerateChild = options.spawnSubprocessGenerateChild;
|
||||||
this.settings = {
|
this.settings = {
|
||||||
columnMaxWords: options.settings.columnMaxWords,
|
columnMaxWords: options.settings.columnMaxWords,
|
||||||
tableMaxWords: options.settings.tableMaxWords,
|
tableMaxWords: options.settings.tableMaxWords,
|
||||||
|
|
@ -757,6 +766,21 @@ export class KtxDescriptionGenerator {
|
||||||
let tableDescription: string | null = null;
|
let tableDescription: string | null = null;
|
||||||
let structuredGenerationSucceeded = false;
|
let structuredGenerationSucceeded = false;
|
||||||
|
|
||||||
|
// Bound + retry the per-table enrichment LLM call. A transient backend error
|
||||||
|
// (e.g. an "overloaded"/burst rejection when many tables enrich concurrently)
|
||||||
|
// otherwise nulls a whole table's descriptions on the FIRST failure — sampleTable
|
||||||
|
// already retries, this call did not, so transient errors silently dropped most
|
||||||
|
// tables of a db. retryAsync gives it the same 3-attempt backoff. A FRESH timeout
|
||||||
|
// per attempt still bounds a wedged wide table (it never returns a result message);
|
||||||
|
// a timeout is surfaced as KtxAbortedError so retryAsync does NOT retry it (one
|
||||||
|
// wedge stays one timeout, not 3×). Tune via KTX_ENRICH_LLM_TIMEOUT_MS (default
|
||||||
|
// 120s) and KTX_ENRICH_LLM_ATTEMPTS (default 3).
|
||||||
|
const rawEnrichTimeoutMs = Number(process.env.KTX_ENRICH_LLM_TIMEOUT_MS);
|
||||||
|
const enrichTimeoutMs = Number.isFinite(rawEnrichTimeoutMs) && rawEnrichTimeoutMs > 0 ? rawEnrichTimeoutMs : 120_000;
|
||||||
|
const enrichAttempts = Math.max(1, Number(process.env.KTX_ENRICH_LLM_ATTEMPTS ?? 3) || 3);
|
||||||
|
let llmStartedAt = 0;
|
||||||
|
let lastTimedOut = false;
|
||||||
|
|
||||||
try {
|
try {
|
||||||
const prompt = batchedPrompt({
|
const prompt = batchedPrompt({
|
||||||
table: input.table,
|
table: input.table,
|
||||||
|
|
@ -765,15 +789,91 @@ export class KtxDescriptionGenerator {
|
||||||
tableMaxWords: this.settings.tableMaxWords,
|
tableMaxWords: this.settings.tableMaxWords,
|
||||||
columnMaxWords: this.settings.columnMaxWords,
|
columnMaxWords: this.settings.columnMaxWords,
|
||||||
});
|
});
|
||||||
const generated = await this.llmRuntime.generateObject<
|
llmStartedAt = Date.now();
|
||||||
BatchedTableDescriptionOutput,
|
this.logger?.info(
|
||||||
typeof batchedTableDescriptionSchema
|
`[enrich] llm:start table=${input.table.name} cols=${input.table.columns.length} promptChars=${prompt.user.length} timeoutMs=${enrichTimeoutMs} attempts=${enrichAttempts}`,
|
||||||
>({
|
{ connectorId: input.connector.id, table: input.table.name, columns: input.table.columns.length },
|
||||||
role: 'candidateExtraction',
|
);
|
||||||
system: prompt.system,
|
// Subprocess backends (codex/claude-code) own an SDK child that ignores the
|
||||||
prompt: prompt.user,
|
// in-process abort, so each attempt runs behind a tree-killable boundary;
|
||||||
schema: batchedTableDescriptionSchema,
|
// HTTP backends keep the native abortSignal -> fetch cancellation.
|
||||||
temperature: this.settings.temperature,
|
const enrichForkSpec = this.llmRuntime.subprocessForkSpec();
|
||||||
|
const enrichJsonSchema = enrichForkSpec
|
||||||
|
? (z.toJSONSchema(batchedTableDescriptionSchema, { target: 'draft-7' }) as Record<string, unknown>)
|
||||||
|
: null;
|
||||||
|
const generated = await retryAsync(
|
||||||
|
async () => {
|
||||||
|
if (enrichForkSpec && enrichJsonSchema) {
|
||||||
|
try {
|
||||||
|
return await runGenerateObjectInSubprocess<
|
||||||
|
BatchedTableDescriptionOutput,
|
||||||
|
typeof batchedTableDescriptionSchema
|
||||||
|
>({
|
||||||
|
forkSpec: enrichForkSpec,
|
||||||
|
role: 'candidateExtraction',
|
||||||
|
system: prompt.system,
|
||||||
|
prompt: prompt.user,
|
||||||
|
schema: batchedTableDescriptionSchema,
|
||||||
|
jsonSchema: enrichJsonSchema,
|
||||||
|
deadlineMs: enrichTimeoutMs,
|
||||||
|
...(input.context.signal ? { signal: input.context.signal } : {}),
|
||||||
|
...(this.spawnSubprocessGenerateChild
|
||||||
|
? { spawnChild: this.spawnSubprocessGenerateChild }
|
||||||
|
: {}),
|
||||||
|
});
|
||||||
|
} catch (error) {
|
||||||
|
// The boundary tree-kills the wedged child on deadline; a per-table
|
||||||
|
// timeout is not worth retrying (it would just time out again), so
|
||||||
|
// surface it as KtxAbortedError so retryAsync stops immediately.
|
||||||
|
if (error instanceof KtxSubprocessDeadlineError && !input.context.signal?.aborted) {
|
||||||
|
lastTimedOut = true;
|
||||||
|
throw new KtxAbortedError();
|
||||||
|
}
|
||||||
|
throw error;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
const enrichTimeout = AbortSignal.timeout(enrichTimeoutMs);
|
||||||
|
const abortSignal = input.context.signal
|
||||||
|
? AbortSignal.any([enrichTimeout, input.context.signal])
|
||||||
|
: enrichTimeout;
|
||||||
|
try {
|
||||||
|
return await this.llmRuntime.generateObject<
|
||||||
|
BatchedTableDescriptionOutput,
|
||||||
|
typeof batchedTableDescriptionSchema
|
||||||
|
>({
|
||||||
|
role: 'candidateExtraction',
|
||||||
|
system: prompt.system,
|
||||||
|
prompt: prompt.user,
|
||||||
|
schema: batchedTableDescriptionSchema,
|
||||||
|
temperature: this.settings.temperature,
|
||||||
|
abortSignal,
|
||||||
|
});
|
||||||
|
} catch (error) {
|
||||||
|
// A per-table timeout is not worth retrying (it would just time out
|
||||||
|
// again); surface it as KtxAbortedError so retryAsync stops immediately.
|
||||||
|
// A genuine context cancellation is handled by retryAsync's own signal check.
|
||||||
|
if (enrichTimeout.aborted && !input.context.signal?.aborted) {
|
||||||
|
lastTimedOut = true;
|
||||||
|
throw new KtxAbortedError();
|
||||||
|
}
|
||||||
|
throw error;
|
||||||
|
}
|
||||||
|
},
|
||||||
|
{
|
||||||
|
attempts: enrichAttempts,
|
||||||
|
baseDelayMs: 500,
|
||||||
|
...(input.context.signal ? { signal: input.context.signal } : {}),
|
||||||
|
onAttemptFailure: (error, attempt) => {
|
||||||
|
this.logger?.warn(
|
||||||
|
`[enrich] llm:retry table=${input.table.name} attempt=${attempt}: ${errorMessage(error)}`,
|
||||||
|
{ connectorId: input.connector.id, table: input.table.name, attempt },
|
||||||
|
);
|
||||||
|
},
|
||||||
|
},
|
||||||
|
);
|
||||||
|
this.logger?.info(`[enrich] llm:done table=${input.table.name} ms=${Date.now() - llmStartedAt}`, {
|
||||||
|
connectorId: input.connector.id,
|
||||||
|
table: input.table.name,
|
||||||
});
|
});
|
||||||
structuredGenerationSucceeded = true;
|
structuredGenerationSucceeded = true;
|
||||||
tableDescription = generated.tableDescription.trim() || null;
|
tableDescription = generated.tableDescription.trim() || null;
|
||||||
|
|
@ -794,16 +894,25 @@ export class KtxDescriptionGenerator {
|
||||||
});
|
});
|
||||||
}
|
}
|
||||||
} catch (error) {
|
} catch (error) {
|
||||||
this.logger?.warn(`Batched table description failed for ${input.table.name}: ${errorMessage(error)}`, {
|
// A genuine cancellation propagates so the stage fails and resumes; a
|
||||||
connectorId: input.connector.id,
|
// per-table timeout (context.signal not aborted) still degrades to null.
|
||||||
table: input.table.name,
|
if (input.context.signal?.aborted) {
|
||||||
});
|
throw error;
|
||||||
|
}
|
||||||
|
const elapsedMs = llmStartedAt ? Date.now() - llmStartedAt : 0;
|
||||||
|
const timedOut = lastTimedOut;
|
||||||
|
this.logger?.warn(
|
||||||
|
`[enrich] llm:${timedOut ? 'TIMEOUT' : 'fail'} table=${input.table.name} cols=${input.table.columns.length} ms=${elapsedMs}: ${errorMessage(error)}`,
|
||||||
|
{ connectorId: input.connector.id, table: input.table.name, timedOut, elapsedMs },
|
||||||
|
);
|
||||||
this.onWarning?.({
|
this.onWarning?.({
|
||||||
code: 'enrichment_failed',
|
code: timedOut ? 'enrichment_timeout' : 'enrichment_failed',
|
||||||
message: `Failed to generate batched description for table ${input.table.name}: ${errorMessage(error)}`,
|
message: `${
|
||||||
|
timedOut ? `Timed out after ${elapsedMs}ms generating` : 'Failed to generate'
|
||||||
|
} batched description for table ${input.table.name}: ${errorMessage(error)}`,
|
||||||
table: input.table.name,
|
table: input.table.name,
|
||||||
recoverable: true,
|
recoverable: true,
|
||||||
metadata: { connectorId: input.connector.id },
|
metadata: { connectorId: input.connector.id, ...(timedOut ? { timeoutMs: enrichTimeoutMs } : {}) },
|
||||||
});
|
});
|
||||||
}
|
}
|
||||||
|
|
||||||
|
|
|
||||||
|
|
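The enrichment hunk above combines two rules: transient failures are retried with backoff, while a timeout gets a fresh deadline per attempt but is never retried. The sketch below shows that combination in a self-contained form. It is an assumption-laden illustration: `retryWithTimeout`, `withTimeout`, and the `timedOut` marker are hypothetical, and the sketch does not reproduce the project's `retryAsync`, `KtxAbortedError`, or subprocess boundary. It also only stops waiting on timeout; it does not cancel or kill the underlying work, which the diff handles with an abort signal or a tree-kill.

```typescript
// Hypothetical sketch: retry transient errors, but stop on the first timeout.
type TimeoutError = Error & { timedOut: true };

function isTimeoutError(error: unknown): error is TimeoutError {
  return error instanceof Error && (error as Partial<TimeoutError>).timedOut === true;
}

// Race the work against a fresh timer; clear the timer whichever side wins.
async function withTimeout<T>(work: () => Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_resolve, reject) => {
    timer = setTimeout(() => {
      reject(Object.assign(new Error(`attempt exceeded ${ms}ms`), { timedOut: true as const }));
    }, ms);
  });
  try {
    return await Promise.race([work(), timeout]);
  } finally {
    if (timer !== undefined) clearTimeout(timer);
  }
}

async function retryWithTimeout<T>(
  work: () => Promise<T>,
  options: { attempts: number; baseDelayMs: number; timeoutMs: number },
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= options.attempts; attempt++) {
    try {
      return await withTimeout(work, options.timeoutMs);
    } catch (error) {
      // A timeout would most likely time out again, so it is not retried.
      if (isTimeoutError(error)) throw error;
      lastError = error;
      if (attempt < options.attempts) {
        // Exponential backoff between attempts (an assumption of this sketch).
        const delay = options.baseDelayMs * 2 ** (attempt - 1);
        await new Promise<void>((resolve) => setTimeout(resolve, delay));
      }
    }
  }
  throw lastError;
}
```

With this shape, a backend that rejects twice and then succeeds costs three calls, while a call that hangs costs exactly one timeout rather than one per attempt.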
@ -10,21 +10,34 @@ import type { KtxTableRef } from './types.js';
|
||||||
* "catalog.db.name" — fully qualified
|
* "catalog.db.name" — fully qualified
|
||||||
* "db.name" — schema-qualified (catalog = null)
|
* "db.name" — schema-qualified (catalog = null)
|
||||||
* "name" — bare (catalog = db = null; SQLite-shape)
|
* "name" — bare (catalog = db = null; SQLite-shape)
|
||||||
|
*
|
||||||
|
* SQLite exposes a single schema named `main` but the connector emits objects
|
||||||
|
* with `db: null`, so the `"main.<name>"` form is normalized to the bare shape
|
||||||
|
* to match. Both `"main.customers"` and `"customers"` therefore select the same
|
||||||
|
* object.
|
||||||
*/
|
*/
|
||||||
export function resolveEnabledTables(
|
export function resolveEnabledTables(
|
||||||
connection: Record<string, unknown> | undefined,
|
connection: Record<string, unknown> | undefined,
|
||||||
): ReadonlySet<KtxTableRefKey> | null {
|
): ReadonlySet<KtxTableRefKey> | null {
|
||||||
const raw = connection?.enabled_tables;
|
const raw = connection?.enabled_tables;
|
||||||
if (!Array.isArray(raw) || raw.length === 0) return null;
|
if (!Array.isArray(raw) || raw.length === 0) return null;
|
||||||
|
const driver = typeof connection?.driver === 'string' ? connection.driver : undefined;
|
||||||
const refs: KtxTableRef[] = [];
|
const refs: KtxTableRef[] = [];
|
||||||
for (const value of raw) {
|
for (const value of raw) {
|
||||||
const parsed = parseEnabledTableEntry(value);
|
const parsed = parseEnabledTableEntry(value);
|
||||||
if (parsed) refs.push(parsed);
|
if (parsed) refs.push(normalizeRefForDriver(parsed, driver));
|
||||||
}
|
}
|
||||||
if (refs.length === 0) return null;
|
if (refs.length === 0) return null;
|
||||||
return tableRefSet(refs);
|
return tableRefSet(refs);
|
||||||
}
|
}
|
||||||
|
|
||||||
|
function normalizeRefForDriver(ref: KtxTableRef, driver: string | undefined): KtxTableRef {
|
||||||
|
if (driver === 'sqlite' && ref.catalog === null && ref.db === 'main') {
|
||||||
|
return { catalog: null, db: null, name: ref.name };
|
||||||
|
}
|
||||||
|
return ref;
|
||||||
|
}
|
||||||
|
|
||||||
function parseEnabledTableEntry(value: unknown): KtxTableRef | null {
|
function parseEnabledTableEntry(value: unknown): KtxTableRef | null {
|
||||||
if (typeof value === 'string') {
|
if (typeof value === 'string') {
|
||||||
return parseDottedTableEntry(value);
|
return parseDottedTableEntry(value);
|
||||||
|
|
|
||||||
|
|
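The `main.` normalization in the hunk above can be exercised in isolation. The following is a minimal sketch, not the repository's code: `TableRef`, `parseDotted`, `refKey`, and `enabledKeys` are hypothetical stand-ins (the real `parseDottedTableEntry` and `tableRefSet` are not shown in this diff), and only `normalizeForDriver` mirrors the logic added here.

```typescript
// Hypothetical stand-in for the repository's KtxTableRef type.
interface TableRef {
  catalog: string | null;
  db: string | null;
  name: string;
}

// Assumed parser: splits "catalog.db.name" / "db.name" / "name" on dots.
// The real parseDottedTableEntry may handle quoting or other cases.
function parseDotted(entry: string): TableRef | null {
  const parts = entry.split('.');
  if (parts.length > 3 || parts.some((part) => part.length === 0)) return null;
  // Left-pad with nulls so the last three slots are [catalog, db, name].
  const padded: (string | null)[] = [null, null, ...parts].slice(-3);
  const [catalog, db, name] = padded;
  if (typeof name !== 'string') return null;
  return { catalog: catalog ?? null, db: db ?? null, name };
}

// Mirrors the diff: SQLite's single schema `main` collapses to the bare shape.
function normalizeForDriver(ref: TableRef, driver: string | undefined): TableRef {
  if (driver === 'sqlite' && ref.catalog === null && ref.db === 'main') {
    return { catalog: null, db: null, name: ref.name };
  }
  return ref;
}

// Hypothetical key function used only to deduplicate refs in this sketch.
function refKey(ref: TableRef): string {
  return JSON.stringify([ref.catalog, ref.db, ref.name]);
}

function enabledKeys(entries: string[], driver: string | undefined): Set<string> {
  const keys = new Set<string>();
  for (const entry of entries) {
    const parsed = parseDotted(entry);
    if (parsed) keys.add(refKey(normalizeForDriver(parsed, driver)));
  }
  return keys;
}
```

Under these assumptions, `enabledKeys(['main.customers', 'customers'], 'sqlite')` yields a single key, while the same entries under any other driver stay distinct.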
@@ -1,14 +1,19 @@
|
||||||
import { createHash } from 'node:crypto';
|
import { createHash } from 'node:crypto';
|
||||||
|
import type { KtxScanRelationshipConfig } from '../project/config.js';
|
||||||
import type { KtxScanEnrichmentStage, KtxScanEnrichmentStateSummary, KtxScanMode, KtxSchemaSnapshot } from './types.js';
|
import type { KtxScanEnrichmentStage, KtxScanEnrichmentStateSummary, KtxScanMode, KtxSchemaSnapshot } from './types.js';
|
||||||
|
|
||||||
const KTX_SCAN_ENRICHMENT_STAGES: readonly KtxScanEnrichmentStage[] = [
|
/**
|
||||||
|
* Canonical enrichment-stage registry. The `--stages` CLI parser validates
|
||||||
|
* against this list, and stage selection / iteration derives its order here.
|
||||||
|
*/
|
||||||
|
export const KTX_SCAN_ENRICHMENT_STAGES: readonly KtxScanEnrichmentStage[] = [
|
||||||
'descriptions',
|
'descriptions',
|
||||||
'embeddings',
|
'embeddings',
|
||||||
'relationships',
|
'relationships',
|
||||||
] as const;
|
] as const;
|
||||||
|
|
||||||
export interface KtxScanEnrichmentStageLookup {
|
export interface KtxScanEnrichmentStageLookup {
|
||||||
runId: string;
|
connectionId: string;
|
||||||
stage: KtxScanEnrichmentStage;
|
stage: KtxScanEnrichmentStage;
|
||||||
inputHash: string;
|
inputHash: string;
|
||||||
}
|
}
|
||||||
|
|
@@ -47,6 +52,15 @@ export interface KtxScanEnrichmentStateStore {
|
||||||
findCompletedStage<TOutput = unknown>(
|
findCompletedStage<TOutput = unknown>(
|
||||||
input: KtxScanEnrichmentStageLookup,
|
input: KtxScanEnrichmentStageLookup,
|
||||||
): Promise<KtxScanEnrichmentCompletedStage<TOutput> | null>;
|
): Promise<KtxScanEnrichmentCompletedStage<TOutput> | null>;
|
||||||
|
/**
|
||||||
|
* The most recently completed row for a (connection, stage) pair regardless of
|
||||||
|
* input hash. Used by the staleness check to compare a stage's stored hash
|
||||||
|
* against its freshly recomputed one (D4).
|
||||||
|
*/
|
||||||
|
findLatestCompletedStage(input: {
|
||||||
|
connectionId: string;
|
||||||
|
stage: KtxScanEnrichmentStage;
|
||||||
|
}): Promise<KtxScanEnrichmentCompletedStage | null>;
|
||||||
saveCompletedStage<TOutput = unknown>(
|
saveCompletedStage<TOutput = unknown>(
|
||||||
input: Omit<KtxScanEnrichmentCompletedStage<TOutput>, 'status' | 'errorMessage'>,
|
input: Omit<KtxScanEnrichmentCompletedStage<TOutput>, 'status' | 'errorMessage'>,
|
||||||
): Promise<void>;
|
): Promise<void>;
|
||||||
|
|
@@ -54,12 +68,35 @@ export interface KtxScanEnrichmentStateStore {
|
||||||
listRunStages(runId: string): Promise<KtxScanEnrichmentStageRecord[]>;
|
listRunStages(runId: string): Promise<KtxScanEnrichmentStageRecord[]>;
|
||||||
}
|
}
|
||||||
|
|
||||||
export interface ComputeKtxScanEnrichmentInputHashInput {
|
/** Description-LLM identity: the inputs that change a description's content. */
|
||||||
|
export interface KtxScanLlmIdentity {
|
||||||
|
model: string | null;
|
||||||
|
baseUrlConfigured: boolean;
|
||||||
|
}
|
||||||
|
|
||||||
|
/** Embedding-model identity: the inputs that change an embedding vector. */
|
||||||
|
export interface KtxScanEmbeddingIdentity {
|
||||||
|
model: string | null;
|
||||||
|
dimensions: number | null;
|
||||||
|
batchSize: number | null;
|
||||||
|
}
|
||||||
|
|
||||||
|
export interface KtxDescriptionsStageHashInput {
|
||||||
snapshot: KtxSchemaSnapshot;
|
snapshot: KtxSchemaSnapshot;
|
||||||
mode: KtxScanMode;
|
llmIdentity: KtxScanLlmIdentity;
|
||||||
detectRelationships: boolean;
|
}
|
||||||
providerIdentity: Record<string, unknown>;
|
|
||||||
relationshipSettings?: unknown;
|
export interface KtxEmbeddingsStageHashInput {
|
||||||
|
snapshot: KtxSchemaSnapshot;
|
||||||
|
embeddingIdentity: KtxScanEmbeddingIdentity;
|
||||||
|
/** Digest of the resolved description text the embeddings consume (see {@link computeKtxScanDescriptionDigest}). */
|
||||||
|
descriptionDigest: string;
|
||||||
|
}
|
||||||
|
|
||||||
|
export interface KtxRelationshipsStageHashInput {
|
||||||
|
snapshot: KtxSchemaSnapshot;
|
||||||
|
relationshipSettings: KtxScanRelationshipConfig;
|
||||||
|
llmIdentity: KtxScanLlmIdentity;
|
||||||
}
|
}
|
||||||
|
|
||||||
function stableJson(value: unknown): string {
|
function stableJson(value: unknown): string {
|
||||||
|
|
@@ -75,8 +112,38 @@ function stableJson(value: unknown): string {
|
||||||
return JSON.stringify(value);
|
return JSON.stringify(value);
|
||||||
}
|
}
|
||||||
|
|
||||||
export function computeKtxScanEnrichmentInputHash(input: ComputeKtxScanEnrichmentInputHashInput): string {
|
function sha256(value: unknown): string {
|
||||||
return createHash('sha256').update(stableJson(input)).digest('hex');
|
return createHash('sha256').update(stableJson(value)).digest('hex');
|
||||||
|
}
|
||||||
|
|
||||||
|
export function computeKtxDescriptionsStageHash(input: KtxDescriptionsStageHashInput): string {
|
||||||
|
return sha256({ snapshot: input.snapshot, llmIdentity: input.llmIdentity });
|
||||||
|
}
|
||||||
|
|
||||||
|
export function computeKtxEmbeddingsStageHash(input: KtxEmbeddingsStageHashInput): string {
|
||||||
|
return sha256({
|
||||||
|
snapshot: input.snapshot,
|
||||||
|
embeddingIdentity: input.embeddingIdentity,
|
||||||
|
descriptionDigest: input.descriptionDigest,
|
||||||
|
});
|
||||||
|
}
|
||||||
|
|
||||||
|
export function computeKtxRelationshipsStageHash(input: KtxRelationshipsStageHashInput): string {
|
||||||
|
return sha256({
|
||||||
|
snapshot: input.snapshot,
|
||||||
|
relationshipSettings: input.relationshipSettings,
|
||||||
|
llmIdentity: input.llmIdentity,
|
||||||
|
});
|
||||||
|
}
|
||||||
|
|
||||||
|
/**
|
||||||
|
* Content digest of the resolved per-column description text the embeddings
|
||||||
|
* stage consumes. Folding it into the embeddings hash content-addresses
|
||||||
|
* embeddings on their real upstream, so re-describing busts only the embeddings
|
||||||
|
* that depend on the changed text (D4 self-healing).
|
||||||
|
*/
|
||||||
|
export function computeKtxScanDescriptionDigest(texts: readonly string[]): string {
|
||||||
|
return sha256(texts);
|
||||||
}
|
}
|
||||||
|
|
||||||
function uniqueStages(stages: KtxScanEnrichmentStage[]): KtxScanEnrichmentStage[] {
|
function uniqueStages(stages: KtxScanEnrichmentStage[]): KtxScanEnrichmentStage[] {
|
||||||
|
|
|
||||||
|
|
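The per-stage hashes above replace a single combined input hash, so a change to one stage's inputs invalidates only that stage. A minimal sketch of the idea follows, assuming a key-sorting canonicalizer (the body of the real `stableJson` is only partly visible in this diff) and a toy `Snapshot` shape; the function names are shortened and hypothetical.

```typescript
import { createHash } from 'node:crypto';

// Assumed canonicalizer: sorts object keys recursively so key order does not
// affect the hash. The repository's stableJson may differ in detail.
function canonical(value: unknown): unknown {
  if (Array.isArray(value)) return value.map(canonical);
  if (value !== null && typeof value === 'object') {
    const record = value as Record<string, unknown>;
    const sorted: Record<string, unknown> = {};
    for (const key of Object.keys(record).sort()) sorted[key] = canonical(record[key]);
    return sorted;
  }
  return value;
}

function sha256(value: unknown): string {
  return createHash('sha256').update(JSON.stringify(canonical(value))).digest('hex');
}

// Toy shapes for illustration; the real KtxSchemaSnapshot is much richer.
interface Snapshot { tables: { name: string; columns: string[] }[] }
interface LlmIdentity { model: string | null; baseUrlConfigured: boolean }
interface EmbeddingIdentity { model: string | null; dimensions: number | null; batchSize: number | null }

function descriptionsHash(snapshot: Snapshot, llmIdentity: LlmIdentity): string {
  return sha256({ snapshot, llmIdentity });
}

// Digest of the description texts the embeddings stage consumes.
function descriptionDigest(texts: readonly string[]): string {
  return sha256(texts);
}

function embeddingsHash(snapshot: Snapshot, embeddingIdentity: EmbeddingIdentity, digest: string): string {
  return sha256({ snapshot, embeddingIdentity, descriptionDigest: digest });
}
```

In this sketch, swapping the description LLM changes `descriptionsHash` but leaves `embeddingsHash` untouched unless the regenerated text actually differs, because the embeddings hash depends on the digest of the text rather than on the LLM identity.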
@@ -1,10 +1,11 @@
|
||||||
import YAML from 'yaml';
|
import YAML from 'yaml';
|
||||||
import { buildLiveDatabaseManifestShards, type LiveDatabaseManifestExistingDescriptions, type LiveDatabaseManifestJoinData, type LiveDatabaseManifestJoinEntry, type LiveDatabaseManifestShard, type LiveDatabaseManifestTableData } from '../../context/ingest/adapters/live-database/manifest.js';
|
import { buildLiveDatabaseManifestShards, buildTableRef, type LiveDatabaseManifestExistingDescriptions, type LiveDatabaseManifestJoinData, type LiveDatabaseManifestJoinEntry, type LiveDatabaseManifestShard, type LiveDatabaseManifestTableData } from '../../context/ingest/adapters/live-database/manifest.js';
|
||||||
import type { TableUsageOutput } from '../../context/ingest/adapters/historic-sql/skill-schemas.js';
|
import type { TableUsageOutput } from '../../context/ingest/adapters/historic-sql/skill-schemas.js';
|
||||||
import type { KtxScanRelationshipConfig } from '../project/config.js';
|
import type { KtxScanRelationshipConfig } from '../project/config.js';
|
||||||
import type { KtxLocalProject } from '../../context/project/project.js';
|
import type { KtxLocalProject } from '../../context/project/project.js';
|
||||||
import { isSlYamlPath } from '../../context/sl/source-files.js';
|
import { isSlYamlPath } from '../../context/sl/source-files.js';
|
||||||
import { deriveFederatedConnection } from '../connections/federation.js';
|
import { deriveFederatedConnection } from '../connections/federation.js';
|
||||||
|
import { tableRefKey } from './table-ref.js';
|
||||||
import type { KtxLocalScanEnrichmentResult } from './local-enrichment.js';
|
import type { KtxLocalScanEnrichmentResult } from './local-enrichment.js';
|
||||||
import {
|
import {
|
||||||
buildKtxRelationshipArtifacts,
|
buildKtxRelationshipArtifacts,
|
||||||
|
|
@@ -28,6 +29,12 @@ export interface WriteLocalScanManifestShardsInput {
|
||||||
dryRun: boolean;
|
dryRun: boolean;
|
||||||
descriptionUpdates?: KtxLocalScanEnrichmentResult['descriptionUpdates'];
|
descriptionUpdates?: KtxLocalScanEnrichmentResult['descriptionUpdates'];
|
||||||
relationshipUpdate?: KtxLocalScanEnrichmentResult['relationshipUpdate'];
|
relationshipUpdate?: KtxLocalScanEnrichmentResult['relationshipUpdate'];
|
||||||
|
/**
|
||||||
|
* When set, write only the shards that contain one of these tables. All shards
|
||||||
|
* are still built (so merging preserves prior content); the unlisted shards are
|
||||||
|
* left untouched on disk. Used by the incremental flush to bound git commits.
|
||||||
|
*/
|
||||||
|
onlyChangedTableNames?: ReadonlySet<string>;
|
||||||
}
|
}
|
||||||
|
|
||||||
export interface WriteLocalScanManifestShardsResult {
|
export interface WriteLocalScanManifestShardsResult {
|
||||||
|
|
@@ -75,9 +82,8 @@ function schemaDir(connectionId: string): string {
|
||||||
|
|
||||||
function tableDescription(
|
function tableDescription(
|
||||||
table: KtxSchemaTable,
|
table: KtxSchemaTable,
|
||||||
descriptionUpdates: LocalDescriptionUpdates = [],
|
update: LocalDescriptionUpdates[number] | undefined,
|
||||||
): Record<string, string> | undefined {
|
): Record<string, string> | undefined {
|
||||||
const update = descriptionUpdates.find((candidate) => candidate.table.name === table.name);
|
|
||||||
const descriptions: Record<string, string> = {};
|
const descriptions: Record<string, string> = {};
|
||||||
if (table.comment) {
|
if (table.comment) {
|
||||||
descriptions.db = table.comment;
|
descriptions.db = table.comment;
|
||||||
|
|
@@ -89,11 +95,9 @@ function tableDescription(
|
||||||
}
|
}
|
||||||
|
|
||||||
function columnDescription(
|
function columnDescription(
|
||||||
table: KtxSchemaTable,
|
|
||||||
column: KtxSchemaColumn,
|
column: KtxSchemaColumn,
|
||||||
descriptionUpdates: LocalDescriptionUpdates = [],
|
update: LocalDescriptionUpdates[number] | undefined,
|
||||||
): Record<string, string> | undefined {
|
): Record<string, string> | undefined {
|
||||||
const update = descriptionUpdates.find((candidate) => candidate.table.name === table.name);
|
|
||||||
const aiDescription = update?.columnDescriptions[column.name] ?? null;
|
const aiDescription = update?.columnDescriptions[column.name] ?? null;
|
||||||
const descriptions: Record<string, string> = {};
|
const descriptions: Record<string, string> = {};
|
||||||
if (column.comment) {
|
if (column.comment) {
|
||||||
|
|
@@ -109,19 +113,25 @@ function snapshotTablesToManifestData(
|
||||||
snapshot: KtxSchemaSnapshot,
|
snapshot: KtxSchemaSnapshot,
|
||||||
descriptionUpdates: LocalDescriptionUpdates = [],
|
descriptionUpdates: LocalDescriptionUpdates = [],
|
||||||
): LiveDatabaseManifestTableData[] {
|
): LiveDatabaseManifestTableData[] {
|
||||||
return snapshot.tables.map((table) => ({
|
// Resolve a table's descriptions by full identity: two same-named tables in
|
||||||
name: table.name,
|
// different schemas must not collapse onto one update.
|
||||||
catalog: table.catalog,
|
const updateByRef = new Map(descriptionUpdates.map((update) => [tableRefKey(update.table), update]));
|
||||||
db: table.db,
|
return snapshot.tables.map((table) => {
|
||||||
descriptions: tableDescription(table, descriptionUpdates),
|
const update = updateByRef.get(tableRefKey({ catalog: table.catalog, db: table.db, name: table.name }));
|
||||||
columns: table.columns.map((column) => ({
|
return {
|
||||||
name: column.name,
|
name: table.name,
|
||||||
type: column.dimensionType,
|
catalog: table.catalog,
|
||||||
...(column.primaryKey ? { pk: true } : {}),
|
db: table.db,
|
||||||
...(column.nullable === false ? { nullable: false } : {}),
|
descriptions: tableDescription(table, update),
|
||||||
descriptions: columnDescription(table, column, descriptionUpdates),
|
columns: table.columns.map((column) => ({
|
||||||
})),
|
name: column.name,
|
||||||
}));
|
type: column.dimensionType,
|
||||||
|
...(column.primaryKey ? { pk: true } : {}),
|
||||||
|
...(column.nullable === false ? { nullable: false } : {}),
|
||||||
|
descriptions: columnDescription(column, update),
|
||||||
|
})),
|
||||||
|
};
|
||||||
|
});
|
||||||
}
|
}
|
||||||
|
|
||||||
function formalJoins(snapshot: KtxSchemaSnapshot): LiveDatabaseManifestJoinData[] {
|
function formalJoins(snapshot: KtxSchemaSnapshot): LiveDatabaseManifestJoinData[] {
|
||||||
|
|
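The hunk above swaps a per-table `find` on the bare table name for a map keyed by full table identity. A small sketch of why that matters is below; `refKey` is a hypothetical key function (the repository's `tableRefKey` lives in `./table-ref.js` and is not shown here), and the types are trimmed for illustration.

```typescript
interface TableRef { catalog: string | null; db: string | null; name: string }
interface DescriptionUpdate { table: TableRef; tableDescription: string | null }

// Hypothetical key function standing in for tableRefKey.
function refKey(ref: TableRef): string {
  return JSON.stringify([ref.catalog, ref.db, ref.name]);
}

// Build the lookup once, instead of scanning the update list per table.
function indexUpdates(updates: readonly DescriptionUpdate[]): Map<string, DescriptionUpdate> {
  return new Map<string, DescriptionUpdate>(
    updates.map((update) => [refKey(update.table), update] as const),
  );
}

// The pre-change behaviour: first update whose bare name matches wins.
function findByBareName(updates: readonly DescriptionUpdate[], name: string): DescriptionUpdate | undefined {
  return updates.find((candidate) => candidate.table.name === name);
}

const updates: DescriptionUpdate[] = [
  { table: { catalog: null, db: 'sales', name: 'orders' }, tableDescription: 'Live orders' },
  { table: { catalog: null, db: 'archive', name: 'orders' }, tableDescription: 'Archived orders' },
];
const byRef = indexUpdates(updates);
const archived = byRef.get(refKey({ catalog: null, db: 'archive', name: 'orders' }));
```

Here `archived` resolves to the archive table's own update, whereas `findByBareName(updates, 'orders')` would return the `sales` entry for both tables.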
@@ -256,7 +266,10 @@ async function loadExistingManifestState(
|
||||||
if (!validTableNames.has(tableName)) {
|
if (!validTableNames.has(tableName)) {
|
||||||
continue;
|
continue;
|
||||||
}
|
}
|
||||||
descriptions.set(tableName, {
|
// Descriptions/usage key on the fully-qualified `entry.table` ref so two
|
||||||
|
// same-named tables across schemas stay distinct; joins remain keyed by
|
||||||
|
// bare name to match the bare-name join graph.
|
||||||
|
descriptions.set(entry.table, {
|
||||||
table: entry.descriptions ? { ...entry.descriptions } : undefined,
|
table: entry.descriptions ? { ...entry.descriptions } : undefined,
|
||||||
columns: new Map(
|
columns: new Map(
|
||||||
(entry.columns ?? []).flatMap((column) =>
|
(entry.columns ?? []).flatMap((column) =>
|
||||||
|
|
@@ -265,7 +278,7 @@ async function loadExistingManifestState(
|
||||||
),
|
),
|
||||||
});
|
});
|
||||||
if (entry.usage) {
|
if (entry.usage) {
|
||||||
usage.set(tableName, { ...entry.usage });
|
usage.set(entry.table, { ...entry.usage });
|
||||||
}
|
}
|
||||||
const joins = (entry.joins ?? []).filter((join) => {
|
const joins = (entry.joins ?? []).filter((join) => {
|
||||||
return (
|
return (
|
||||||
|
|
@@ -286,6 +299,108 @@ async function loadExistingManifestState(
|
||||||
return { descriptions, preservedJoins, usage };
|
return { descriptions, preservedJoins, usage };
|
||||||
}
|
}
|
||||||
|
|
||||||
|
/**
|
||||||
|
* Reconstructs the descriptions already persisted in the on-disk `_schema` as
|
||||||
|
* the in-memory `descriptionUpdates` shape, so a stage-selective run that skips
|
||||||
|
* the descriptions stage (e.g. `--stages relationships`/`--stages embeddings`)
|
||||||
|
* can still feed embeddings + relationships the prior AI descriptions. Tables or
|
||||||
|
* columns with no AI description carry `null`.
|
||||||
|
*/
|
||||||
|
export async function loadOnDiskDescriptionUpdates(
|
||||||
|
project: KtxLocalProject,
|
||||||
|
connectionId: string,
|
||||||
|
snapshot: KtxSchemaSnapshot,
|
||||||
|
): Promise<LocalDescriptionUpdates> {
|
||||||
|
const siblingTargets = await federatedSiblingTargets(project, connectionId);
|
||||||
|
const existing = await loadExistingManifestState(project, connectionId, snapshot, siblingTargets);
|
||||||
|
return snapshot.tables.map((table) => {
|
||||||
|
const entry = existing.descriptions.get(buildTableRef(table.name, table.catalog, table.db));
|
||||||
|
const columnDescriptions: Record<string, string | null> = {};
|
||||||
|
for (const column of table.columns) {
|
||||||
|
columnDescriptions[column.name] = entry?.columns.get(column.name)?.ai ?? null;
|
||||||
|
}
|
||||||
|
return {
|
||||||
|
table: { catalog: table.catalog, db: table.db, name: table.name },
|
||||||
|
tableDescription: entry?.table?.ai ?? null,
|
||||||
|
columnDescriptions,
|
||||||
|
};
|
||||||
|
});
|
||||||
|
}
|
||||||
|
|
||||||
|
// The incremental descriptions resume record. It lives at a stable, NON-syncId
|
||||||
|
// path: a from-scratch interruption gets a fresh syncId on the next run, so a
|
||||||
|
// syncId-scoped record would be unreachable on resume. The manifest already lives
|
||||||
|
// at the same stable per-connection scope.
|
||||||
|
function descriptionsProgressPath(connectionId: string): string {
|
||||||
|
return `raw-sources/${connectionId}/${LIVE_DATABASE_ADAPTER}/enrichment-progress/descriptions.json`;
|
||||||
|
}
|
||||||
|
|
||||||
|
interface DescriptionsProgressRecord {
|
||||||
|
inputHash: string;
|
||||||
|
descriptions: LocalDescriptionUpdates;
|
||||||
|
}
|
||||||
|
|
||||||
|
export interface KtxScanDescriptionResumeStore {
|
||||||
|
/** Prior enriched descriptions when the durable record matches `inputHash`, else null. */
|
||||||
|
load(inputHash: string): Promise<LocalDescriptionUpdates | null>;
|
||||||
|
/** Persist the descriptions so far + the manifest shards that gained a table this batch. */
|
||||||
|
flush(input: {
|
||||||
|
inputHash: string;
|
||||||
|
snapshot: KtxSchemaSnapshot;
|
||||||
|
descriptionUpdates: LocalDescriptionUpdates;
|
||||||
|
changedTableNames: ReadonlySet<string>;
|
||||||
|
}): Promise<void>;
|
||||||
|
}
|
||||||
|
|
||||||
|
export function createKtxScanDescriptionResumeStore(deps: {
|
||||||
|
project: KtxLocalProject;
|
||||||
|
connectionId: string;
|
||||||
|
syncId: string;
|
||||||
|
driver: KtxConnectionDriver;
|
||||||
|
}): KtxScanDescriptionResumeStore {
|
||||||
|
const path = descriptionsProgressPath(deps.connectionId);
|
||||||
|
return {
|
||||||
|
async load(inputHash) {
|
||||||
|
let content: string;
|
||||||
|
try {
|
||||||
|
({ content } = await deps.project.fileStore.readFile(path));
|
||||||
|
} catch {
|
||||||
|
return null;
|
||||||
|
}
|
||||||
|
try {
|
||||||
|
const record = JSON.parse(content) as DescriptionsProgressRecord | null;
|
||||||
|
// A changed inputHash (schema or enrichment settings changed) ignores the
|
||||||
|
// prior record and recomputes — spec-19's inputHash-gated resume semantics.
|
||||||
|
if (!record || record.inputHash !== inputHash || !Array.isArray(record.descriptions)) {
|
||||||
|
return null;
|
||||||
|
}
|
||||||
|
return record.descriptions;
|
||||||
|
} catch {
|
||||||
|
return null;
|
||||||
|
}
|
||||||
|
},
|
||||||
|
async flush({ inputHash, snapshot, descriptionUpdates, changedTableNames }) {
|
||||||
|
const record: DescriptionsProgressRecord = { inputHash, descriptions: descriptionUpdates };
|
||||||
|
await writeJsonArtifact(
|
||||||
|
deps.project,
|
||||||
|
path,
|
||||||
|
record,
|
||||||
|
`scan(${LIVE_DATABASE_ADAPTER}): flush enrichment descriptions progress syncId=${deps.syncId}`,
|
||||||
|
);
|
||||||
|
await writeLocalScanManifestShards({
|
||||||
|
project: deps.project,
|
||||||
|
connectionId: deps.connectionId,
|
||||||
|
syncId: deps.syncId,
|
||||||
|
driver: deps.driver,
|
||||||
|
snapshot,
|
||||||
|
descriptionUpdates,
|
||||||
|
dryRun: false,
|
||||||
|
onlyChangedTableNames: changedTableNames,
|
||||||
|
});
|
||||||
|
},
|
||||||
|
};
|
||||||
|
}
|
||||||
|
|
||||||
async function writeJsonArtifact(
|
async function writeJsonArtifact(
|
||||||
project: KtxLocalProject,
|
project: KtxLocalProject,
|
||||||
path: string,
|
path: string,
|
||||||
|
|
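The resume store's `load` is gated on `inputHash`: a record written under different schema or enrichment settings is ignored rather than merged. The sketch below isolates that gate as a pure function. The types are trimmed, `loadResumable` takes the file content directly instead of reading through `fileStore`, and `pendingTables` is a hypothetical helper showing how a caller might decide which tables still need enrichment (the real consumer is not part of this diff).

```typescript
interface TableRef { catalog: string | null; db: string | null; name: string }
interface DescriptionUpdate {
  table: TableRef;
  tableDescription: string | null;
  columnDescriptions: Record<string, string | null>;
}
interface ProgressRecord { inputHash: string; descriptions: DescriptionUpdate[] }

// Returns prior descriptions only when the record parses and its hash matches.
// `content === null` models a missing file.
function loadResumable(content: string | null, inputHash: string): DescriptionUpdate[] | null {
  if (content === null) return null;
  try {
    const record = JSON.parse(content) as ProgressRecord | null;
    if (!record || record.inputHash !== inputHash || !Array.isArray(record.descriptions)) {
      return null;
    }
    return record.descriptions;
  } catch {
    // Corrupt or truncated JSON is treated the same as no record.
    return null;
  }
}

// Hypothetical helper: tables in the snapshot that have no flushed entry yet.
function pendingTables(all: readonly TableRef[], resumed: readonly DescriptionUpdate[] | null): TableRef[] {
  const key = (ref: TableRef) => JSON.stringify([ref.catalog, ref.db, ref.name]);
  const done = new Set((resumed ?? []).map((update) => key(update.table)));
  return all.filter((ref) => !done.has(key(ref)));
}
```

Treating a hash mismatch and a parse failure identically keeps the resume path simple: either the record is trusted in full, or the stage recomputes from scratch.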
@@ -331,6 +446,9 @@ export async function writeLocalScanManifestShards(
|
||||||
|
|
||||||
const manifestShards: string[] = [];
|
const manifestShards: string[] = [];
|
||||||
for (const [shardKey, shard] of [...shards.entries()].sort(([left], [right]) => left.localeCompare(right))) {
|
for (const [shardKey, shard] of [...shards.entries()].sort(([left], [right]) => left.localeCompare(right))) {
|
||||||
|
if (input.onlyChangedTableNames && !Object.keys(shard.tables).some((table) => input.onlyChangedTableNames!.has(table))) {
|
||||||
|
continue;
|
||||||
|
}
|
||||||
const path = `${schemaDir(input.connectionId)}/${shardKey}.yaml`;
|
const path = `${schemaDir(input.connectionId)}/${shardKey}.yaml`;
|
||||||
await input.project.fileStore.writeFile(
|
await input.project.fileStore.writeFile(
|
||||||
path,
|
path,
|
||||||
|
|
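The `onlyChangedTableNames` guard added above selects which shards are written without changing which shards are built. A minimal sketch of that selection, assuming a simplified `Shard` shape (the real `LiveDatabaseManifestShard` carries more fields) and a hypothetical `shardsToWrite` helper:

```typescript
// Simplified shard: only the `tables` map matters for the filter.
interface Shard { tables: Record<string, unknown> }

// Returns the shard keys that would be written, in the same sorted order the
// diff iterates in. With no filter set, every shard is written.
function shardsToWrite(
  shards: ReadonlyMap<string, Shard>,
  onlyChangedTableNames?: ReadonlySet<string>,
): string[] {
  return [...shards.entries()]
    .sort(([left], [right]) => left.localeCompare(right))
    .filter(
      ([, shard]) =>
        !onlyChangedTableNames ||
        Object.keys(shard.tables).some((table) => onlyChangedTableNames.has(table)),
    )
    .map(([shardKey]) => shardKey);
}

const shards = new Map<string, Shard>([
  ['b', { tables: { invoices: {}, payments: {} } }],
  ['a', { tables: { customers: {}, orders: {} } }],
]);
const changedOnly = shardsToWrite(shards, new Set(['orders']));
```

With these inputs `changedOnly` contains only shard `a`; shard `b` would be left untouched on disk, which is what bounds the size of each incremental commit.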
@@ -348,23 +466,14 @@ export async function writeLocalScanManifestShards(
|
||||||
};
|
};
|
||||||
}
|
}
|
||||||
|
|
||||||
export async function writeLocalScanEnrichmentArtifacts(
|
async function writeEnrichmentDescriptionArtifacts(input: {
|
||||||
input: WriteLocalScanEnrichmentArtifactsInput,
|
project: KtxLocalProject;
|
||||||
): Promise<WriteLocalScanEnrichmentArtifactsResult> {
|
enrichmentRoot: string;
|
||||||
if (input.dryRun) {
|
syncId: string;
|
||||||
return {
|
enrichment: KtxLocalScanEnrichmentResult;
|
||||||
enrichmentArtifacts: [],
|
}): Promise<string[]> {
|
||||||
manifestShards: [],
|
const descriptionsArtifact = `${input.enrichmentRoot}/descriptions.json`;
|
||||||
manifestShardsWritten: 0,
|
const embeddingsArtifact = `${input.enrichmentRoot}/embeddings.json`;
|
||||||
};
|
|
||||||
}
|
|
||||||
|
|
||||||
const enrichmentRoot = artifactDir(input.connectionId, input.syncId);
|
|
||||||
const descriptionsArtifact = `${enrichmentRoot}/descriptions.json`;
|
|
||||||
const embeddingsArtifact = `${enrichmentRoot}/embeddings.json`;
|
|
||||||
const relationshipsArtifact = `${enrichmentRoot}/relationships.json`;
|
|
||||||
const relationshipProfileArtifact = `${enrichmentRoot}/relationship-profile.json`;
|
|
||||||
const relationshipDiagnosticsArtifact = `${enrichmentRoot}/relationship-diagnostics.json`;
|
|
||||||
const enrichmentArtifacts: string[] = [];
|
const enrichmentArtifacts: string[] = [];
|
||||||
|
|
||||||
if (
|
if (
|
||||||
|
|
@@ -388,6 +497,67 @@ export async function writeLocalScanEnrichmentArtifacts(
|
||||||
`scan(${LIVE_DATABASE_ADAPTER}): write enrichment embeddings syncId=${input.syncId}`,
|
`scan(${LIVE_DATABASE_ADAPTER}): write enrichment embeddings syncId=${input.syncId}`,
|
||||||
);
|
);
|
||||||
}
|
}
|
||||||
|
return enrichmentArtifacts;
|
||||||
|
}
|
||||||
|
|
||||||
|
/**
|
||||||
|
* Promote the descriptions + embeddings into the queryable `_schema` manifest
|
||||||
|
* (and the raw enrichment artifacts) before relationship detection runs. The
|
||||||
|
* generated joins and the relationship diagnostics are deliberately left to the
|
||||||
|
* final write, so an interrupted relationship stage never loses the paid LLM
|
||||||
|
* enrichment and never emits empty relationship diagnostics.
|
||||||
|
*/
|
||||||
|
export async function writeLocalScanEnrichmentCheckpoint(
|
||||||
|
input: WriteLocalScanEnrichmentArtifactsInput,
|
||||||
|
): Promise<WriteLocalScanEnrichmentArtifactsResult> {
|
||||||
|
if (input.dryRun) {
|
||||||
|
return { enrichmentArtifacts: [], manifestShards: [], manifestShardsWritten: 0 };
|
||||||
|
}
|
||||||
|
|
||||||
|
const enrichmentArtifacts = await writeEnrichmentDescriptionArtifacts({
|
||||||
|
project: input.project,
|
||||||
|
enrichmentRoot: artifactDir(input.connectionId, input.syncId),
|
||||||
|
syncId: input.syncId,
|
||||||
|
enrichment: input.enrichment,
|
||||||
|
});
|
||||||
|
const manifestResult = await writeLocalScanManifestShards({
|
||||||
|
project: input.project,
|
||||||
|
connectionId: input.connectionId,
|
||||||
|
syncId: input.syncId,
|
||||||
|
driver: input.driver,
|
||||||
|
snapshot: input.enrichment.snapshot,
|
||||||
|
descriptionUpdates: input.enrichment.descriptionUpdates,
|
||||||
|
dryRun: false,
|
||||||
|
});
|
||||||
|
|
||||||
|
return {
|
||||||
|
enrichmentArtifacts,
|
||||||
|
manifestShards: manifestResult.manifestShards,
|
||||||
|
manifestShardsWritten: manifestResult.manifestShardsWritten,
|
||||||
|
};
|
||||||
|
}
|
||||||
|
|
||||||
|
export async function writeLocalScanEnrichmentArtifacts(
|
||||||
|
input: WriteLocalScanEnrichmentArtifactsInput,
|
||||||
|
): Promise<WriteLocalScanEnrichmentArtifactsResult> {
|
||||||
|
if (input.dryRun) {
|
||||||
|
return {
|
||||||
|
enrichmentArtifacts: [],
|
||||||
|
manifestShards: [],
|
||||||
|
manifestShardsWritten: 0,
|
||||||
|
};
|
||||||
|
}
|
||||||
|
|
||||||
|
const enrichmentRoot = artifactDir(input.connectionId, input.syncId);
|
||||||
|
const relationshipsArtifact = `${enrichmentRoot}/relationships.json`;
|
||||||
|
const relationshipProfileArtifact = `${enrichmentRoot}/relationship-profile.json`;
|
||||||
|
const relationshipDiagnosticsArtifact = `${enrichmentRoot}/relationship-diagnostics.json`;
|
||||||
|
const enrichmentArtifacts = await writeEnrichmentDescriptionArtifacts({
|
||||||
|
project: input.project,
|
||||||
|
enrichmentRoot,
|
||||||
|
syncId: input.syncId,
|
||||||
|
enrichment: input.enrichment,
|
||||||
|
});
|
||||||
enrichmentArtifacts.push(relationshipsArtifact, relationshipProfileArtifact, relationshipDiagnosticsArtifact);
|
enrichmentArtifacts.push(relationshipsArtifact, relationshipProfileArtifact, relationshipDiagnosticsArtifact);
|
||||||
const hasResolvedRelationships = input.enrichment.resolvedRelationships !== null;
|
const hasResolvedRelationships = input.enrichment.resolvedRelationships !== null;
|
||||||
const relationshipArtifacts = buildKtxRelationshipArtifacts({
|
const relationshipArtifacts = buildKtxRelationshipArtifacts({
|
||||||
|
|
@@ -413,6 +583,7 @@ export async function writeLocalScanEnrichmentArtifacts(
|
||||||
artifacts: relationshipArtifacts,
|
artifacts: relationshipArtifacts,
|
||||||
profile: relationshipProfile,
|
profile: relationshipProfile,
|
||||||
warnings: input.enrichment.warnings,
|
warnings: input.enrichment.warnings,
|
||||||
|
partial: input.enrichment.relationshipPartial,
|
||||||
thresholds: input.relationshipSettings
|
thresholds: input.relationshipSettings
|
||||||
? {
|
? {
|
||||||
acceptThreshold: input.relationshipSettings.acceptThreshold,
|
acceptThreshold: input.relationshipSettings.acceptThreshold,
|
||||||
|
|
|
||||||
|
|
@@ -6,11 +6,19 @@ import { KtxDescriptionGenerator } from './description-generation.js';
|
||||||
import { buildKtxColumnEmbeddingText } from './embedding-text.js';
|
import { buildKtxColumnEmbeddingText } from './embedding-text.js';
|
||||||
import {
|
import {
|
||||||
completedKtxScanEnrichmentStateSummary,
|
completedKtxScanEnrichmentStateSummary,
|
||||||
computeKtxScanEnrichmentInputHash,
|
computeKtxDescriptionsStageHash,
|
||||||
|
computeKtxEmbeddingsStageHash,
|
||||||
|
computeKtxRelationshipsStageHash,
|
||||||
|
computeKtxScanDescriptionDigest,
|
||||||
|
KTX_SCAN_ENRICHMENT_STAGES,
|
||||||
|
type KtxScanEmbeddingIdentity,
|
||||||
type KtxScanEnrichmentStateStore,
|
type KtxScanEnrichmentStateStore,
|
||||||
|
type KtxScanLlmIdentity,
|
||||||
summarizeKtxScanEnrichmentState,
|
summarizeKtxScanEnrichmentState,
|
||||||
} from './enrichment-state.js';
|
} from './enrichment-state.js';
|
||||||
import { skippedKtxScanEnrichmentSummary } from './enrichment-summary.js';
|
import { skippedKtxScanEnrichmentSummary } from './enrichment-summary.js';
|
||||||
|
import type { KtxScanDescriptionResumeStore } from './local-enrichment-artifacts.js';
|
||||||
|
import { tableRefKey } from './table-ref.js';
|
||||||
import type {
|
import type {
|
||||||
KtxEmbeddingUpdate,
|
KtxEmbeddingUpdate,
|
||||||
KtxEnrichedColumn,
|
KtxEnrichedColumn,
|
||||||
|
|
@@ -21,6 +29,7 @@ import type {
|
||||||
KtxRelationshipUpdate,
|
KtxRelationshipUpdate,
|
||||||
} from './enrichment-types.js';
|
} from './enrichment-types.js';
|
||||||
import type { KtxCompositeRelationshipCandidate } from './relationship-composite-candidates.js';
|
import type { KtxCompositeRelationshipCandidate } from './relationship-composite-candidates.js';
|
||||||
|
import type { KtxRelationshipDetectionStopReason } from './relationship-detection-budget.js';
|
||||||
import type { KtxResolvedRelationshipDiscoveryCandidate } from './relationship-graph-resolver.js';
|
import type { KtxResolvedRelationshipDiscoveryCandidate } from './relationship-graph-resolver.js';
|
||||||
import { discoverKtxRelationships } from './relationship-discovery.js';
|
import { discoverKtxRelationships } from './relationship-discovery.js';
|
||||||
import type { KtxRelationshipProfileArtifact } from './relationship-profiling.js';
|
import type { KtxRelationshipProfileArtifact } from './relationship-profiling.js';
|
||||||
|
|
@@ -42,7 +51,13 @@ import type {
|
||||||
KtxTableRef,
|
KtxTableRef,
|
||||||
} from './types.js';
|
} from './types.js';
|
||||||
|
|
||||||
const DESCRIPTION_TABLE_CONCURRENCY = 4;
|
// Parallel per-table description generations. Default 4; raise via
|
||||||
|
// KTX_ENRICH_TABLE_CONCURRENCY for large schemas (the rate-limit governor still
|
||||||
|
// throttles if the provider pushes back, so a higher cap is safe headroom).
|
||||||
|
const DESCRIPTION_TABLE_CONCURRENCY = (() => {
|
||||||
|
const raw = Number(process.env.KTX_ENRICH_TABLE_CONCURRENCY);
|
||||||
|
return Number.isInteger(raw) && raw >= 1 && raw <= 64 ? raw : 4;
|
||||||
|
})();
|
||||||
|
|
||||||
export interface KtxLocalScanEnrichmentProviders {
|
export interface KtxLocalScanEnrichmentProviders {
|
||||||
llmRuntime: KtxLlmRuntimePort;
|
llmRuntime: KtxLlmRuntimePort;
|
||||||
|
|
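The concurrency override above accepts only whole numbers in a fixed range and otherwise falls back to the default. The same validation, pulled into a testable function, might look like the sketch below; `parseConcurrency` is a hypothetical name, while the bounds (1 to 64) and the default of 4 come from the hunk.

```typescript
// Accepts an integer in [1, 64]; anything else yields the fallback.
// Number(undefined) is NaN and Number('') is 0, so both fall back.
function parseConcurrency(raw: string | undefined, fallback = 4): number {
  const value = Number(raw);
  return Number.isInteger(value) && value >= 1 && value <= 64 ? value : fallback;
}

const fromEnv = parseConcurrency('16');
const unset = parseConcurrency(undefined);
const tooHigh = parseConcurrency('128');
```

Falling back silently, rather than throwing, means a mistyped `KTX_ENRICH_TABLE_CONCURRENCY` cannot break a scan; the trade-off is that the typo goes unnoticed unless the effective value is logged somewhere.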
@@ -53,15 +68,45 @@ export interface KtxLocalScanEnrichmentInput {
|
||||||
connectionId: string;
|
connectionId: string;
|
||||||
mode: KtxScanMode;
|
mode: KtxScanMode;
|
||||||
detectRelationships?: boolean;
|
detectRelationships?: boolean;
|
||||||
|
/**
|
||||||
|
* Enrichment stages to (re)run this invocation. Undefined runs every eligible
|
||||||
|
* stage and respects the completed-stage short-circuit (spec-19 resume). When
|
||||||
|
* present, only the named stages run — each force-recomputes (bypassing the
|
||||||
|
* short-circuit) while unselected stages are left untouched on disk (D3).
|
||||||
|
*/
|
||||||
|
stages?: KtxScanEnrichmentStage[];
|
||||||
connector: KtxScanConnector;
|
connector: KtxScanConnector;
|
||||||
snapshot?: KtxSchemaSnapshot;
|
snapshot?: KtxSchemaSnapshot;
|
||||||
context: KtxScanContext;
|
context: KtxScanContext;
|
||||||
providers: KtxLocalScanEnrichmentProviders | null;
|
providers: KtxLocalScanEnrichmentProviders | null;
|
||||||
stateStore?: KtxScanEnrichmentStateStore | null;
|
stateStore?: KtxScanEnrichmentStateStore | null;
|
||||||
|
/**
|
||||||
|
* Durable per-batch resume record for the descriptions stage. When present, an
|
||||||
|
* interrupted descriptions stage resumes by re-enriching only the tables not
|
||||||
|
* already flushed (inputHash-gated). Null/undefined disables incremental flush.
|
||||||
|
*/
|
||||||
|
descriptionResumeStore?: KtxScanDescriptionResumeStore | null;
|
||||||
|
/**
|
||||||
|
* Lazily loads the descriptions already persisted in the on-disk _schema, used
|
||||||
|
* to feed embeddings + relationships their description context when the
|
||||||
|
* descriptions stage does not run this invocation (e.g. `--stages relationships`).
|
||||||
|
* Called at most once and only when a downstream stage needs it, so a normal
|
||||||
|
* full run never pays the read.
|
||||||
|
*/
|
||||||
|
loadPriorDescriptions?: (snapshot: KtxSchemaSnapshot) => Promise<KtxLocalScanEnrichmentResult['descriptionUpdates']>;
|
||||||
syncId?: string;
|
syncId?: string;
|
||||||
providerIdentity?: Record<string, unknown>;
|
/** Description-LLM identity that keys the descriptions + relationships stage hashes. */
|
||||||
|
llmIdentity?: KtxScanLlmIdentity;
|
||||||
|
/** Embedding-model identity that keys the embeddings stage hash. */
|
||||||
|
embeddingIdentity?: KtxScanEmbeddingIdentity;
|
||||||
relationshipSettings?: KtxScanRelationshipConfig;
|
relationshipSettings?: KtxScanRelationshipConfig;
|
||||||
now?: () => Date;
|
now?: () => Date;
|
||||||
|
/**
|
||||||
|
* Invoked once the last non-relationship stage completes and before
|
||||||
|
* relationship detection runs, so the descriptions + embeddings reach the
|
||||||
|
* queryable layer even if the relationship stage is later interrupted.
|
||||||
|
*/
|
||||||
|
onCheckpoint?: (checkpoint: KtxLocalScanEnrichmentResult) => Promise<void>;
|
||||||
}
|
}
|
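The `loadPriorDescriptions` contract above ("called at most once and only when a downstream stage needs it") can be illustrated with a small memoizing wrapper. This is a minimal sketch, not the code in this commit: `lazyOnce`, `loadCalls`, and the sample description string are hypothetical names, and the sketch caches the in-flight promise, whereas the commit caches the resolved value behind an `undefined` sentinel.

```typescript
// Hypothetical helper: wraps an async loader so the underlying read happens
// at most once, and only when a caller first asks for the value.
function lazyOnce<T>(load: () => Promise<T>): () => Promise<T> {
  let cached: Promise<T> | undefined;
  return () => {
    if (cached === undefined) {
      // First request: start the load and remember the promise.
      cached = load();
    }
    return cached;
  };
}

// A loader that is requested twice by downstream stages.
let loadCalls = 0;
const loadPrior = lazyOnce(async () => {
  loadCalls += 1;
  return ['orders: one row per customer order'];
});
const firstRequest = loadPrior();
const secondRequest = loadPrior();

// A loader that no stage ever asks for, as in a normal full run.
let unusedCalls = 0;
const neverRequested = lazyOnce(async () => {
  unusedCalls += 1;
  return [] as string[];
});
```

Caching the promise also caches a rejection, so a caller that wants to retry after a failed read would need to reset the cache; that trade-off is specific to this sketch.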
||||||
|
|
||||||
export interface KtxLocalScanEnrichmentResult {
|
export interface KtxLocalScanEnrichmentResult {
|
||||||
|
|
@@ -80,6 +125,7 @@ export interface KtxLocalScanEnrichmentResult {
|
||||||
relationshipProfile: KtxRelationshipProfileArtifact | null;
|
relationshipProfile: KtxRelationshipProfileArtifact | null;
|
||||||
resolvedRelationships: KtxResolvedRelationshipDiscoveryCandidate[] | null;
|
resolvedRelationships: KtxResolvedRelationshipDiscoveryCandidate[] | null;
|
||||||
compositeRelationships: KtxCompositeRelationshipCandidate[] | null;
|
compositeRelationships: KtxCompositeRelationshipCandidate[] | null;
|
||||||
|
relationshipPartial: { reason: KtxRelationshipDetectionStopReason } | null;
|
||||||
}
|
}
|
||||||
|
|
||||||
function tableId(table: KtxSchemaTable): string {
|
function tableId(table: KtxSchemaTable): string {
|
||||||
|
|
@@ -182,6 +228,17 @@ function providerlessEnrichedWarning(relationshipDetection: boolean): KtxScanWar
|
||||||
};
|
};
|
||||||
}
|
}
|
||||||
|
|
||||||
|
function stagePrerequisiteReason(stage: KtxScanEnrichmentStage): string {
|
||||||
|
switch (stage) {
|
||||||
|
case 'descriptions':
|
||||||
|
return 'LLM enrichment is not configured (set scan.enrichment.mode and an LLM provider)';
|
||||||
|
case 'embeddings':
|
||||||
|
return 'no embedding provider is configured (set scan.enrichment.embeddings)';
|
||||||
|
case 'relationships':
|
||||||
|
return 'relationship discovery is disabled (scan.relationships.enabled is false)';
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
export function createDeterministicLocalScanEnrichmentProviders(): KtxLocalScanEnrichmentProviders {
|
export function createDeterministicLocalScanEnrichmentProviders(): KtxLocalScanEnrichmentProviders {
|
||||||
return {
|
return {
|
||||||
llmRuntime: deterministicLlmRuntime(),
|
llmRuntime: deterministicLlmRuntime(),
|
||||||
|
|
@@ -209,18 +266,25 @@ function deterministicLlmRuntime(): KtxLlmRuntimePort {
|
||||||
async runAgentLoop() {
|
async runAgentLoop() {
|
||||||
return { stopReason: 'natural' };
|
return { stopReason: 'natural' };
|
||||||
},
|
},
|
||||||
|
subprocessForkSpec() {
|
||||||
|
return null;
|
||||||
|
},
|
||||||
};
|
};
|
||||||
}
|
}
|
||||||
|
|
||||||
export function snapshotToKtxEnrichedSchema(
|
export function snapshotToKtxEnrichedSchema(
|
||||||
snapshot: KtxSchemaSnapshot,
|
snapshot: KtxSchemaSnapshot,
|
||||||
embeddingsByColumnId: ReadonlyMap<string, number[]> = new Map(),
|
embeddingsByColumnId: ReadonlyMap<string, number[]> = new Map(),
|
||||||
|
descriptions: KtxLocalScanEnrichmentResult['descriptionUpdates'] = [],
|
||||||
): KtxEnrichedSchema {
|
): KtxEnrichedSchema {
|
||||||
|
const descriptionByTable = new Map(descriptions.map((item) => [tableRefKey(item.table), item]));
|
||||||
const tables: KtxEnrichedTable[] = snapshot.tables.map((table) => {
|
const tables: KtxEnrichedTable[] = snapshot.tables.map((table) => {
|
||||||
const id = tableId(table);
|
const id = tableId(table);
|
||||||
const ref = tableRef(table);
|
const ref = tableRef(table);
|
||||||
|
const tableDescription = descriptionByTable.get(tableRefKey(ref));
|
||||||
const columns: KtxEnrichedColumn[] = table.columns.map((column) => {
|
const columns: KtxEnrichedColumn[] = table.columns.map((column) => {
|
||||||
const idForColumn = columnId(table, column);
|
const idForColumn = columnId(table, column);
|
||||||
|
const aiColumnDescription = tableDescription?.columnDescriptions[column.name] ?? null;
|
||||||
return {
|
return {
|
||||||
id: idForColumn,
|
id: idForColumn,
|
||||||
tableId: id,
|
tableId: id,
|
||||||
|
|
@@ -234,6 +298,7 @@ export function snapshotToKtxEnrichedSchema(
|
||||||
parentColumnId: null,
|
parentColumnId: null,
|
||||||
descriptions: {
|
descriptions: {
|
||||||
...(column.comment ? { db: column.comment } : {}),
|
...(column.comment ? { db: column.comment } : {}),
|
||||||
|
...(aiColumnDescription ? { ai: aiColumnDescription } : {}),
|
||||||
},
|
},
|
||||||
embedding: embeddingsByColumnId.get(idForColumn) ?? null,
|
embedding: embeddingsByColumnId.get(idForColumn) ?? null,
|
||||||
sampleValues: null,
|
sampleValues: null,
|
||||||
|
|
@@ -246,6 +311,7 @@ export function snapshotToKtxEnrichedSchema(
|
||||||
enabled: true,
|
enabled: true,
|
||||||
descriptions: {
|
descriptions: {
|
||||||
...(table.comment ? { db: table.comment } : {}),
|
...(table.comment ? { db: table.comment } : {}),
|
||||||
|
...(tableDescription?.tableDescription ? { ai: tableDescription.tableDescription } : {}),
|
||||||
},
|
},
|
||||||
columns,
|
columns,
|
||||||
};
|
};
|
||||||
|
|
@@ -262,11 +328,31 @@ function embeddingBatchSize(maxBatchSize: number): number {
|
||||||
return Number.isInteger(maxBatchSize) && maxBatchSize > 0 ? maxBatchSize : 100;
|
return Number.isInteger(maxBatchSize) && maxBatchSize > 0 ? maxBatchSize : 100;
|
||||||
}
|
}
|
||||||
|
|
||||||
|
type KtxScanDescriptionUpdate = KtxLocalScanEnrichmentResult['descriptionUpdates'][number];
|
||||||
|
|
||||||
|
// Per-batch flush cadence: bounds the at-risk window (and the manifest-rewrite /
|
||||||
|
// git-commit cost) to a small number of tables.
|
||||||
|
const DESCRIPTION_FLUSH_EVERY = 10;
|
||||||
|
|
||||||
|
function isEnrichedDescriptionUpdate(update: KtxScanDescriptionUpdate): boolean {
|
||||||
|
return update.tableDescription !== null || Object.values(update.columnDescriptions).some((value) => value !== null);
|
||||||
|
}
|
||||||
|
|
||||||
|
function nullDescriptionUpdate(table: KtxSchemaTable): KtxScanDescriptionUpdate {
|
||||||
|
return {
|
||||||
|
table: tableRef(table),
|
||||||
|
tableDescription: null,
|
||||||
|
columnDescriptions: Object.fromEntries(table.columns.map((column) => [column.name, null])),
|
||||||
|
};
|
||||||
|
}
|
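The two helpers above define what counts as a recoverable description on resume: any non-null table or column description makes an update "enriched", while an all-null update is treated as a failure to retry. A simplified sketch of that rule follows; the `DescriptionUpdate` shape, `isEnriched`, `nullUpdate`, and the sample tables are illustrative stand-ins, and the real `KtxScanDescriptionUpdate` also carries a table reference.

```typescript
// Simplified, hypothetical stand-in for the description update shape.
interface DescriptionUpdate {
  name: string;
  tableDescription: string | null;
  columnDescriptions: Record<string, string | null>;
}

// An update is "enriched" if at least one description is non-null.
function isEnriched(update: DescriptionUpdate): boolean {
  const columns = update.columnDescriptions;
  return update.tableDescription !== null || Object.keys(columns).some((key) => columns[key] !== null);
}

// The placeholder recorded for a table whose enrichment failed or was skipped.
function nullUpdate(name: string, columnNames: string[]): DescriptionUpdate {
  const columnDescriptions: Record<string, string | null> = {};
  for (const column of columnNames) {
    columnDescriptions[column] = null;
  }
  return { name, tableDescription: null, columnDescriptions };
}

const failed = nullUpdate('users', ['id', 'email']);
const partial: DescriptionUpdate = {
  name: 'orders',
  tableDescription: null,
  columnDescriptions: { id: 'Primary key', total: null },
};

// On resume, only enriched updates are kept; everything else is retried.
const recovered = [failed, partial].filter(isEnriched).map((update) => update.name);
const remaining = ['users', 'orders', 'items'].filter((name) => recovered.indexOf(name) === -1);
```

Here `orders` is recovered because one column description survived, while `users` (all null) and `items` (never attempted) are both left for the next pass.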
||||||
|
|
||||||
async function generateDescriptions(input: {
|
async function generateDescriptions(input: {
|
||||||
snapshot: KtxSchemaSnapshot;
|
snapshot: KtxSchemaSnapshot;
|
||||||
connector: KtxScanConnector;
|
connector: KtxScanConnector;
|
||||||
context: KtxScanContext;
|
context: KtxScanContext;
|
||||||
providers: KtxLocalScanEnrichmentProviders;
|
providers: KtxLocalScanEnrichmentProviders;
|
||||||
|
inputHash: string;
|
||||||
|
resumeStore?: KtxScanDescriptionResumeStore | null;
|
||||||
progress?: KtxProgressPort;
|
progress?: KtxProgressPort;
|
||||||
warnings?: KtxScanWarning[];
|
warnings?: KtxScanWarning[];
|
||||||
}): Promise<KtxLocalScanEnrichmentResult['descriptionUpdates']> {
|
}): Promise<KtxLocalScanEnrichmentResult['descriptionUpdates']> {
|
||||||
|
|
@@ -289,67 +375,139 @@ async function generateDescriptions(input: {
|
||||||
},
|
},
|
||||||
});
|
});
|
||||||
|
|
||||||
const updates: KtxLocalScanEnrichmentResult['descriptionUpdates'] = [];
|
|
||||||
const totalTables = input.snapshot.tables.length;
|
const totalTables = input.snapshot.tables.length;
|
||||||
if (totalTables === 0) {
|
if (totalTables === 0) {
|
||||||
await input.progress?.update(1, 'No tables to describe');
|
await input.progress?.update(1, 'No tables to describe');
|
||||||
return updates;
|
return [];
|
||||||
}
|
}
|
||||||
|
|
||||||
|
// Resume: recover already-enriched tables (inputHash-gated) and re-issue LLM
|
||||||
|
// calls only for the remainder. A failed/skipped table carries null descriptions
|
||||||
|
// and is not recovered, so it is retried.
|
||||||
|
const recovered = input.resumeStore ? ((await input.resumeStore.load(input.inputHash)) ?? []) : [];
|
||||||
|
const enrichedById = new Map<string, KtxScanDescriptionUpdate>();
|
||||||
|
for (const update of recovered) {
|
||||||
|
if (isEnrichedDescriptionUpdate(update)) {
|
||||||
|
enrichedById.set(tableRefKey(update.table), update);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
const remaining = input.snapshot.tables.filter((table) => !enrichedById.has(tableRefKey(tableRef(table))));
|
||||||
|
const recoveredCount = enrichedById.size;
|
||||||
|
if (recoveredCount > 0) {
|
||||||
|
input.context.logger?.info(
|
||||||
|
`[enrich] resume: recovered ${recoveredCount}/${totalTables} descriptions, enriching ${remaining.length}`,
|
||||||
|
);
|
||||||
|
}
|
||||||
|
|
||||||
|
const pendingChanged = new Set<string>();
|
||||||
|
let sinceFlush = 0;
|
||||||
|
let flushing = false;
|
||||||
|
const flush = async (force: boolean): Promise<void> => {
|
||||||
|
if (!input.resumeStore || flushing || pendingChanged.size === 0) {
|
||||||
|
return;
|
||||||
|
}
|
||||||
|
if (!force && sinceFlush < DESCRIPTION_FLUSH_EVERY) {
|
||||||
|
return;
|
||||||
|
}
|
||||||
|
flushing = true;
|
||||||
|
const changedTableNames = new Set(pendingChanged);
|
||||||
|
pendingChanged.clear();
|
||||||
|
sinceFlush = 0;
|
||||||
|
try {
|
||||||
|
await input.resumeStore.flush({
|
||||||
|
inputHash: input.inputHash,
|
||||||
|
snapshot: input.snapshot,
|
||||||
|
descriptionUpdates: [...enrichedById.values()],
|
||||||
|
changedTableNames,
|
||||||
|
});
|
||||||
|
} finally {
|
||||||
|
flushing = false;
|
||||||
|
}
|
||||||
|
};
|
||||||
|
|
||||||
const limitTable = pLimit(DESCRIPTION_TABLE_CONCURRENCY);
|
const limitTable = pLimit(DESCRIPTION_TABLE_CONCURRENCY);
|
||||||
const tableUpdates = await Promise.all(
|
await Promise.all(
|
||||||
input.snapshot.tables.map((table, index) =>
|
remaining.map((table, index) =>
|
||||||
limitTable(async () => {
|
limitTable(async () => {
|
||||||
await input.progress?.update(
|
await input.progress?.update(
|
||||||
(index + 1) / totalTables,
|
(recoveredCount + index + 1) / totalTables,
|
||||||
`Generating descriptions ${index + 1}/${totalTables} tables`,
|
`Generating descriptions ${recoveredCount + index + 1}/${totalTables} (${table.name}, ${table.columns.length} cols)`,
|
||||||
{
|
{
|
||||||
transient: true,
|
transient: true,
|
||||||
},
|
},
|
||||||
);
|
);
|
||||||
const batched = await generator.generateBatchedTableDescriptions({
|
// Stage-level guarantee: a single table's failure costs one missing
|
||||||
connectionId: input.snapshot.connectionId,
|
// description, never the whole stage's output. (generateBatchedTableDescriptions
|
||||||
connector: input.connector,
|
// already degrades its own failures to null descriptions; this backstop keeps
|
||||||
context: input.context,
|
// the guarantee at the fan-out even if a future path throws.) A genuine
|
||||||
dataSourceType: input.snapshot.driver,
|
// cancellation still propagates so the stage fails and resumes.
|
||||||
supportsNestedAnalysis: input.connector.capabilities.nestedAnalysis,
|
let update: KtxScanDescriptionUpdate;
|
||||||
table: {
|
try {
|
||||||
catalog: table.catalog,
|
const batched = await generator.generateBatchedTableDescriptions({
|
||||||
db: table.db,
|
connectionId: input.snapshot.connectionId,
|
||||||
name: table.name,
|
connector: input.connector,
|
||||||
rawDescriptions: table.comment ? { db: table.comment } : {},
|
context: input.context,
|
||||||
columns: table.columns.map((column) => ({
|
dataSourceType: input.snapshot.driver,
|
||||||
name: column.name,
|
supportsNestedAnalysis: input.connector.capabilities.nestedAnalysis,
|
||||||
type: column.nativeType,
|
table: {
|
||||||
...(column.comment ? { rawDescriptions: { db: column.comment } } : {}),
|
catalog: table.catalog,
|
||||||
})),
|
db: table.db,
|
||||||
},
|
name: table.name,
|
||||||
});
|
rawDescriptions: table.comment ? { db: table.comment } : {},
|
||||||
return {
|
columns: table.columns.map((column) => ({
|
||||||
table: tableRef(table),
|
name: column.name,
|
||||||
tableDescription: batched.tableDescription,
|
type: column.nativeType,
|
||||||
columnDescriptions: Object.fromEntries(batched.columnDescriptions),
|
...(column.comment ? { rawDescriptions: { db: column.comment } } : {}),
|
||||||
};
|
})),
|
||||||
|
},
|
||||||
|
});
|
||||||
|
update = {
|
||||||
|
table: tableRef(table),
|
||||||
|
tableDescription: batched.tableDescription,
|
||||||
|
columnDescriptions: Object.fromEntries(batched.columnDescriptions),
|
||||||
|
};
|
||||||
|
} catch (error) {
|
||||||
|
if (input.context.signal?.aborted) {
|
||||||
|
throw error;
|
||||||
|
}
|
||||||
|
const message = error instanceof Error ? error.message : String(error);
|
||||||
|
input.context.logger?.warn(`[enrich] table ${table.name} failed: ${message}`);
|
||||||
|
warningSink?.push({
|
||||||
|
code: 'enrichment_failed',
|
||||||
|
message: `Failed to generate description for ${table.name}: ${message}`,
|
||||||
|
table: table.name,
|
||||||
|
recoverable: true,
|
||||||
|
metadata: {},
|
||||||
|
});
|
||||||
|
update = nullDescriptionUpdate(table);
|
||||||
|
}
|
||||||
|
if (isEnrichedDescriptionUpdate(update)) {
|
||||||
|
enrichedById.set(tableRefKey(tableRef(table)), update);
|
||||||
|
pendingChanged.add(table.name);
|
||||||
|
sinceFlush += 1;
|
||||||
|
await flush(false);
|
||||||
|
}
|
||||||
}),
|
}),
|
||||||
),
|
),
|
||||||
);
|
);
|
||||||
updates.push(...tableUpdates);
|
await flush(true);
|
||||||
await input.progress?.update(1, `Generated descriptions for ${totalTables} tables`);
|
await input.progress?.update(1, `Generated descriptions for ${totalTables} tables`);
|
||||||
return updates;
|
// Full set in snapshot order: recovered + freshly enriched, null for any still-failed.
|
||||||
|
return input.snapshot.tables.map((table) => enrichedById.get(tableRefKey(tableRef(table))) ?? nullDescriptionUpdate(table));
|
||||||
}
|
}
|
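The flush cadence in `generateDescriptions` (flush after every `DESCRIPTION_FLUSH_EVERY` enriched tables, then force a final flush) can be sketched in isolation. The version below is deliberately synchronous and uses hypothetical names (`createBatchedFlusher`, `record`, `finish`, `flushedBatches`); it omits the `flushing` in-flight guard that the commit needs because its flush is asynchronous and runs under concurrent table workers.

```typescript
// Hypothetical sink type: receives the names changed since the last flush.
type FlushSink = (changed: string[]) => void;

function createBatchedFlusher(flushEvery: number, sink: FlushSink) {
  const pending = new Set<string>();
  let sinceFlush = 0;

  const flush = (force: boolean): void => {
    if (pending.size === 0) {
      return; // nothing new to persist
    }
    if (!force && sinceFlush < flushEvery) {
      return; // below the cadence threshold; wait for more completions
    }
    const changed = Array.from(pending);
    pending.clear();
    sinceFlush = 0;
    sink(changed);
  };

  return {
    // Called once per successfully enriched item.
    record(name: string): void {
      pending.add(name);
      sinceFlush += 1;
      flush(false);
    },
    // Called after the fan-out completes to persist the tail.
    finish(): void {
      flush(true);
    },
  };
}

const flushedBatches: string[][] = [];
const flusher = createBatchedFlusher(10, (changed) => {
  flushedBatches.push(changed);
});
for (let i = 0; i < 23; i += 1) {
  flusher.record(`table_${i}`);
}
flusher.finish();
```

With 23 recorded items and a cadence of 10, the sink receives batches of 10, 10, and 3, so an interruption between flushes loses at most the work since the last batch.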
||||||
|
|
||||||
async function buildEmbeddings(input: {
|
// The exact per-column text fed to the embedding model. Shared by the embeddings
|
||||||
snapshot: KtxSchemaSnapshot;
|
// stage and the descriptionDigest so the embeddings hash content-addresses the
|
||||||
embedding: KtxEmbeddingPort;
|
// real text the model sees (D4).
|
||||||
descriptions: KtxLocalScanEnrichmentResult['descriptionUpdates'];
|
function buildKtxColumnEmbeddingTexts(
|
||||||
progress?: KtxProgressPort;
|
snapshot: KtxSchemaSnapshot,
|
||||||
}): Promise<{ updates: KtxEmbeddingUpdate[]; byColumnId: Map<string, number[]> }> {
|
descriptions: KtxLocalScanEnrichmentResult['descriptionUpdates'],
|
||||||
const descriptionByTable = new Map(input.descriptions.map((item) => [item.table.name, item]));
|
): Array<{ columnId: string; text: string }> {
|
||||||
|
const descriptionByTable = new Map(descriptions.map((item) => [tableRefKey(item.table), item]));
|
||||||
const texts: Array<{ columnId: string; text: string }> = [];
|
const texts: Array<{ columnId: string; text: string }> = [];
|
||||||
|
for (const table of snapshot.tables) {
|
||||||
for (const table of input.snapshot.tables) {
|
const tableDescriptions = descriptionByTable.get(tableRefKey(tableRef(table)));
|
||||||
const tableDescriptions = descriptionByTable.get(table.name);
|
|
||||||
for (const column of table.columns) {
|
for (const column of table.columns) {
|
||||||
const id = columnId(table, column);
|
|
||||||
const text = buildKtxColumnEmbeddingText({
|
const text = buildKtxColumnEmbeddingText({
|
||||||
tableName: table.name,
|
tableName: table.name,
|
||||||
columnName: column.name,
|
columnName: column.name,
|
||||||
|
|
@@ -364,9 +522,18 @@ async function buildEmbeddings(input: {
|
||||||
incoming: [],
|
incoming: [],
|
||||||
},
|
},
|
||||||
});
|
});
|
||||||
texts.push({ columnId: id, text });
|
texts.push({ columnId: columnId(table, column), text });
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
return texts;
|
||||||
|
}
|
||||||
|
|
||||||
|
async function buildEmbeddings(input: {
|
||||||
|
embedding: KtxEmbeddingPort;
|
||||||
|
texts: Array<{ columnId: string; text: string }>;
|
||||||
|
progress?: KtxProgressPort;
|
||||||
|
}): Promise<{ updates: KtxEmbeddingUpdate[]; byColumnId: Map<string, number[]> }> {
|
||||||
|
const texts = input.texts;
|
||||||
|
|
||||||
const embeddings: number[][] = [];
|
const embeddings: number[][] = [];
|
||||||
const maxBatchSize = embeddingBatchSize(input.embedding.maxBatchSize);
|
const maxBatchSize = embeddingBatchSize(input.embedding.maxBatchSize);
|
||||||
|
|
@@ -416,17 +583,26 @@ async function runEnrichmentStage<TOutput>(input: {
|
||||||
resumedStages: KtxScanEnrichmentStage[];
|
resumedStages: KtxScanEnrichmentStage[];
|
||||||
completedStages: KtxScanEnrichmentStage[];
|
completedStages: KtxScanEnrichmentStage[];
|
||||||
failedStages: KtxScanEnrichmentStage[];
|
failedStages: KtxScanEnrichmentStage[];
|
||||||
|
/**
|
||||||
|
* When true the stage re-enters compute() even if a completed row matches,
|
||||||
|
* skipping the spec-19 short-circuit. The intent of naming a stage in
|
||||||
|
* `--stages` is "recompute this" (D3); the inner compute() still honors the
|
||||||
|
* spec-20 per-table resume record.
|
||||||
|
*/
|
||||||
|
forceRecompute?: boolean;
|
||||||
compute: () => Promise<TOutput>;
|
compute: () => Promise<TOutput>;
|
||||||
}): Promise<TOutput> {
|
}): Promise<TOutput> {
|
||||||
const existing = await input.stateStore?.findCompletedStage<TOutput>({
|
if (!input.forceRecompute) {
|
||||||
runId: input.runId,
|
const existing = await input.stateStore?.findCompletedStage<TOutput>({
|
||||||
stage: input.stage,
|
connectionId: input.connectionId,
|
||||||
inputHash: input.inputHash,
|
stage: input.stage,
|
||||||
});
|
inputHash: input.inputHash,
|
||||||
if (existing) {
|
});
|
||||||
input.resumedStages.push(input.stage);
|
if (existing) {
|
||||||
input.completedStages.push(input.stage);
|
input.resumedStages.push(input.stage);
|
||||||
return existing.output;
|
input.completedStages.push(input.stage);
|
||||||
|
return existing.output;
|
||||||
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
try {
|
try {
|
||||||
|
|
@@ -493,17 +669,39 @@ export async function runLocalScanEnrichment(
|
||||||
const state = completedKtxScanEnrichmentStateSummary();
|
const state = completedKtxScanEnrichmentStateSummary();
|
||||||
const syncId = input.syncId ?? input.context.runId;
|
const syncId = input.syncId ?? input.context.runId;
|
||||||
const relationshipSettings = input.relationshipSettings ?? buildDefaultKtxProjectConfig().scan.relationships;
|
const relationshipSettings = input.relationshipSettings ?? buildDefaultKtxProjectConfig().scan.relationships;
|
||||||
const inputHash = computeKtxScanEnrichmentInputHash({
|
const llmIdentity: KtxScanLlmIdentity = input.llmIdentity ?? { model: null, baseUrlConfigured: false };
|
||||||
snapshot,
|
const embeddingIdentity: KtxScanEmbeddingIdentity = input.embeddingIdentity ?? {
|
||||||
mode: input.mode,
|
model: null,
|
||||||
detectRelationships: input.detectRelationships ?? false,
|
dimensions: null,
|
||||||
providerIdentity: input.providerIdentity ?? {},
|
batchSize: null,
|
||||||
relationshipSettings,
|
};
|
||||||
});
|
const descriptionsHash = computeKtxDescriptionsStageHash({ snapshot, llmIdentity });
|
||||||
|
const relationshipsHash = computeKtxRelationshipsStageHash({ snapshot, relationshipSettings, llmIdentity });
|
||||||
const warnings: KtxScanWarning[] = [];
|
const warnings: KtxScanWarning[] = [];
|
||||||
|
const selectedStages = input.stages;
|
||||||
|
const runsStage = (stage: KtxScanEnrichmentStage): boolean =>
|
||||||
|
selectedStages === undefined || selectedStages.includes(stage);
|
||||||
|
const forcesStage = (stage: KtxScanEnrichmentStage): boolean =>
|
||||||
|
selectedStages !== undefined && selectedStages.includes(stage);
|
||||||
|
|
||||||
let descriptions: KtxLocalScanEnrichmentResult['descriptionUpdates'] = [];
|
let descriptions: KtxLocalScanEnrichmentResult['descriptionUpdates'] = [];
|
||||||
|
let descriptionsRanThisInvocation = false;
|
||||||
|
let priorDescriptions: KtxLocalScanEnrichmentResult['descriptionUpdates'] | null | undefined;
|
||||||
|
// Best-available descriptions for the downstream stages (embeddings,
|
||||||
|
// relationships): fresh ones when descriptions ran this invocation, else the
|
||||||
|
// descriptions persisted in the on-disk _schema. Behavior follows the input
|
||||||
|
// (did descriptions run?), not which stage subset the caller selected (D5).
|
||||||
|
const resolveDownstreamDescriptions = async (): Promise<KtxLocalScanEnrichmentResult['descriptionUpdates']> => {
|
||||||
|
if (descriptionsRanThisInvocation) {
|
||||||
|
return descriptions;
|
||||||
|
}
|
||||||
|
if (priorDescriptions === undefined) {
|
||||||
|
priorDescriptions = input.loadPriorDescriptions ? await input.loadPriorDescriptions(snapshot) : null;
|
||||||
|
}
|
||||||
|
return priorDescriptions ?? [];
|
||||||
|
};
|
||||||
|
|
||||||
let embeddingUpdates: KtxEmbeddingUpdate[] = [];
|
let embeddingUpdates: KtxEmbeddingUpdate[] = [];
|
||||||
let schema = snapshotToKtxEnrichedSchema(snapshot);
|
|
||||||
const summary: KtxScanEnrichmentSummary = { ...skippedKtxScanEnrichmentSummary };
|
const summary: KtxScanEnrichmentSummary = { ...skippedKtxScanEnrichmentSummary };
|
||||||
const relationshipDetectionEnabled = relationshipSettings.enabled;
|
const relationshipDetectionEnabled = relationshipSettings.enabled;
|
||||||
const shouldDetectRelationships =
|
const shouldDetectRelationships =
|
||||||
|
|
@@ -514,38 +712,70 @@ export async function runLocalScanEnrichment(
|
||||||
warnings.push(providerlessEnrichedWarning(shouldDetectRelationships));
|
warnings.push(providerlessEnrichedWarning(shouldDetectRelationships));
|
||||||
}
|
}
|
||||||
|
|
||||||
|
// A stage explicitly named in --stages whose prerequisite is missing must be
|
||||||
|
// surfaced, never silently no-op (D2).
|
||||||
|
if (selectedStages !== undefined) {
|
||||||
|
const stageEligible: Record<KtxScanEnrichmentStage, boolean> = {
|
||||||
|
descriptions: input.mode === 'enriched' && input.providers != null,
|
||||||
|
embeddings: input.mode === 'enriched' && input.providers?.embedding != null,
|
||||||
|
relationships: shouldDetectRelationships,
|
||||||
|
};
|
||||||
|
for (const stage of selectedStages) {
|
||||||
|
if (!stageEligible[stage]) {
|
||||||
|
warnings.push({
|
||||||
|
code: 'enrichment_stage_skipped',
|
||||||
|
message: `Requested --stages ${stage}, but it cannot run: ${stagePrerequisiteReason(stage)}.`,
|
||||||
|
recoverable: true,
|
||||||
|
metadata: { stage },
|
||||||
|
});
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
if (input.mode === 'enriched' && input.providers) {
|
if (input.mode === 'enriched' && input.providers) {
|
||||||
const providers = input.providers;
|
const providers = input.providers;
|
||||||
const descriptionProgress = progress?.startPhase(0.45);
|
if (runsStage('descriptions')) {
|
||||||
descriptions = await runEnrichmentStage({
|
const descriptionProgress = progress?.startPhase(0.45);
|
||||||
stateStore: input.stateStore,
|
descriptions = await runEnrichmentStage({
|
||||||
runId: input.context.runId,
|
stateStore: input.stateStore,
|
||||||
connectionId: input.connectionId,
|
runId: input.context.runId,
|
||||||
syncId,
|
connectionId: input.connectionId,
|
||||||
mode: input.mode,
|
syncId,
|
||||||
stage: 'descriptions',
|
mode: input.mode,
|
||||||
inputHash,
|
stage: 'descriptions',
|
||||||
now,
|
inputHash: descriptionsHash,
|
||||||
resumedStages: state.resumedStages,
|
now,
|
||||||
completedStages: state.completedStages,
|
forceRecompute: forcesStage('descriptions'),
|
||||||
failedStages: state.failedStages,
|
resumedStages: state.resumedStages,
|
||||||
compute: () =>
|
completedStages: state.completedStages,
|
||||||
generateDescriptions({
|
failedStages: state.failedStages,
|
||||||
snapshot,
|
compute: () =>
|
||||||
connector: input.connector,
|
generateDescriptions({
|
||||||
context: input.context,
|
snapshot,
|
||||||
providers,
|
connector: input.connector,
|
||||||
progress: descriptionProgress,
|
context: input.context,
|
||||||
warnings,
|
providers,
|
||||||
}),
|
inputHash: descriptionsHash,
|
||||||
});
|
resumeStore: input.descriptionResumeStore,
|
||||||
summary.dataDictionary = input.connector.sampleColumn ? 'completed' : 'skipped';
|
progress: descriptionProgress,
|
||||||
summary.tableDescriptions = 'completed';
|
warnings,
|
||||||
summary.columnDescriptions = 'completed';
|
}),
|
||||||
|
});
|
||||||
|
descriptionsRanThisInvocation = true;
|
||||||
|
summary.dataDictionary = input.connector.sampleColumn ? 'completed' : 'skipped';
|
||||||
|
summary.tableDescriptions = 'completed';
|
||||||
|
summary.columnDescriptions = 'completed';
|
||||||
|
}
|
||||||
|
|
||||||
const embeddingProgress = progress?.startPhase(0.2);
|
|
||||||
const embedding = providers.embedding;
|
const embedding = providers.embedding;
|
||||||
if (embedding) {
|
if (embedding && runsStage('embeddings')) {
|
||||||
|
const embeddingProgress = progress?.startPhase(0.2);
|
||||||
|
const embeddingTexts = buildKtxColumnEmbeddingTexts(snapshot, await resolveDownstreamDescriptions());
|
||||||
|
const embeddingsHash = computeKtxEmbeddingsStageHash({
|
||||||
|
snapshot,
|
||||||
|
embeddingIdentity,
|
||||||
|
descriptionDigest: computeKtxScanDescriptionDigest(embeddingTexts.map((item) => item.text)),
|
||||||
|
});
|
||||||
embeddingUpdates = await runEnrichmentStage({
|
embeddingUpdates = await runEnrichmentStage({
|
||||||
stateStore: input.stateStore,
|
stateStore: input.stateStore,
|
||||||
runId: input.context.runId,
|
runId: input.context.runId,
|
||||||
|
|
@@ -553,22 +783,21 @@ export async function runLocalScanEnrichment(
|
||||||
syncId,
|
syncId,
|
||||||
mode: input.mode,
|
mode: input.mode,
|
||||||
stage: 'embeddings',
|
stage: 'embeddings',
|
||||||
inputHash,
|
inputHash: embeddingsHash,
|
||||||
now,
|
now,
|
||||||
|
forceRecompute: forcesStage('embeddings'),
|
||||||
resumedStages: state.resumedStages,
|
resumedStages: state.resumedStages,
|
||||||
completedStages: state.completedStages,
|
completedStages: state.completedStages,
|
||||||
failedStages: state.failedStages,
|
failedStages: state.failedStages,
|
||||||
compute: async () => {
|
compute: async () => {
|
||||||
const embeddings = await buildEmbeddings({
|
const embeddings = await buildEmbeddings({
|
||||||
snapshot,
|
|
||||||
embedding,
|
embedding,
|
||||||
descriptions,
|
texts: embeddingTexts,
|
||||||
progress: embeddingProgress,
|
progress: embeddingProgress,
|
||||||
});
|
});
|
||||||
return embeddings.updates;
|
return embeddings.updates;
|
||||||
},
|
},
|
||||||
});
|
});
|
||||||
schema = snapshotToKtxEnrichedSchema(snapshot, embeddingsByColumnId(embeddingUpdates));
|
|
||||||
summary.embeddings = 'completed';
|
summary.embeddings = 'completed';
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
@@ -577,9 +806,40 @@ export async function runLocalScanEnrichment(
|
||||||
let relationshipProfile: KtxRelationshipProfileArtifact | null = null;
|
let relationshipProfile: KtxRelationshipProfileArtifact | null = null;
|
||||||
let resolvedRelationships: KtxResolvedRelationshipDiscoveryCandidate[] | null = null;
|
let resolvedRelationships: KtxResolvedRelationshipDiscoveryCandidate[] | null = null;
|
||||||
let compositeRelationships: KtxCompositeRelationshipCandidate[] | null = null;
|
let compositeRelationships: KtxCompositeRelationshipCandidate[] | null = null;
|
||||||
|
let relationshipPartial: { reason: KtxRelationshipDetectionStopReason } | null = null;
|
||||||
let relationships: KtxScanRelationshipSummary = { accepted: 0, review: 0, rejected: 0, skipped: 0 };
|
let relationships: KtxScanRelationshipSummary = { accepted: 0, review: 0, rejected: 0, skipped: 0 };
|
||||||
if (shouldDetectRelationships) {
|
|
||||||
|
// Promote the paid descriptions + embeddings to the queryable layer at the
|
||||||
|
// cost boundary, before the slow, kill-prone relationship stage — so an
|
||||||
|
// interrupted relationship stage degrades to "no joins," never "no descriptions."
|
||||||
|
if (shouldDetectRelationships && summary.tableDescriptions === 'completed' && input.onCheckpoint) {
|
||||||
|
await input.onCheckpoint({
|
||||||
|
snapshot,
|
||||||
|
summary: { ...summary },
|
||||||
|
relationships,
|
||||||
|
state: summarizeKtxScanEnrichmentState(state),
|
||||||
|
warnings: [...warnings],
|
||||||
|
descriptionUpdates: descriptions,
|
||||||
|
embeddingUpdates,
|
||||||
|
relationshipUpdate: null,
|
||||||
|
relationshipProfile: null,
|
||||||
|
resolvedRelationships: null,
|
||||||
|
compositeRelationships: null,
|
||||||
|
relationshipPartial: null,
|
||||||
|
});
|
||||||
|
}
|
||||||
|
|
||||||
|
if (shouldDetectRelationships && runsStage('relationships')) {
|
||||||
const relationshipProgress = progress?.startPhase(0.25);
|
const relationshipProgress = progress?.startPhase(0.25);
|
||||||
|
// Relationship detection (incl. llmProposals) runs against the
|
||||||
|
// best-available descriptions + this run's embeddings, so the join-proposal
|
||||||
|
// prompt carries descriptions on both the full-run and relationships-only
|
||||||
|
// paths (D5). Embeddings are this run's only — they are not re-hydrated.
|
||||||
|
const relationshipSchema = snapshotToKtxEnrichedSchema(
|
||||||
|
snapshot,
|
||||||
|
embeddingsByColumnId(embeddingUpdates),
|
||||||
|
await resolveDownstreamDescriptions(),
|
||||||
|
);
|
||||||
const relationshipStage = await runEnrichmentStage({
|
const relationshipStage = await runEnrichmentStage({
|
||||||
stateStore: input.stateStore,
|
stateStore: input.stateStore,
|
||||||
runId: input.context.runId,
|
runId: input.context.runId,
|
||||||
|
|
@@ -587,8 +847,9 @@ export async function runLocalScanEnrichment(
|
||||||
syncId,
|
syncId,
|
||||||
mode: input.mode,
|
mode: input.mode,
|
||||||
stage: 'relationships',
|
stage: 'relationships',
|
||||||
inputHash,
|
inputHash: relationshipsHash,
|
||||||
now,
|
now,
|
||||||
|
forceRecompute: forcesStage('relationships'),
|
||||||
resumedStages: state.resumedStages,
|
resumedStages: state.resumedStages,
|
||||||
completedStages: state.completedStages,
|
completedStages: state.completedStages,
|
||||||
failedStages: state.failedStages,
|
failedStages: state.failedStages,
|
||||||
|
|
@@ -598,10 +859,12 @@ export async function runLocalScanEnrichment(
|
||||||
connectionId: input.connectionId,
|
connectionId: input.connectionId,
|
||||||
dialect,
|
dialect,
|
||||||
connector: input.connector,
|
connector: input.connector,
|
||||||
schema,
|
schema: relationshipSchema,
|
||||||
context: input.context,
|
context: input.context,
|
||||||
settings: relationshipSettings,
|
settings: relationshipSettings,
|
||||||
llmRuntime: input.providers?.llmRuntime ?? null,
|
llmRuntime: input.providers?.llmRuntime ?? null,
|
||||||
|
...(relationshipProgress ? { progress: relationshipProgress } : {}),
|
||||||
|
...(input.now ? { now: () => input.now!().getTime() } : {}),
|
||||||
});
|
});
|
||||||
|
|
||||||
await relationshipProgress?.update(
|
await relationshipProgress?.update(
|
||||||
|
|
@@ -617,6 +880,7 @@ export async function runLocalScanEnrichment(
|
||||||
statisticalValidation: detection.statisticalValidation,
|
statisticalValidation: detection.statisticalValidation,
|
||||||
llmRelationshipValidation: detection.llmRelationshipValidation,
|
llmRelationshipValidation: detection.llmRelationshipValidation,
|
||||||
warnings: detection.warnings,
|
warnings: detection.warnings,
|
||||||
|
partial: detection.partial,
|
||||||
};
|
};
|
||||||
},
|
},
|
||||||
});
|
});
|
||||||
|
|
@@ -629,21 +893,77 @@ export async function runLocalScanEnrichment(
|
||||||
resolvedRelationships = relationshipStage.resolvedRelationships;
|
resolvedRelationships = relationshipStage.resolvedRelationships;
|
||||||
compositeRelationships = relationshipStage.compositeRelationships;
|
compositeRelationships = relationshipStage.compositeRelationships;
|
||||||
relationships = relationshipStage.relationships;
|
relationships = relationshipStage.relationships;
|
||||||
|
relationshipPartial = relationshipStage.partial;
|
||||||
warnings.push(...relationshipStage.warnings);
|
warnings.push(...relationshipStage.warnings);
|
||||||
|
if (relationshipPartial) {
|
||||||
|
warnings.push({
|
||||||
|
code: 'relationship_detection_partial',
|
||||||
|
message:
|
||||||
|
relationshipPartial.reason === 'aborted'
|
||||||
|
? 'Relationship detection was cancelled before completing; the joins found so far are partial.'
|
||||||
|
: 'Relationship detection hit its wall-clock budget (scan.relationships.detectionBudgetMs) before completing; the joins found so far are partial. Raise the budget to run a fuller pass.',
|
||||||
|
recoverable: true,
|
||||||
|
metadata: { reason: relationshipPartial.reason },
|
||||||
|
});
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
// Derived staleness: after a selective run, surface (never silently leave) any
|
||||||
|
// unselected stage whose stored hash no longer matches its current inputs (D4).
|
||||||
|
// The embeddings hash includes the description digest, so a re-describe makes
|
||||||
|
// embeddings diverge here; relationships are deliberately decoupled (D5) and so
|
||||||
|
// never diverge from a description change.
|
||||||
|
if (selectedStages !== undefined && input.stateStore) {
|
||||||
|
const currentStageHash: Record<KtxScanEnrichmentStage, () => Promise<string>> = {
|
||||||
|
descriptions: () => Promise.resolve(descriptionsHash),
|
||||||
|
relationships: () => Promise.resolve(relationshipsHash),
|
||||||
|
embeddings: async () => {
|
||||||
|
const embeddingTexts = buildKtxColumnEmbeddingTexts(snapshot, await resolveDownstreamDescriptions());
|
||||||
|
return computeKtxEmbeddingsStageHash({
|
||||||
|
snapshot,
|
||||||
|
embeddingIdentity,
|
||||||
|
descriptionDigest: computeKtxScanDescriptionDigest(embeddingTexts.map((item) => item.text)),
|
||||||
|
});
|
||||||
|
},
|
||||||
|
};
|
||||||
|
for (const stage of KTX_SCAN_ENRICHMENT_STAGES) {
|
||||||
|
if (selectedStages.includes(stage)) {
|
||||||
|
continue;
|
||||||
|
}
|
||||||
|
const completed = await input.stateStore.findLatestCompletedStage({ connectionId: input.connectionId, stage });
|
||||||
|
if (!completed) {
|
||||||
|
continue;
|
||||||
|
}
|
||||||
|
if (completed.inputHash !== (await currentStageHash[stage]())) {
|
||||||
|
warnings.push({
|
||||||
|
code: 'enrichment_stage_stale',
|
||||||
|
message: `The ${stage} enrichment stage is now stale: its inputs changed since it last ran. Refresh it with \`ktx ingest ${input.connectionId} --stages ${stage}\`.`,
|
||||||
|
recoverable: true,
|
||||||
|
metadata: { stage },
|
||||||
|
});
|
||||||
|
}
|
||||||
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
await progress?.update(1, 'Enrichment complete');
|
await progress?.update(1, 'Enrichment complete');
|
||||||
|
// The manifest merge treats ai/db descriptions as scan-managed and overwrites
|
||||||
|
// them with whatever this run emits, so a subset run that skips descriptions
|
||||||
|
// must still emit the prior on-disk ones — else the write deletes them (D3
|
||||||
|
// "unselected stages are left untouched on disk"). Fresh-this-run if descriptions
|
||||||
|
// ran, else loaded from the on-disk _schema.
|
||||||
|
const writtenDescriptionUpdates = await resolveDownstreamDescriptions();
|
||||||
return {
|
return {
|
||||||
snapshot,
|
snapshot,
|
||||||
summary,
|
summary,
|
||||||
relationships,
|
relationships,
|
||||||
state: summarizeKtxScanEnrichmentState(state),
|
state: summarizeKtxScanEnrichmentState(state),
|
||||||
warnings,
|
warnings,
|
||||||
descriptionUpdates: descriptions,
|
descriptionUpdates: writtenDescriptionUpdates,
|
||||||
embeddingUpdates,
|
embeddingUpdates,
|
||||||
relationshipUpdate,
|
relationshipUpdate,
|
||||||
relationshipProfile,
|
relationshipProfile,
|
||||||
resolvedRelationships,
|
resolvedRelationships,
|
||||||
compositeRelationships,
|
compositeRelationships,
|
||||||
|
relationshipPartial,
|
||||||
};
|
};
|
||||||
}
|
}
|
||||||
|
|
|
||||||
|
|
@ -6,25 +6,36 @@ import { getLocalStageOnlyIngestStatus, type LocalIngestRunRecord, runLocalStage
|
||||||
import type { SourceAdapter } from '../../context/ingest/types.js';
|
import type { SourceAdapter } from '../../context/ingest/types.js';
|
||||||
import { createLocalKtxLlmRuntimeFromConfig } from '../../context/llm/local-config.js';
|
import { createLocalKtxLlmRuntimeFromConfig } from '../../context/llm/local-config.js';
|
||||||
import { KtxScanEmbeddingPortAdapter } from '../../context/llm/embedding-port.js';
|
import { KtxScanEmbeddingPortAdapter } from '../../context/llm/embedding-port.js';
|
||||||
import type { KtxProjectLlmConfig, KtxScanEnrichmentConfig, KtxScanRelationshipConfig } from '../project/config.js';
|
import type { KtxProjectLlmConfig, KtxScanEnrichmentConfig } from '../project/config.js';
|
||||||
import type { KtxLocalProject } from '../../context/project/project.js';
|
import type { KtxLocalProject } from '../../context/project/project.js';
|
||||||
import { ktxLocalStateDbPath } from '../project/local-state-db.js';
|
import { ktxLocalStateDbPath } from '../project/local-state-db.js';
|
||||||
import { redactKtxScanReport } from './credentials.js';
|
import { redactKtxScanReport } from './credentials.js';
|
||||||
import { resolveEnabledTables } from './enabled-tables.js';
|
import { resolveEnabledTables } from './enabled-tables.js';
|
||||||
import { completedKtxScanEnrichmentStateSummary } from './enrichment-state.js';
|
import {
|
||||||
|
completedKtxScanEnrichmentStateSummary,
|
||||||
|
type KtxScanEmbeddingIdentity,
|
||||||
|
type KtxScanLlmIdentity,
|
||||||
|
} from './enrichment-state.js';
|
||||||
import { failedKtxScanEnrichmentSummary, ktxScanErrorMessage } from './enrichment-summary.js';
|
import { failedKtxScanEnrichmentSummary, ktxScanErrorMessage } from './enrichment-summary.js';
|
||||||
import {
|
import {
|
||||||
createDeterministicLocalScanEnrichmentProviders,
|
createDeterministicLocalScanEnrichmentProviders,
|
||||||
type KtxLocalScanEnrichmentProviders,
|
type KtxLocalScanEnrichmentProviders,
|
||||||
runLocalScanEnrichment,
|
runLocalScanEnrichment,
|
||||||
} from './local-enrichment.js';
|
} from './local-enrichment.js';
|
||||||
import { writeLocalScanEnrichmentArtifacts, writeLocalScanManifestShards } from './local-enrichment-artifacts.js';
|
import {
|
||||||
|
createKtxScanDescriptionResumeStore,
|
||||||
|
loadOnDiskDescriptionUpdates,
|
||||||
|
writeLocalScanEnrichmentArtifacts,
|
||||||
|
writeLocalScanEnrichmentCheckpoint,
|
||||||
|
writeLocalScanManifestShards,
|
||||||
|
} from './local-enrichment-artifacts.js';
|
||||||
import { readLocalScanStructuralSnapshot } from './local-structural-artifacts.js';
|
import { readLocalScanStructuralSnapshot } from './local-structural-artifacts.js';
|
||||||
import { SqliteLocalScanEnrichmentStateStore } from './sqlite-local-enrichment-state-store.js';
|
import { SqliteLocalScanEnrichmentStateStore } from './sqlite-local-enrichment-state-store.js';
|
||||||
import type {
|
import type {
|
||||||
KtxConnectionDriver,
|
KtxConnectionDriver,
|
||||||
KtxProgressPort,
|
KtxProgressPort,
|
||||||
KtxScanConnector,
|
KtxScanConnector,
|
||||||
|
KtxScanEnrichmentStage,
|
||||||
KtxScanEnrichmentStateSummary,
|
KtxScanEnrichmentStateSummary,
|
||||||
KtxScanMode,
|
KtxScanMode,
|
||||||
KtxScanReport,
|
KtxScanReport,
|
||||||
|
|
@ -68,6 +79,8 @@ export interface RunLocalScanOptions {
|
||||||
connectionId: string;
|
connectionId: string;
|
||||||
mode?: KtxScanMode;
|
mode?: KtxScanMode;
|
||||||
detectRelationships?: boolean;
|
detectRelationships?: boolean;
|
||||||
|
/** Enrichment stages to (re)run; omit to run all eligible stages. */
|
||||||
|
stages?: KtxScanEnrichmentStage[];
|
||||||
dryRun?: boolean;
|
dryRun?: boolean;
|
||||||
trigger?: KtxScanTrigger;
|
trigger?: KtxScanTrigger;
|
||||||
databaseIntrospectionUrl?: string;
|
databaseIntrospectionUrl?: string;
|
||||||
|
|
@ -80,6 +93,7 @@ export interface RunLocalScanOptions {
|
||||||
enrichmentStateStore?: SqliteLocalScanEnrichmentStateStore | null;
|
enrichmentStateStore?: SqliteLocalScanEnrichmentStateStore | null;
|
||||||
progress?: KtxProgressPort;
|
progress?: KtxProgressPort;
|
||||||
embeddingProvider?: KtxEmbeddingProvider | null;
|
embeddingProvider?: KtxEmbeddingProvider | null;
|
||||||
|
signal?: AbortSignal;
|
||||||
}
|
}
|
||||||
|
|
||||||
export interface LocalScanRunResult {
|
export interface LocalScanRunResult {
|
||||||
|
|
@ -233,19 +247,18 @@ function createLocalScanEnrichmentStateStore(options: RunLocalScanOptions): Sqli
|
||||||
return new SqliteLocalScanEnrichmentStateStore({ dbPath: ktxLocalStateDbPath(options.project) });
|
return new SqliteLocalScanEnrichmentStateStore({ dbPath: ktxLocalStateDbPath(options.project) });
|
||||||
}
|
}
|
||||||
|
|
||||||
function localScanProviderIdentity(
|
function localScanLlmIdentity(llmConfig: KtxProjectLlmConfig): KtxScanLlmIdentity {
|
||||||
config: KtxScanEnrichmentConfig,
|
|
||||||
llmConfig: KtxProjectLlmConfig,
|
|
||||||
relationships: KtxScanRelationshipConfig,
|
|
||||||
): Record<string, unknown> {
|
|
||||||
return {
|
return {
|
||||||
mode: config.mode,
|
model: llmConfig.models.default ?? null,
|
||||||
embeddingDimensions: config.embeddings?.dimensions ?? null,
|
|
||||||
llmModel: llmConfig.models.default ?? null,
|
|
||||||
embeddingModel: config.embeddings?.model ?? null,
|
|
||||||
batchSize: config.embeddings?.batchSize ?? null,
|
|
||||||
baseUrlConfigured: Boolean(llmConfig.provider.gateway?.base_url),
|
baseUrlConfigured: Boolean(llmConfig.provider.gateway?.base_url),
|
||||||
relationships,
|
};
|
||||||
|
}
|
||||||
|
|
||||||
|
function localScanEmbeddingIdentity(config: KtxScanEnrichmentConfig): KtxScanEmbeddingIdentity {
|
||||||
|
return {
|
||||||
|
model: config.embeddings?.model ?? null,
|
||||||
|
dimensions: config.embeddings?.dimensions ?? null,
|
||||||
|
batchSize: config.embeddings?.batchSize ?? null,
|
||||||
};
|
};
|
||||||
}
|
}
|
||||||
|
|
||||||
|
|
@ -458,6 +471,13 @@ export async function runLocalScan(options: RunLocalScanOptions): Promise<LocalS
|
||||||
const enrichmentStateStore = connector ? createLocalScanEnrichmentStateStore(options) : null;
|
const enrichmentStateStore = connector ? createLocalScanEnrichmentStateStore(options) : null;
|
||||||
let enrichmentState: KtxScanEnrichmentStateSummary = completedKtxScanEnrichmentStateSummary();
|
let enrichmentState: KtxScanEnrichmentStateSummary = completedKtxScanEnrichmentStateSummary();
|
||||||
let enrichmentSnapshot: KtxSchemaSnapshot | null = null;
|
let enrichmentSnapshot: KtxSchemaSnapshot | null = null;
|
||||||
|
// On a `--stages` subset run, the structural manifest write below (and the
|
||||||
|
// later enrichment write) merge with on-disk shards, but the merge treats ai/db
|
||||||
|
// descriptions as scan-managed and overwrites them with whatever the run emits.
|
||||||
|
// A subset that skips `descriptions` emits none, so without this the structural
|
||||||
|
// write would delete the prior descriptions before enrichment can preserve them.
|
||||||
|
// Capture them up front (only for subset runs) and feed them to both writes.
|
||||||
|
let priorDescriptionUpdates: Awaited<ReturnType<typeof loadOnDiskDescriptionUpdates>> | null = null;
|
||||||
if (!reusedExistingScanArtifacts && !report.dryRun && report.artifactPaths.rawSourcesDir) {
|
if (!reusedExistingScanArtifacts && !report.dryRun && report.artifactPaths.rawSourcesDir) {
|
||||||
await options.progress?.update(0.7, 'Writing schema artifacts');
|
await options.progress?.update(0.7, 'Writing schema artifacts');
|
||||||
const rawSnapshot = await readLocalScanStructuralSnapshot({
|
const rawSnapshot = await readLocalScanStructuralSnapshot({
|
||||||
|
|
@ -471,12 +491,20 @@ export async function runLocalScan(options: RunLocalScanOptions): Promise<LocalS
|
||||||
if (rawSnapshot.warnings?.length) {
|
if (rawSnapshot.warnings?.length) {
|
||||||
report.warnings.push(...rawSnapshot.warnings);
|
report.warnings.push(...rawSnapshot.warnings);
|
||||||
}
|
}
|
||||||
|
if (options.stages !== undefined && connector) {
|
||||||
|
priorDescriptionUpdates = await loadOnDiskDescriptionUpdates(
|
||||||
|
options.project,
|
||||||
|
options.connectionId,
|
||||||
|
rawSnapshot,
|
||||||
|
);
|
||||||
|
}
|
||||||
const manifestArtifacts = await writeLocalScanManifestShards({
|
const manifestArtifacts = await writeLocalScanManifestShards({
|
||||||
project: options.project,
|
project: options.project,
|
||||||
connectionId: options.connectionId,
|
connectionId: options.connectionId,
|
||||||
syncId: record.syncId,
|
syncId: record.syncId,
|
||||||
driver,
|
driver,
|
||||||
snapshot: rawSnapshot,
|
snapshot: rawSnapshot,
|
||||||
|
...(priorDescriptionUpdates ? { descriptionUpdates: priorDescriptionUpdates } : {}),
|
||||||
dryRun: false,
|
dryRun: false,
|
||||||
});
|
});
|
||||||
report.artifactPaths.manifestShards = manifestArtifacts.manifestShards;
|
report.artifactPaths.manifestShards = manifestArtifacts.manifestShards;
|
||||||
|
|
@ -494,19 +522,43 @@ export async function runLocalScan(options: RunLocalScanOptions): Promise<LocalS
|
||||||
connectionId: options.connectionId,
|
connectionId: options.connectionId,
|
||||||
mode,
|
mode,
|
||||||
detectRelationships: options.detectRelationships,
|
detectRelationships: options.detectRelationships,
|
||||||
|
...(options.stages ? { stages: options.stages } : {}),
|
||||||
connector,
|
connector,
|
||||||
...(enrichmentSnapshot ? { snapshot: enrichmentSnapshot } : {}),
|
...(enrichmentSnapshot ? { snapshot: enrichmentSnapshot } : {}),
|
||||||
context: { runId: record.runId, progress: options.progress?.startPhase(0.18) },
|
context: {
|
||||||
|
runId: record.runId,
|
||||||
|
...(options.signal ? { signal: options.signal } : {}),
|
||||||
|
...(options.progress ? { progress: options.progress.startPhase(0.18) } : {}),
|
||||||
|
},
|
||||||
providers: enrichmentProviders,
|
providers: enrichmentProviders,
|
||||||
stateStore: enrichmentStateStore,
|
stateStore: enrichmentStateStore,
|
||||||
|
descriptionResumeStore: options.dryRun
|
||||||
|
? null
|
||||||
|
: createKtxScanDescriptionResumeStore({
|
||||||
|
project: options.project,
|
||||||
|
connectionId: options.connectionId,
|
||||||
|
syncId: record.syncId,
|
||||||
|
driver,
|
||||||
|
}),
|
||||||
syncId: record.syncId,
|
syncId: record.syncId,
|
||||||
providerIdentity: localScanProviderIdentity(
|
loadPriorDescriptions: (enrichedSnapshot) =>
|
||||||
options.project.config.scan.enrichment,
|
priorDescriptionUpdates
|
||||||
options.project.config.llm,
|
? Promise.resolve(priorDescriptionUpdates)
|
||||||
options.project.config.scan.relationships,
|
: loadOnDiskDescriptionUpdates(options.project, options.connectionId, enrichedSnapshot),
|
||||||
),
|
llmIdentity: localScanLlmIdentity(options.project.config.llm),
|
||||||
|
embeddingIdentity: localScanEmbeddingIdentity(options.project.config.scan.enrichment),
|
||||||
relationshipSettings: options.project.config.scan.relationships,
|
relationshipSettings: options.project.config.scan.relationships,
|
||||||
now: options.now,
|
now: options.now,
|
||||||
|
onCheckpoint: async (checkpoint) => {
|
||||||
|
await writeLocalScanEnrichmentCheckpoint({
|
||||||
|
project: options.project,
|
||||||
|
connectionId: options.connectionId,
|
||||||
|
syncId: record.syncId,
|
||||||
|
driver,
|
||||||
|
enrichment: checkpoint,
|
||||||
|
dryRun: options.dryRun ?? false,
|
||||||
|
});
|
||||||
|
},
|
||||||
});
|
});
|
||||||
const artifacts = await writeLocalScanEnrichmentArtifacts({
|
const artifacts = await writeLocalScanEnrichmentArtifacts({
|
||||||
project: options.project,
|
project: options.project,
|
||||||
|
|
|
||||||
|
|
@ -45,8 +45,14 @@ const scanWarningCodes = new Set<KtxScanWarning['code']>([
|
||||||
'enrichment_failed',
|
'enrichment_failed',
|
||||||
'description_fallback_used',
|
'description_fallback_used',
|
||||||
'constraint_discovery_unauthorized',
|
'constraint_discovery_unauthorized',
|
||||||
|
'object_introspection_failed',
|
||||||
]);
|
]);
|
||||||
|
|
||||||
|
/** @internal */
|
||||||
|
export function isKtxScanWarningCode(code: string): code is KtxScanWarning['code'] {
|
||||||
|
return scanWarningCodes.has(code as KtxScanWarning['code']);
|
||||||
|
}
|
||||||
|
|
||||||
function parseWarning(rawWarning: unknown, path: string): KtxScanWarning {
|
function parseWarning(rawWarning: unknown, path: string): KtxScanWarning {
|
||||||
if (
|
if (
|
||||||
!isRecord(rawWarning) ||
|
!isRecord(rawWarning) ||
|
||||||
|
|
|
||||||
50
packages/cli/src/context/scan/object-introspection.ts
Normal file
|
|
@ -0,0 +1,50 @@
|
||||||
|
import { isNativeProgrammingFault } from '../../errors.js';
|
||||||
|
import type { KtxScanWarning } from './types.js';
|
||||||
|
|
||||||
|
export interface IntrospectObjectContext {
|
||||||
|
/** Bare object name (table or view). */
|
||||||
|
object: string;
|
||||||
|
catalog?: string | null;
|
||||||
|
db?: string | null;
|
||||||
|
}
|
||||||
|
|
||||||
|
export type IntrospectObjectOutcome<T> = { ok: true; table: T } | { ok: false; warning: KtxScanWarning };
|
||||||
|
|
||||||
|
function objectLabel(ctx: IntrospectObjectContext): string {
|
||||||
|
return [ctx.catalog, ctx.db, ctx.object].filter((part): part is string => Boolean(part)).join('.');
|
||||||
|
}
|
||||||
|
|
||||||
|
function objectIntrospectionWarning(ctx: IntrospectObjectContext, error: unknown): KtxScanWarning {
|
||||||
|
const reason = error instanceof Error ? error.message : String(error);
|
||||||
|
return {
|
||||||
|
code: 'object_introspection_failed',
|
||||||
|
message: reason,
|
||||||
|
table: ctx.object,
|
||||||
|
recoverable: true,
|
||||||
|
metadata: {
|
||||||
|
object: objectLabel(ctx),
|
||||||
|
...(ctx.db ? { db: ctx.db } : {}),
|
||||||
|
...(ctx.catalog ? { catalog: ctx.catalog } : {}),
|
||||||
|
},
|
||||||
|
};
|
||||||
|
}
|
||||||
|
|
||||||
|
/**
|
||||||
|
* Runs a single-object metadata/profiling read and isolates its failure: a
|
||||||
|
* broken or inaccessible object becomes a recoverable warning instead of
|
||||||
|
* aborting the whole scan. Native programming faults (a ktx bug, not a broken
|
||||||
|
* object) still propagate so they are not masked as object skips.
|
||||||
|
*/
|
||||||
|
export async function tryIntrospectObject<T>(
|
||||||
|
ctx: IntrospectObjectContext,
|
||||||
|
fn: () => T | Promise<T>,
|
||||||
|
): Promise<IntrospectObjectOutcome<T>> {
|
||||||
|
try {
|
||||||
|
return { ok: true, table: await fn() };
|
||||||
|
} catch (error) {
|
||||||
|
if (isNativeProgrammingFault(error)) {
|
||||||
|
throw error;
|
||||||
|
}
|
||||||
|
return { ok: false, warning: objectIntrospectionWarning(ctx, error) };
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
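Reviewer note: the isolation contract of `tryIntrospectObject` in the new file above can be sketched as follows. This is a minimal, self-contained illustration, not the connector code from this commit. `ScanWarning`, `tryIntrospect`, `readColumns`, and `scanObjects` are hypothetical stand-ins, and `isProgrammingFault` is an assumed approximation of `isNativeProgrammingFault`, whose real implementation in `../../errors.js` is not shown in this diff.

```typescript
// Local stand-ins so the sketch runs on its own; the real types are in the diff above.
interface ScanWarning {
  code: 'object_introspection_failed';
  message: string;
  table: string;
  recoverable: true;
}
type Outcome<T> = { ok: true; table: T } | { ok: false; warning: ScanWarning };

// ASSUMPTION: approximates isNativeProgrammingFault, which this diff does not show.
function isProgrammingFault(error: unknown): boolean {
  return error instanceof TypeError || error instanceof ReferenceError || error instanceof RangeError;
}

// Same shape as tryIntrospectObject: a failure becomes a recoverable warning,
// while a programming fault is rethrown so it is not masked as an object skip.
async function tryIntrospect<T>(object: string, fn: () => T | Promise<T>): Promise<Outcome<T>> {
  try {
    return { ok: true, table: await fn() };
  } catch (error) {
    if (isProgrammingFault(error)) {
      throw error;
    }
    const message = error instanceof Error ? error.message : String(error);
    return { ok: false, warning: { code: 'object_introspection_failed', message, table: object, recoverable: true } };
  }
}

// Hypothetical per-object read: one view is broken, the others succeed.
async function readColumns(object: string): Promise<string[]> {
  if (object === 'broken_view') {
    throw new Error('view references a dropped table');
  }
  return [`${object}.id`];
}

// The scan keeps going past the broken object and reports it as a warning.
async function scanObjects(objects: string[]): Promise<{ tables: string[][]; warnings: ScanWarning[] }> {
  const tables: string[][] = [];
  const warnings: ScanWarning[] = [];
  for (const object of objects) {
    const outcome = await tryIntrospect(object, () => readColumns(object));
    if (outcome.ok) {
      tables.push(outcome.table);
    } else {
      warnings.push(outcome.warning);
    }
  }
  return { tables, warnings };
}
```

Returning a discriminated union rather than throwing lets each caller decide per object whether to push a table or a warning, which is presumably why the helper exposes `ok` instead of returning `T | undefined`.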
@ -1,10 +1,11 @@
|
||||||
import type { KtxSqlDialect } from '../connections/dialects.js';
|
import type { KtxSqlDialect } from '../connections/dialects.js';
|
||||||
import type { KtxEnrichedColumn, KtxEnrichedSchema, KtxEnrichedTable, KtxRelationshipType } from './enrichment-types.js';
|
import type { KtxEnrichedColumn, KtxEnrichedSchema, KtxEnrichedTable, KtxRelationshipType } from './enrichment-types.js';
|
||||||
|
import type { KtxRelationshipDetectionBudget } from './relationship-detection-budget.js';
|
||||||
import {
|
import {
|
||||||
type KtxRelationshipProfileArtifact,
|
type KtxRelationshipProfileArtifact,
|
||||||
type KtxRelationshipReadOnlyExecutor,
|
type KtxRelationshipReadOnlyExecutor,
|
||||||
} from './relationship-profiling.js';
|
} from './relationship-profiling.js';
|
||||||
import type { KtxQueryResult, KtxScanContext, KtxTableRef } from './types.js';
|
import type { KtxProgressPort, KtxQueryResult, KtxScanContext, KtxTableRef } from './types.js';
|
||||||
|
|
||||||
type KtxCompositeRelationshipStatus = 'accepted' | 'review' | 'rejected';
|
type KtxCompositeRelationshipStatus = 'accepted' | 'review' | 'rejected';
|
||||||
|
|
||||||
|
|
@ -66,6 +67,8 @@ export interface DiscoverKtxCompositeRelationshipsInput {
|
||||||
minPrimaryKeyUniqueness?: number;
|
minPrimaryKeyUniqueness?: number;
|
||||||
minSourceCoverage?: number;
|
minSourceCoverage?: number;
|
||||||
maxViolationRatio?: number;
|
maxViolationRatio?: number;
|
||||||
|
budget?: KtxRelationshipDetectionBudget;
|
||||||
|
progress?: KtxProgressPort;
|
||||||
}
|
}
|
||||||
|
|
||||||
export interface DiscoverKtxCompositeRelationshipsResult {
|
export interface DiscoverKtxCompositeRelationshipsResult {
|
||||||
|
|
@ -536,7 +539,13 @@ export async function discoverKtxCompositeRelationships(
|
||||||
const primaryKeys: KtxCompositePrimaryKeyCandidate[] = [];
|
const primaryKeys: KtxCompositePrimaryKeyCandidate[] = [];
|
||||||
let queryCount = 0;
|
let queryCount = 0;
|
||||||
|
|
||||||
for (const table of tables) {
|
for (const [index, table] of tables.entries()) {
|
||||||
|
if (input.budget?.check()) {
|
||||||
|
break;
|
||||||
|
}
|
||||||
|
await input.progress?.update((index + 1) / tables.length, `Probing composite keys ${index + 1}/${tables.length}`, {
|
||||||
|
transient: true,
|
||||||
|
});
|
||||||
const result = await detectCompositePrimaryKeys({
|
const result = await detectCompositePrimaryKeys({
|
||||||
connectionId: input.connectionId,
|
connectionId: input.connectionId,
|
||||||
dialect: input.dialect,
|
dialect: input.dialect,
|
||||||
|
|
@ -554,6 +563,9 @@ export async function discoverKtxCompositeRelationships(
|
||||||
|
|
||||||
const relationships: KtxCompositeRelationshipCandidate[] = [];
|
const relationships: KtxCompositeRelationshipCandidate[] = [];
|
||||||
for (const targetKey of primaryKeys) {
|
for (const targetKey of primaryKeys) {
|
||||||
|
if (input.budget?.check()) {
|
||||||
|
break;
|
||||||
|
}
|
||||||
const targetTable = tableByName.get(targetKey.table.name);
|
const targetTable = tableByName.get(targetKey.table.name);
|
||||||
if (!targetTable) {
|
if (!targetTable) {
|
||||||
continue;
|
continue;
|
||||||
|
|
@ -568,6 +580,9 @@ export async function discoverKtxCompositeRelationships(
|
||||||
}
|
}
|
||||||
|
|
||||||
for (const sourceTable of tables) {
|
for (const sourceTable of tables) {
|
||||||
|
if (input.budget?.check()) {
|
||||||
|
break;
|
||||||
|
}
|
||||||
if (sourceTable.id === targetTable.id) {
|
if (sourceTable.id === targetTable.id) {
|
||||||
continue;
|
continue;
|
||||||
}
|
}
|
||||||
|
|
|
||||||
|
|
@ -0,0 +1,93 @@
|
||||||
|
export type KtxRelationshipDetectionStopReason = 'budget' | 'aborted';
|
||||||
|
|
||||||
|
export interface KtxRelationshipDetectionBudget {
|
||||||
|
/**
|
||||||
|
* Returns a stop reason when the relationship stage must stop scheduling new
|
||||||
|
* work, else null. Calling it at a unit boundary records the first observed
|
||||||
|
* stop so the stage can be finalized as partial.
|
||||||
|
*/
|
||||||
|
check(): KtxRelationshipDetectionStopReason | null;
|
||||||
|
/** The first stop reason observed via check(), or null if the stage ran to completion. */
|
||||||
|
stopReason(): KtxRelationshipDetectionStopReason | null;
|
||||||
|
}
|
||||||
|
|
||||||
|
export interface CreateKtxRelationshipDetectionBudgetInput {
|
||||||
|
budgetMs: number;
|
||||||
|
signal?: AbortSignal;
|
||||||
|
now?: () => number;
|
||||||
|
}
|
||||||
|
|
||||||
|
export function createKtxRelationshipDetectionBudget(
|
||||||
|
input: CreateKtxRelationshipDetectionBudgetInput,
|
||||||
|
): KtxRelationshipDetectionBudget {
|
||||||
|
const now = input.now ?? (() => Date.now());
|
||||||
|
const deadline = now() + Math.max(0, input.budgetMs);
|
||||||
|
let tripped: KtxRelationshipDetectionStopReason | null = null;
|
||||||
|
return {
|
||||||
|
check() {
|
||||||
|
if (input.signal?.aborted) {
|
||||||
|
tripped = 'aborted';
|
||||||
|
return 'aborted';
|
||||||
|
}
|
||||||
|
if (now() >= deadline) {
|
||||||
|
tripped ??= 'budget';
|
||||||
|
return 'budget';
|
||||||
|
}
|
||||||
|
return null;
|
||||||
|
},
|
||||||
|
stopReason() {
|
||||||
|
return tripped;
|
||||||
|
},
|
||||||
|
};
|
||||||
|
}
|
||||||
|
|
||||||
|
export interface MapWithBudgetInput<TInput, TOutput> {
|
||||||
|
inputs: readonly TInput[];
|
||||||
|
concurrency: number;
|
||||||
|
budget?: KtxRelationshipDetectionBudget;
|
||||||
|
onStart?: (index: number, total: number, item: TInput) => Promise<void> | void;
|
||||||
|
mapOne: (item: TInput, index: number) => Promise<TOutput>;
|
||||||
|
}
|
||||||
|
|
||||||
|
export interface MapWithBudgetResult<TOutput> {
|
||||||
|
/** Output aligned with inputs; entries skipped on budget exhaustion are undefined. */
|
||||||
|
results: Array<TOutput | undefined>;
|
||||||
|
processedCount: number;
|
||||||
|
}
|
||||||
|
|
||||||
|
/**
|
||||||
|
* Concurrent map that stops claiming new items once the budget trips. In-flight
|
||||||
|
* items finish; pending items are left undefined. With no budget it is a plain
|
||||||
|
* bounded-concurrency map.
|
||||||
|
*/
|
||||||
|
export async function mapWithBudget<TInput, TOutput>(
|
||||||
|
input: MapWithBudgetInput<TInput, TOutput>,
|
||||||
|
): Promise<MapWithBudgetResult<TOutput>> {
|
||||||
|
const total = input.inputs.length;
|
||||||
|
const results: Array<TOutput | undefined> = new Array(total);
|
||||||
|
const safeConcurrency = Math.max(1, Math.floor(input.concurrency));
|
||||||
|
let nextIndex = 0;
|
||||||
|
let processedCount = 0;
|
||||||
|
|
||||||
|
async function worker(): Promise<void> {
|
||||||
|
while (true) {
|
||||||
|
const index = nextIndex;
|
||||||
|
if (index >= total) {
|
||||||
|
return;
|
||||||
|
}
|
||||||
|
// Check the budget only when work remains, so a deadline that elapses
|
||||||
|
// after the last item is claimed never marks a fully-processed stage partial.
|
||||||
|
if (input.budget?.check()) {
|
||||||
|
return;
|
||||||
|
}
|
||||||
|
nextIndex += 1;
|
||||||
|
const item = input.inputs[index] as TInput;
|
||||||
|
await input.onStart?.(index, total, item);
|
||||||
|
results[index] = await input.mapOne(item, index);
|
||||||
|
processedCount += 1;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
await Promise.all(Array.from({ length: Math.min(safeConcurrency, total) }, () => worker()));
|
||||||
|
return { results, processedCount };
|
||||||
|
}
|
||||||
|
|
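Reviewer note: a small sketch of how the budget added above behaves at unit boundaries, using an injected clock so the outcome is deterministic. The factory body is condensed from `relationship-detection-budget.ts` in this diff. The driver loop, the table names, the 40 ms per-table cost, and the structural `signal` type (used in place of `AbortSignal` to avoid depending on a particular type library) are assumptions made for illustration only.

```typescript
type StopReason = 'budget' | 'aborted';

interface Budget {
  check(): StopReason | null;
  stopReason(): StopReason | null;
}

// Condensed from createKtxRelationshipDetectionBudget in the diff above.
function createBudget(input: {
  budgetMs: number;
  signal?: { readonly aborted: boolean }; // structural stand-in for AbortSignal
  now?: () => number;
}): Budget {
  const now = input.now ?? (() => Date.now());
  const deadline = now() + Math.max(0, input.budgetMs);
  let tripped: StopReason | null = null;
  return {
    check() {
      if (input.signal?.aborted) {
        tripped = 'aborted';
        return 'aborted';
      }
      if (now() >= deadline) {
        tripped ??= 'budget';
        return 'budget';
      }
      return null;
    },
    stopReason() {
      return tripped;
    },
  };
}

// Hypothetical driver: a fake clock advances 40 ms per table against a 100 ms budget.
let fakeNow = 0;
const budget = createBudget({ budgetMs: 100, now: () => fakeNow });
const tables = ['orders', 'customers', 'payments', 'refunds', 'audit_log'];
const probed: string[] = [];

for (const table of tables) {
  if (budget.check()) {
    break; // stop scheduling new work; work already done is kept
  }
  probed.push(table);
  fakeNow += 40; // pretend the probe took 40 ms
}

// Mirrors how the stage derives its `partial` marker from the first observed stop.
const reason = budget.stopReason();
const partial = reason ? { reason } : null;
console.log(probed, partial);
```

With these numbers the loop probes three tables (clock at 0, 40, and 80 ms) and stops before the fourth, when the clock reads 120 ms, so `partial` is `{ reason: 'budget' }`. Passing `now` in rather than calling `Date.now()` directly is what makes this kind of test possible without real timers.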
@ -79,6 +79,8 @@ export interface KtxRelationshipDiagnosticsArtifact {
|
||||||
generatedAt: string;
|
generatedAt: string;
|
||||||
summary: KtxRelationshipDiagnosticsSummary;
|
summary: KtxRelationshipDiagnosticsSummary;
|
||||||
noAcceptedReason: string | null;
|
noAcceptedReason: string | null;
|
||||||
|
partial: boolean;
|
||||||
|
partialReason: string | null;
|
||||||
candidateCountsBySource: Record<string, number>;
|
candidateCountsBySource: Record<string, number>;
|
||||||
validation: KtxRelationshipDiagnosticsValidation;
|
validation: KtxRelationshipDiagnosticsValidation;
|
||||||
thresholds: KtxRelationshipDiagnosticsThresholds;
|
thresholds: KtxRelationshipDiagnosticsThresholds;
|
||||||
|
|
@ -101,6 +103,7 @@ export interface BuildKtxRelationshipDiagnosticsInput {
|
||||||
warnings?: readonly KtxScanWarning[];
|
warnings?: readonly KtxScanWarning[];
|
||||||
thresholds?: Partial<KtxRelationshipDiagnosticsThresholds>;
|
thresholds?: Partial<KtxRelationshipDiagnosticsThresholds>;
|
||||||
policy?: Partial<KtxRelationshipDiagnosticsPolicy>;
|
policy?: Partial<KtxRelationshipDiagnosticsPolicy>;
|
||||||
|
partial?: { reason: string } | null;
|
||||||
generatedAt?: string;
|
generatedAt?: string;
|
||||||
}
|
}
|
||||||
|
|
||||||
|
|
@ -352,6 +355,8 @@ export function buildKtxRelationshipDiagnostics(
|
||||||
generatedAt: input.generatedAt ?? new Date().toISOString(),
|
generatedAt: input.generatedAt ?? new Date().toISOString(),
|
||||||
summary,
|
summary,
|
||||||
noAcceptedReason: noAcceptedReason({ artifacts: input.artifacts, profile: input.profile }),
|
noAcceptedReason: noAcceptedReason({ artifacts: input.artifacts, profile: input.profile }),
|
||||||
|
partial: Boolean(input.partial),
|
||||||
|
partialReason: input.partial?.reason ?? null,
|
||||||
candidateCountsBySource: candidateCountsBySource(input.artifacts),
|
candidateCountsBySource: candidateCountsBySource(input.artifacts),
|
||||||
validation: {
|
validation: {
|
||||||
available: input.profile.sqlAvailable,
|
available: input.profile.sqlAvailable,
|
||||||
|
|
|
||||||
|
|
@ -11,6 +11,11 @@ import {
|
||||||
discoverKtxCompositeRelationships,
|
discoverKtxCompositeRelationships,
|
||||||
type KtxCompositeRelationshipCandidate,
|
type KtxCompositeRelationshipCandidate,
|
||||||
} from './relationship-composite-candidates.js';
|
} from './relationship-composite-candidates.js';
|
||||||
|
import {
|
||||||
|
createKtxRelationshipDetectionBudget,
|
||||||
|
type KtxRelationshipDetectionBudget,
|
||||||
|
type KtxRelationshipDetectionStopReason,
|
||||||
|
} from './relationship-detection-budget.js';
|
||||||
import { collectKtxFormalMetadataRelationships } from './relationship-formal-metadata.js';
|
import { collectKtxFormalMetadataRelationships } from './relationship-formal-metadata.js';
|
||||||
import {
|
import {
|
||||||
type KtxResolvedRelationshipDiscoveryCandidate,
|
type KtxResolvedRelationshipDiscoveryCandidate,
|
||||||
|
|
@ -25,6 +30,7 @@ import {
|
||||||
} from './relationship-profiling.js';
|
} from './relationship-profiling.js';
|
||||||
import { validateKtxRelationshipDiscoveryCandidates } from './relationship-validation.js';
|
import { validateKtxRelationshipDiscoveryCandidates } from './relationship-validation.js';
|
||||||
import type {
|
import type {
|
||||||
|
KtxProgressPort,
|
||||||
KtxScanConnector,
|
KtxScanConnector,
|
||||||
KtxScanContext,
|
KtxScanContext,
|
||||||
KtxScanEnrichmentSummary,
|
KtxScanEnrichmentSummary,
|
||||||
|
|
@ -40,6 +46,8 @@ export interface DiscoverKtxRelationshipsInput {
|
||||||
context: KtxScanContext;
|
context: KtxScanContext;
|
||||||
settings: KtxScanRelationshipConfig;
|
settings: KtxScanRelationshipConfig;
|
||||||
llmRuntime?: KtxLlmRuntimePort | null;
|
llmRuntime?: KtxLlmRuntimePort | null;
|
||||||
|
progress?: KtxProgressPort;
|
||||||
|
now?: () => number;
|
||||||
}
|
}
|
||||||
|
|
||||||
export interface DiscoverKtxRelationshipsResult {
|
export interface DiscoverKtxRelationshipsResult {
|
||||||
|
|
@ -51,6 +59,7 @@ export interface DiscoverKtxRelationshipsResult {
|
||||||
statisticalValidation: KtxScanEnrichmentSummary['statisticalValidation'];
|
statisticalValidation: KtxScanEnrichmentSummary['statisticalValidation'];
|
||||||
llmRelationshipValidation: KtxScanEnrichmentSummary['llmRelationshipValidation'];
|
llmRelationshipValidation: KtxScanEnrichmentSummary['llmRelationshipValidation'];
|
||||||
warnings: KtxScanWarning[];
|
warnings: KtxScanWarning[];
|
||||||
|
partial: { reason: KtxRelationshipDetectionStopReason } | null;
|
||||||
}
|
}
|
||||||
|
|
||||||
function relationshipFromResolved(candidate: KtxResolvedRelationshipDiscoveryCandidate): KtxEnrichedRelationship {
|
function relationshipFromResolved(candidate: KtxResolvedRelationshipDiscoveryCandidate): KtxEnrichedRelationship {
|
||||||
|
|
@ -128,6 +137,8 @@ async function detectCompositeRelationships(input: {
|
||||||
executor: KtxRelationshipReadOnlyExecutor | null;
|
executor: KtxRelationshipReadOnlyExecutor | null;
|
||||||
context: DiscoverKtxRelationshipsInput['context'];
|
context: DiscoverKtxRelationshipsInput['context'];
|
||||||
warnings: KtxScanWarning[];
|
warnings: KtxScanWarning[];
|
||||||
|
budget: KtxRelationshipDetectionBudget;
|
||||||
|
progress?: KtxProgressPort;
|
||||||
}): Promise<KtxCompositeRelationshipCandidate[]> {
|
}): Promise<KtxCompositeRelationshipCandidate[]> {
|
||||||
if (!input.executor || !input.profile.sqlAvailable || !input.dialect) {
|
if (!input.executor || !input.profile.sqlAvailable || !input.dialect) {
|
||||||
return [];
|
return [];
|
||||||
|
|
@ -141,6 +152,8 @@ async function detectCompositeRelationships(input: {
|
||||||
profiles: input.profile,
|
profiles: input.profile,
|
||||||
executor: input.executor,
|
executor: input.executor,
|
||||||
ctx: input.context,
|
ctx: input.context,
|
||||||
|
budget: input.budget,
|
||||||
|
...(input.progress ? { progress: input.progress } : {}),
|
||||||
});
|
});
|
||||||
for (const warning of compositeDetection.warnings) {
|
for (const warning of compositeDetection.warnings) {
|
||||||
input.warnings.push({
|
input.warnings.push({
|
||||||
|
|
@ -220,6 +233,11 @@ export async function discoverKtxRelationships(
|
||||||
input: DiscoverKtxRelationshipsInput,
|
input: DiscoverKtxRelationshipsInput,
|
||||||
): Promise<DiscoverKtxRelationshipsResult> {
|
): Promise<DiscoverKtxRelationshipsResult> {
|
||||||
const { executor, warnings } = sqlExecutor(input);
|
const { executor, warnings } = sqlExecutor(input);
|
||||||
|
const budget = createKtxRelationshipDetectionBudget({
|
||||||
|
budgetMs: input.settings.detectionBudgetMs,
|
||||||
|
...(input.context.signal ? { signal: input.context.signal } : {}),
|
||||||
|
...(input.now ? { now: input.now } : {}),
|
||||||
|
});
|
||||||
const formalMetadata = collectKtxFormalMetadataRelationships(input.schema);
|
const formalMetadata = collectKtxFormalMetadataRelationships(input.schema);
|
||||||
const profileCache = createKtxRelationshipProfileCache();
|
const profileCache = createKtxRelationshipProfileCache();
|
||||||
const profile = await profileKtxRelationshipSchema({
|
const profile = await profileKtxRelationshipSchema({
|
||||||
|
|
@ -232,6 +250,8 @@ export async function discoverKtxRelationships(
|
||||||
profileSampleRows: input.settings.profileSampleRows,
|
profileSampleRows: input.settings.profileSampleRows,
|
||||||
profileConcurrency: input.settings.profileConcurrency,
|
profileConcurrency: input.settings.profileConcurrency,
|
||||||
cache: profileCache,
|
cache: profileCache,
|
||||||
|
budget,
|
||||||
|
...(input.progress ? { progress: input.progress } : {}),
|
||||||
});
|
});
|
||||||
const deterministicCandidates: KtxRelationshipDiscoveryCandidate[] = generateKtxRelationshipDiscoveryCandidates(
|
const deterministicCandidates: KtxRelationshipDiscoveryCandidate[] = generateKtxRelationshipDiscoveryCandidates(
|
||||||
input.schema,
|
input.schema,
|
||||||
|
|
@ -240,17 +260,21 @@ export async function discoverKtxRelationships(
|
||||||
profiles: profile,
|
profiles: profile,
|
||||||
},
|
},
|
||||||
);
|
);
|
||||||
const llmProposalResult = input.settings.llmProposals
|
// The LLM proposal is one more unit of relationship work, so it honors the same
|
||||||
? await proposeKtxRelationshipCandidatesWithLlm({
|
// budget/abort gate as profiling, validation, and composite probing — a stage
|
||||||
connectionId: input.connectionId,
|
// that already exhausted its budget (or was aborted) must not start a fresh call.
|
||||||
schema: input.schema,
|
const llmProposalResult =
|
||||||
profile,
|
input.settings.llmProposals && !budget.check()
|
||||||
llmRuntime: input.llmRuntime ?? null,
|
? await proposeKtxRelationshipCandidatesWithLlm({
|
||||||
settings: {
|
connectionId: input.connectionId,
|
||||||
maxTablesPerBatch: input.settings.maxLlmTablesPerBatch,
|
schema: input.schema,
|
||||||
},
|
profile,
|
||||||
})
|
llmRuntime: input.llmRuntime ?? null,
|
||||||
: { candidates: [], warnings: [], llmCalls: 0, summary: 'skipped' as const };
|
settings: {
|
||||||
|
maxTablesPerBatch: input.settings.maxLlmTablesPerBatch,
|
||||||
|
},
|
||||||
|
})
|
||||||
|
: { candidates: [], warnings: [], llmCalls: 0, summary: 'skipped' as const };
|
||||||
const candidates = mergeKtxRelationshipDiscoveryCandidates([
|
const candidates = mergeKtxRelationshipDiscoveryCandidates([
|
||||||
...deterministicCandidates,
|
...deterministicCandidates,
|
||||||
...llmProposalResult.candidates,
|
...llmProposalResult.candidates,
|
||||||
|
|
@ -271,6 +295,8 @@ export async function discoverKtxRelationships(
|
||||||
concurrency: input.settings.validationConcurrency,
|
concurrency: input.settings.validationConcurrency,
|
||||||
validationBudget: input.settings.validationBudget,
|
validationBudget: input.settings.validationBudget,
|
||||||
},
|
},
|
||||||
|
budget,
|
||||||
|
...(input.progress ? { progress: input.progress } : {}),
|
||||||
});
|
});
|
||||||
const graph = resolveKtxRelationshipGraph({
|
const graph = resolveKtxRelationshipGraph({
|
||||||
schema: input.schema,
|
schema: input.schema,
|
||||||
|
|
@ -290,6 +316,8 @@ export async function discoverKtxRelationships(
|
||||||
executor,
|
executor,
|
||||||
context: input.context,
|
context: input.context,
|
||||||
warnings,
|
warnings,
|
||||||
|
budget,
|
||||||
|
...(input.progress ? { progress: input.progress } : {}),
|
||||||
});
|
});
|
||||||
const inferredAccepted = nonFormalAcceptedRelationships({
|
const inferredAccepted = nonFormalAcceptedRelationships({
|
||||||
formalIds: formalMetadata.acceptedIds,
|
formalIds: formalMetadata.acceptedIds,
|
||||||
|
|
@ -312,6 +340,7 @@ export async function discoverKtxRelationships(
|
||||||
resolvedRelationships: graph.relationships,
|
resolvedRelationships: graph.relationships,
|
||||||
});
|
});
|
||||||
const compositeCounts = compositeSummary(compositeRelationships);
|
const compositeCounts = compositeSummary(compositeRelationships);
|
||||||
|
const stopReason = budget.stopReason();
|
||||||
|
|
||||||
return {
|
return {
|
||||||
relationshipUpdate: {
|
relationshipUpdate: {
|
||||||
|
|
@ -329,8 +358,11 @@ export async function discoverKtxRelationships(
|
||||||
profile,
|
profile,
|
||||||
resolvedRelationships: graph.relationships,
|
resolvedRelationships: graph.relationships,
|
||||||
compositeRelationships,
|
compositeRelationships,
|
||||||
statisticalValidation: profile.sqlAvailable ? 'completed' : 'skipped',
|
// A budget/abort stop means profiling did not finish, so report it as not
|
||||||
|
// completed even though the SQL capability was available.
|
||||||
|
statisticalValidation: profile.sqlAvailable && !stopReason ? 'completed' : 'skipped',
|
||||||
llmRelationshipValidation: llmProposalResult.summary,
|
llmRelationshipValidation: llmProposalResult.summary,
|
||||||
warnings,
|
warnings,
|
||||||
|
partial: stopReason ? { reason: stopReason } : null,
|
||||||
};
|
};
|
||||||
}
|
}
|
||||||
|
|
|
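The hunks above thread a single `budget` through profiling, LLM proposal, validation and composite probing, and read `budget.stopReason()` once at the end to fill `partial`. The implementation of `createKtxRelationshipDetectionBudget` is not part of this diff, so the following is only a minimal sketch of how such a wall-clock/abort gate could behave. All names here (`createDetectionBudget`, `StopReason`, and the reason strings) are hypothetical, and it assumes `check()` returns a truthy stop reason once work should stop, which is what the `!budget.check()` guard suggests.

```typescript
// Hypothetical sketch: names and reason strings are assumptions, not the ktx API.
type StopReason = 'budget_exhausted' | 'aborted';

interface DetectionBudget {
  /** Returns the stop reason once work should stop, otherwise null. */
  check(): StopReason | null;
  /** Returns the first stop reason recorded by check(), or null. */
  stopReason(): StopReason | null;
}

function createDetectionBudget(options: {
  budgetMs?: number;
  // Structural stand-in for an AbortSignal: only `aborted` is read.
  signal?: { readonly aborted: boolean };
  // Injectable clock so tests can advance time deterministically.
  now?: () => number;
}): DetectionBudget {
  const now = options.now ?? (() => Date.now());
  const startedAt = now();
  let stopped: StopReason | null = null;

  return {
    check(): StopReason | null {
      if (stopped) {
        return stopped; // once stopped, stay stopped
      }
      if (options.signal?.aborted) {
        stopped = 'aborted';
      } else if (options.budgetMs !== undefined && now() - startedAt >= options.budgetMs) {
        stopped = 'budget_exhausted';
      }
      return stopped;
    },
    stopReason(): StopReason | null {
      return stopped;
    },
  };
}

// Usage in the shape of the result mapping above: a recorded stop reason
// becomes `partial`, otherwise `partial` is null.
const exampleBudget = createDetectionBudget({ budgetMs: 30_000 });
const exampleStop = exampleBudget.check();
const examplePartial = exampleStop ? { reason: exampleStop } : null;
```

The injectable `now` mirrors the optional `now?: () => number` field added to `DiscoverKtxRelationshipsInput`, which makes the deadline testable without real timers.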
||||||
|
|
@ -96,6 +96,10 @@ function rowCountForTable(profile: KtxRelationshipProfileArtifact, table: KtxEnr
|
||||||
return profile.tables.find((item) => item.table.name.toLowerCase() === table.ref.name.toLowerCase())?.rowCount ?? null;
|
return profile.tables.find((item) => item.table.name.toLowerCase() === table.ref.name.toLowerCase())?.rowCount ?? null;
|
||||||
}
|
}
|
||||||
|
|
||||||
|
function resolvedDescription(descriptions: Partial<Record<string, string>>): string | null {
|
||||||
|
return descriptions.ai ?? descriptions.db ?? null;
|
||||||
|
}
|
||||||
|
|
||||||
function buildEvidencePacket(
|
function buildEvidencePacket(
|
||||||
schema: KtxEnrichedSchema,
|
schema: KtxEnrichedSchema,
|
||||||
profile: KtxRelationshipProfileArtifact,
|
profile: KtxRelationshipProfileArtifact,
|
||||||
|
|
@ -107,13 +111,17 @@ function buildEvidencePacket(
|
||||||
tables: schema.tables
|
tables: schema.tables
|
||||||
.filter((table) => table.enabled)
|
.filter((table) => table.enabled)
|
||||||
.slice(0, settings.maxTablesPerBatch)
|
.slice(0, settings.maxTablesPerBatch)
|
||||||
.map((table) => ({
|
.map((table) => {
|
||||||
|
const tableDescription = resolvedDescription(table.descriptions);
|
||||||
|
return {
|
||||||
name: table.ref.name,
|
name: table.ref.name,
|
||||||
catalog: table.ref.catalog,
|
catalog: table.ref.catalog,
|
||||||
db: table.ref.db,
|
db: table.ref.db,
|
||||||
rowCount: rowCountForTable(profile, table),
|
rowCount: rowCountForTable(profile, table),
|
||||||
|
...(tableDescription ? { description: tableDescription } : {}),
|
||||||
columns: table.columns.slice(0, settings.maxColumnsPerTable).map((column) => {
|
columns: table.columns.slice(0, settings.maxColumnsPerTable).map((column) => {
|
||||||
const columnProfile = profileForColumn(profile, table, column);
|
const columnProfile = profileForColumn(profile, table, column);
|
||||||
|
const columnDescription = resolvedDescription(column.descriptions);
|
||||||
return {
|
return {
|
||||||
name: column.name,
|
name: column.name,
|
||||||
nativeType: column.nativeType,
|
nativeType: column.nativeType,
|
||||||
|
|
@ -121,6 +129,7 @@ function buildEvidencePacket(
|
||||||
dimensionType: column.dimensionType,
|
dimensionType: column.dimensionType,
|
||||||
nullable: column.nullable,
|
nullable: column.nullable,
|
||||||
declaredPrimaryKey: column.primaryKey,
|
declaredPrimaryKey: column.primaryKey,
|
||||||
|
...(columnDescription ? { description: columnDescription } : {}),
|
||||||
profile: columnProfile
|
profile: columnProfile
|
||||||
? {
|
? {
|
||||||
rowCount: columnProfile.rowCount,
|
rowCount: columnProfile.rowCount,
|
||||||
|
|
@ -133,7 +142,8 @@ function buildEvidencePacket(
|
||||||
: null,
|
: null,
|
||||||
};
|
};
|
||||||
}),
|
}),
|
||||||
})),
|
};
|
||||||
|
}),
|
||||||
};
|
};
|
||||||
}
|
}
|
||||||
|
|
||||||
|
|
|
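The `buildEvidencePacket` change above adds a `description` key only when a description resolves, preferring the `ai` text over the `db` text. A small standalone sketch of that pattern follows; `tableEntry` is a hypothetical helper used only for illustration, while `resolvedDescription` copies the function shown in the hunk.

```typescript
// Copied from the hunk above: prefer the AI description, then the DB one.
function resolvedDescription(descriptions: Partial<Record<string, string>>): string | null {
  return descriptions.ai ?? descriptions.db ?? null;
}

// Hypothetical helper: builds an entry and omits `description` entirely
// when nothing resolves, instead of emitting `description: null`.
function tableEntry(
  name: string,
  descriptions: Partial<Record<string, string>>,
): { name: string; description?: string } {
  const description = resolvedDescription(descriptions);
  return {
    name,
    ...(description ? { description } : {}),
  };
}

const withBoth = tableEntry('orders', { ai: 'Customer orders', db: 'orders tbl' });
const withDbOnly = tableEntry('orders', { db: 'orders tbl' });
const withNone = tableEntry('orders', {});
```

The conditional spread keeps the serialized packet free of empty keys for tables and columns that have no description.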
||||||
|
|
@ -1,8 +1,9 @@
|
||||||
import type { KtxSqlDialect } from '../connections/dialects.js';
|
import type { KtxSqlDialect } from '../connections/dialects.js';
|
||||||
import type { KtxEnrichedColumn, KtxEnrichedSchema, KtxEnrichedTable } from './enrichment-types.js';
|
import type { KtxEnrichedColumn, KtxEnrichedSchema, KtxEnrichedTable } from './enrichment-types.js';
|
||||||
import { mapWithConcurrency } from './relationship-validation.js';
|
import { type KtxRelationshipDetectionBudget, mapWithBudget } from './relationship-detection-budget.js';
|
||||||
import type {
|
import type {
|
||||||
KtxConnectionDriver,
|
KtxConnectionDriver,
|
||||||
|
KtxProgressPort,
|
||||||
KtxQueryResult,
|
KtxQueryResult,
|
||||||
KtxReadOnlyQueryInput,
|
KtxReadOnlyQueryInput,
|
||||||
KtxScanContext,
|
KtxScanContext,
|
||||||
|
|
@ -65,6 +66,8 @@ export interface ProfileKtxRelationshipSchemaInput {
|
||||||
profileSampleRows?: number;
|
profileSampleRows?: number;
|
||||||
profileConcurrency?: number;
|
profileConcurrency?: number;
|
||||||
cache?: KtxRelationshipProfileCache;
|
cache?: KtxRelationshipProfileCache;
|
||||||
|
budget?: KtxRelationshipDetectionBudget;
|
||||||
|
progress?: KtxProgressPort;
|
||||||
}
|
}
|
||||||
|
|
||||||
export function createKtxRelationshipProfileCache(): KtxRelationshipProfileCache {
|
export function createKtxRelationshipProfileCache(): KtxRelationshipProfileCache {
|
||||||
|
|
@ -341,10 +344,14 @@ export async function profileKtxRelationshipSchema(
|
||||||
const dialect = input.dialect;
|
const dialect = input.dialect;
|
||||||
|
|
||||||
const enabledTables = input.schema.tables.filter((candidate) => candidate.enabled);
|
const enabledTables = input.schema.tables.filter((candidate) => candidate.enabled);
|
||||||
const tableResults = await mapWithConcurrency<KtxEnrichedTable, TableProfileResult>(
|
const { results: tableResults } = await mapWithBudget<KtxEnrichedTable, TableProfileResult>({
|
||||||
enabledTables,
|
inputs: enabledTables,
|
||||||
input.profileConcurrency ?? 4,
|
concurrency: input.profileConcurrency ?? 4,
|
||||||
async (table) => {
|
budget: input.budget,
|
||||||
|
onStart: async (index, total) => {
|
||||||
|
await input.progress?.update((index + 1) / total, `Profiling table ${index + 1}/${total}`, { transient: true });
|
||||||
|
},
|
||||||
|
mapOne: async (table) => {
|
||||||
const sampleValuesPerColumn = input.sampleValuesPerColumn ?? 5;
|
const sampleValuesPerColumn = input.sampleValuesPerColumn ?? 5;
|
||||||
const profileSampleRows = input.profileSampleRows ?? 10000;
|
const profileSampleRows = input.profileSampleRows ?? 10000;
|
||||||
const cacheKey = tableProfileCacheKey({
|
const cacheKey = tableProfileCacheKey({
|
||||||
|
|
@ -387,9 +394,12 @@ export async function profileKtxRelationshipSchema(
|
||||||
return { cached: cachedFailure, queryCount: 0 };
|
return { cached: cachedFailure, queryCount: 0 };
|
||||||
}
|
}
|
||||||
},
|
},
|
||||||
);
|
});
|
||||||
|
|
||||||
for (const result of tableResults) {
|
for (const result of tableResults) {
|
||||||
|
if (!result) {
|
||||||
|
continue;
|
||||||
|
}
|
||||||
if ('tableProfile' in result) {
|
if ('tableProfile' in result) {
|
||||||
queryTotal += result.tableProfile.queryCount;
|
queryTotal += result.tableProfile.queryCount;
|
||||||
tables.push(result.tableProfile.table);
|
tables.push(result.tableProfile.table);
|
||||||
|
|
|
||||||
|
|
@ -1,12 +1,14 @@
|
||||||
|
import { KtxQueryError } from '../../errors.js';
|
||||||
import type { KtxSqlDialect } from '../connections/dialects.js';
|
import type { KtxSqlDialect } from '../connections/dialects.js';
|
||||||
import type { KtxRelationshipEndpoint } from './enrichment-types.js';
|
import type { KtxRelationshipEndpoint } from './enrichment-types.js';
|
||||||
import { applyKtxRelationshipValidationBudget, type KtxRelationshipValidationBudget } from './relationship-budget.js';
|
import { applyKtxRelationshipValidationBudget, type KtxRelationshipValidationBudget } from './relationship-budget.js';
|
||||||
import type { KtxRelationshipDiscoveryCandidate } from './relationship-candidates.js';
|
import type { KtxRelationshipDiscoveryCandidate } from './relationship-candidates.js';
|
||||||
|
import { type KtxRelationshipDetectionBudget, mapWithBudget } from './relationship-detection-budget.js';
|
||||||
import {
|
import {
|
||||||
type KtxRelationshipProfileArtifact,
|
type KtxRelationshipProfileArtifact,
|
||||||
type KtxRelationshipReadOnlyExecutor,
|
type KtxRelationshipReadOnlyExecutor,
|
||||||
} from './relationship-profiling.js';
|
} from './relationship-profiling.js';
|
||||||
import type { KtxQueryResult, KtxScanContext, KtxTableRef } from './types.js';
|
import type { KtxProgressPort, KtxQueryResult, KtxScanContext, KtxTableRef } from './types.js';
|
||||||
|
|
||||||
type KtxValidatedRelationshipStatus = 'accepted' | 'review' | 'rejected';
|
type KtxValidatedRelationshipStatus = 'accepted' | 'review' | 'rejected';
|
||||||
|
|
||||||
|
|
@ -51,6 +53,8 @@ export interface ValidateKtxRelationshipDiscoveryCandidatesInput {
|
||||||
ctx: KtxScanContext;
|
ctx: KtxScanContext;
|
||||||
tableCount?: number;
|
tableCount?: number;
|
||||||
settings?: Partial<KtxRelationshipValidationSettings>;
|
settings?: Partial<KtxRelationshipValidationSettings>;
|
||||||
|
budget?: KtxRelationshipDetectionBudget;
|
||||||
|
progress?: KtxProgressPort;
|
||||||
}
|
}
|
||||||
|
|
||||||
const DEFAULT_SETTINGS: KtxRelationshipValidationSettings = {
|
const DEFAULT_SETTINGS: KtxRelationshipValidationSettings = {
|
||||||
|
|
@ -182,31 +186,10 @@ function statusFor(input: {
|
||||||
return 'rejected';
|
return 'rejected';
|
||||||
}
|
}
|
||||||
|
|
||||||
export async function mapWithConcurrency<TInput, TOutput>(
|
|
||||||
inputs: readonly TInput[],
|
|
||||||
concurrency: number,
|
|
||||||
mapOne: (input: TInput) => Promise<TOutput>,
|
|
||||||
): Promise<TOutput[]> {
|
|
||||||
const safeConcurrency = Math.max(1, Math.floor(concurrency));
|
|
||||||
const outputs: TOutput[] = new Array(inputs.length);
|
|
||||||
let nextIndex = 0;
|
|
||||||
|
|
||||||
async function worker(): Promise<void> {
|
|
||||||
while (nextIndex < inputs.length) {
|
|
||||||
const index = nextIndex;
|
|
||||||
nextIndex += 1;
|
|
||||||
outputs[index] = await mapOne(inputs[index] as TInput);
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
await Promise.all(Array.from({ length: Math.min(safeConcurrency, inputs.length) }, () => worker()));
|
|
||||||
return outputs;
|
|
||||||
}
|
|
||||||
|
|
||||||
function reviewWithoutValidation(
|
function reviewWithoutValidation(
|
||||||
candidate: KtxRelationshipDiscoveryCandidate,
|
candidate: KtxRelationshipDiscoveryCandidate,
|
||||||
profiles: KtxRelationshipProfileArtifact,
|
profiles: KtxRelationshipProfileArtifact,
|
||||||
reason: 'validation_unavailable' | 'profile_unavailable' | 'validation_unattempted',
|
reason: 'validation_unavailable' | 'profile_unavailable' | 'validation_unattempted' | 'validation_query_failed',
|
||||||
): KtxValidatedRelationshipDiscoveryCandidate {
|
): KtxValidatedRelationshipDiscoveryCandidate {
|
||||||
const sourceColumn = singleRelationshipColumn(candidate.from);
|
const sourceColumn = singleRelationshipColumn(candidate.from);
|
||||||
const targetColumn = singleRelationshipColumn(candidate.to);
|
const targetColumn = singleRelationshipColumn(candidate.to);
|
||||||
|
|
@ -257,21 +240,35 @@ export async function validateKtxRelationshipDiscoveryCandidates(
|
||||||
return reviewWithoutValidation(candidate, input.profiles, 'profile_unavailable');
|
return reviewWithoutValidation(candidate, input.profiles, 'profile_unavailable');
|
||||||
}
|
}
|
||||||
|
|
||||||
const result = await executor.executeReadOnly(
|
let result: KtxQueryResult;
|
||||||
{
|
try {
|
||||||
connectionId: input.connectionId,
|
result = await executor.executeReadOnly(
|
||||||
sql: buildCoverageSql({
|
{
|
||||||
dialect,
|
connectionId: input.connectionId,
|
||||||
childTable: candidate.from.table,
|
sql: buildCoverageSql({
|
||||||
childColumn: sourceColumn,
|
dialect,
|
||||||
parentTable: candidate.to.table,
|
childTable: candidate.from.table,
|
||||||
parentColumn: targetColumn,
|
childColumn: sourceColumn,
|
||||||
maxDistinctSourceValues: settings.maxDistinctSourceValues,
|
parentTable: candidate.to.table,
|
||||||
}),
|
parentColumn: targetColumn,
|
||||||
maxRows: 1,
|
maxDistinctSourceValues: settings.maxDistinctSourceValues,
|
||||||
},
|
}),
|
||||||
input.ctx,
|
maxRows: 1,
|
||||||
);
|
},
|
||||||
|
input.ctx,
|
||||||
|
);
|
||||||
|
} catch (error) {
|
||||||
|
// A bounded-query timeout (or other query rejection) on this one coverage
|
||||||
|
// probe is best-effort: skip the candidate to review rather than aborting
|
||||||
|
// the whole validation pass.
|
||||||
|
if (error instanceof KtxQueryError) {
|
||||||
|
input.ctx.logger?.warn(
|
||||||
|
`relationship validation query skipped for ${candidate.from.table.name}.${sourceColumn} -> ${candidate.to.table.name}.${targetColumn}: ${error.message}`,
|
||||||
|
);
|
||||||
|
return reviewWithoutValidation(candidate, input.profiles, 'validation_query_failed');
|
||||||
|
}
|
||||||
|
throw error;
|
||||||
|
}
|
||||||
const childDistinct = numberAt(result, 'child_distinct');
|
const childDistinct = numberAt(result, 'child_distinct');
|
||||||
const parentDistinct = numberAt(result, 'parent_distinct');
|
const parentDistinct = numberAt(result, 'parent_distinct');
|
||||||
const overlap = numberAt(result, 'overlap');
|
const overlap = numberAt(result, 'overlap');
|
||||||
|
|
@ -330,18 +327,29 @@ export async function validateKtxRelationshipDiscoveryCandidates(
|
||||||
budget: settings.validationBudget,
|
budget: settings.validationBudget,
|
||||||
score: (candidate) => candidate.confidence,
|
score: (candidate) => candidate.confidence,
|
||||||
});
|
});
|
||||||
const validated = await mapWithConcurrency(
|
const { results: validated } = await mapWithBudget({
|
||||||
budgeted.toValidate.map((entry) => entry.candidate),
|
inputs: budgeted.toValidate,
|
||||||
settings.concurrency,
|
concurrency: settings.concurrency,
|
||||||
validateCandidate,
|
budget: input.budget,
|
||||||
);
|
onStart: async (index, total) => {
|
||||||
|
await input.progress?.update((index + 1) / total, `Validating candidate ${index + 1}/${total}`, {
|
||||||
|
transient: true,
|
||||||
|
});
|
||||||
|
},
|
||||||
|
mapOne: (entry) => validateCandidate(entry.candidate),
|
||||||
|
});
|
||||||
const byOriginalIndex = new Map<number, KtxValidatedRelationshipDiscoveryCandidate>();
|
const byOriginalIndex = new Map<number, KtxValidatedRelationshipDiscoveryCandidate>();
|
||||||
for (let index = 0; index < budgeted.toValidate.length; index += 1) {
|
for (let index = 0; index < budgeted.toValidate.length; index += 1) {
|
||||||
const originalIndex = budgeted.toValidate[index]?.originalIndex;
|
const entry = budgeted.toValidate[index];
|
||||||
const candidate = validated[index];
|
if (!entry) {
|
||||||
if (originalIndex !== undefined && candidate) {
|
continue;
|
||||||
byOriginalIndex.set(originalIndex, candidate);
|
|
||||||
}
|
}
|
||||||
|
// A candidate left unvalidated by the wall-clock budget degrades to the
|
||||||
|
// same review status as one deferred by the validation count budget.
|
||||||
|
byOriginalIndex.set(
|
||||||
|
entry.originalIndex,
|
||||||
|
validated[index] ?? reviewWithoutValidation(entry.candidate, input.profiles, 'validation_unattempted'),
|
||||||
|
);
|
||||||
}
|
}
|
||||||
for (const entry of budgeted.deferred) {
|
for (const entry of budgeted.deferred) {
|
||||||
byOriginalIndex.set(
|
byOriginalIndex.set(
|
||||||
|
|
|
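This file drops `mapWithConcurrency` in favor of `mapWithBudget`, whose source is not shown in the diff. Going by the call sites (`inputs`, `concurrency`, `budget`, `onStart`, `mapOne`, and a `results` array whose entries may be missing), one way it could work is sketched below. This is an assumption-laden illustration built on the removed `mapWithConcurrency` loop, not the actual implementation; `mapWithBudgetSketch` and `BudgetLike` are hypothetical names.

```typescript
// Hypothetical sketch of a budget-aware concurrent map.
interface BudgetLike {
  // Truthy once no further work should be started.
  check(): unknown;
}

async function mapWithBudgetSketch<TInput, TOutput>(options: {
  inputs: readonly TInput[];
  concurrency: number;
  budget?: BudgetLike;
  onStart?: (index: number, total: number) => Promise<void> | void;
  mapOne: (input: TInput) => Promise<TOutput>;
}): Promise<{ results: Array<TOutput | undefined> }> {
  const { inputs, budget, onStart, mapOne } = options;
  const workerCount = Math.max(1, Math.floor(options.concurrency));
  // Slots stay undefined for inputs that were never started.
  const results = new Array<TOutput | undefined>(inputs.length).fill(undefined);
  let nextIndex = 0;

  async function worker(): Promise<void> {
    while (nextIndex < inputs.length) {
      // Gate each unit of work: in-flight items finish, new ones do not start.
      if (budget?.check()) {
        return;
      }
      const index = nextIndex;
      nextIndex += 1;
      await onStart?.(index, inputs.length);
      results[index] = await mapOne(inputs[index] as TInput);
    }
  }

  await Promise.all(Array.from({ length: Math.min(workerCount, inputs.length) }, () => worker()));
  return { results };
}
```

Leaving unstarted slots `undefined` is what lets the caller above substitute `reviewWithoutValidation(entry.candidate, input.profiles, 'validation_unattempted')` for each candidate the budget skipped.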
||||||
|
|
@ -61,6 +61,9 @@ function isSafeRunId(runId: string): boolean {
|
||||||
return /^[a-zA-Z0-9][a-zA-Z0-9_.-]*$/.test(runId);
|
return /^[a-zA-Z0-9][a-zA-Z0-9_.-]*$/.test(runId);
|
||||||
}
|
}
|
||||||
|
|
||||||
|
const STAGES_TABLE = 'local_scan_enrichment_stages';
|
||||||
|
const STAGES_PRIMARY_KEY = ['connection_id', 'stage', 'input_hash'] as const;
|
||||||
|
|
||||||
export class SqliteLocalScanEnrichmentStateStore implements KtxScanEnrichmentStateStore {
|
export class SqliteLocalScanEnrichmentStateStore implements KtxScanEnrichmentStateStore {
|
||||||
private readonly db: Database.Database;
|
private readonly db: Database.Database;
|
||||||
|
|
||||||
|
|
@ -68,6 +71,10 @@ export class SqliteLocalScanEnrichmentStateStore implements KtxScanEnrichmentSta
|
||||||
mkdirSync(dirname(options.dbPath), { recursive: true });
|
mkdirSync(dirname(options.dbPath), { recursive: true });
|
||||||
this.db = new Database(options.dbPath);
|
this.db = new Database(options.dbPath);
|
||||||
this.db.pragma('journal_mode = WAL');
|
this.db.pragma('journal_mode = WAL');
|
||||||
|
// Disposable local resume cache: if a prior ktx wrote the table with a
|
||||||
|
// different primary key, drop it rather than migrate. Losing it only means
|
||||||
|
// one ingest cannot resume; it never corrupts a queryable artifact.
|
||||||
|
this.dropStagesTableIfPrimaryKeyDiffers();
|
||||||
this.db.exec(`
|
this.db.exec(`
|
||||||
CREATE TABLE IF NOT EXISTS local_scan_enrichment_stages (
|
CREATE TABLE IF NOT EXISTS local_scan_enrichment_stages (
|
||||||
run_id TEXT NOT NULL,
|
run_id TEXT NOT NULL,
|
||||||
|
|
@ -80,32 +87,53 @@ export class SqliteLocalScanEnrichmentStateStore implements KtxScanEnrichmentSta
|
||||||
output_json TEXT,
|
output_json TEXT,
|
||||||
error_message TEXT,
|
error_message TEXT,
|
||||||
updated_at TEXT NOT NULL,
|
updated_at TEXT NOT NULL,
|
||||||
PRIMARY KEY (run_id, stage)
|
PRIMARY KEY (connection_id, stage, input_hash)
|
||||||
);
|
);
|
||||||
|
|
||||||
|
CREATE INDEX IF NOT EXISTS local_scan_enrichment_stages_content_idx
|
||||||
|
ON local_scan_enrichment_stages (connection_id, stage, input_hash, updated_at);
|
||||||
CREATE INDEX IF NOT EXISTS local_scan_enrichment_stages_run_idx
|
CREATE INDEX IF NOT EXISTS local_scan_enrichment_stages_run_idx
|
||||||
ON local_scan_enrichment_stages (run_id, updated_at, stage);
|
ON local_scan_enrichment_stages (run_id, updated_at, stage);
|
||||||
`);
|
`);
|
||||||
}
|
}
|
||||||
|
|
||||||
|
private dropStagesTableIfPrimaryKeyDiffers(): void {
|
||||||
|
const columns = this.db.prepare(`PRAGMA table_info(${STAGES_TABLE})`).all() as Array<{
|
||||||
|
name: string;
|
||||||
|
pk: number;
|
||||||
|
}>;
|
||||||
|
if (columns.length === 0) {
|
||||||
|
return;
|
||||||
|
}
|
||||||
|
const primaryKey = columns
|
||||||
|
.filter((column) => column.pk > 0)
|
||||||
|
.sort((left, right) => left.pk - right.pk)
|
||||||
|
.map((column) => column.name);
|
||||||
|
const matches =
|
||||||
|
primaryKey.length === STAGES_PRIMARY_KEY.length &&
|
||||||
|
primaryKey.every((name, index) => name === STAGES_PRIMARY_KEY[index]);
|
||||||
|
if (!matches) {
|
||||||
|
this.db.exec(`DROP TABLE ${STAGES_TABLE}`);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
async findCompletedStage<TOutput = unknown>(
|
async findCompletedStage<TOutput = unknown>(
|
||||||
input: KtxScanEnrichmentStageLookup,
|
input: KtxScanEnrichmentStageLookup,
|
||||||
): Promise<KtxScanEnrichmentCompletedStage<TOutput> | null> {
|
): Promise<KtxScanEnrichmentCompletedStage<TOutput> | null> {
|
||||||
if (!isSafeRunId(input.runId)) {
|
|
||||||
return null;
|
|
||||||
}
|
|
||||||
const row = this.db
|
const row = this.db
|
||||||
.prepare(
|
.prepare(
|
||||||
`
|
`
|
||||||
SELECT *
|
SELECT *
|
||||||
FROM local_scan_enrichment_stages
|
FROM local_scan_enrichment_stages
|
||||||
WHERE run_id = ?
|
WHERE connection_id = ?
|
||||||
AND stage = ?
|
AND stage = ?
|
||||||
AND input_hash = ?
|
AND input_hash = ?
|
||||||
AND status = 'completed'
|
AND status = 'completed'
|
||||||
|
ORDER BY updated_at DESC
|
||||||
|
LIMIT 1
|
||||||
`,
|
`,
|
||||||
)
|
)
|
||||||
.get(input.runId, input.stage, input.inputHash) as StageRow | undefined;
|
.get(input.connectionId, input.stage, input.inputHash) as StageRow | undefined;
|
||||||
|
|
||||||
if (!row) {
|
if (!row) {
|
||||||
return null;
|
return null;
|
||||||
|
|
@ -114,6 +142,31 @@ export class SqliteLocalScanEnrichmentStateStore implements KtxScanEnrichmentSta
|
||||||
return parsed.status === 'completed' ? parsed : null;
|
return parsed.status === 'completed' ? parsed : null;
|
||||||
}
|
}
|
||||||
|
|
||||||
|
async findLatestCompletedStage(input: {
|
||||||
|
connectionId: string;
|
||||||
|
stage: KtxScanEnrichmentStage;
|
||||||
|
}): Promise<KtxScanEnrichmentCompletedStage | null> {
|
||||||
|
const row = this.db
|
||||||
|
.prepare(
|
||||||
|
`
|
||||||
|
SELECT *
|
||||||
|
FROM local_scan_enrichment_stages
|
||||||
|
WHERE connection_id = ?
|
||||||
|
AND stage = ?
|
||||||
|
AND status = 'completed'
|
||||||
|
ORDER BY updated_at DESC
|
||||||
|
LIMIT 1
|
||||||
|
`,
|
||||||
|
)
|
||||||
|
.get(input.connectionId, input.stage) as StageRow | undefined;
|
||||||
|
|
||||||
|
if (!row) {
|
||||||
|
return null;
|
||||||
|
}
|
||||||
|
const parsed = parseStageRow(row);
|
||||||
|
return parsed.status === 'completed' ? parsed : null;
|
||||||
|
}
|
||||||
|
|
||||||
async saveCompletedStage<TOutput = unknown>(
|
async saveCompletedStage<TOutput = unknown>(
|
||||||
input: Omit<KtxScanEnrichmentCompletedStage<TOutput>, 'status' | 'errorMessage'>,
|
input: Omit<KtxScanEnrichmentCompletedStage<TOutput>, 'status' | 'errorMessage'>,
|
||||||
): Promise<void> {
|
): Promise<void> {
|
||||||
|
|
@ -144,9 +197,8 @@ export class SqliteLocalScanEnrichmentStateStore implements KtxScanEnrichmentSta
|
||||||
NULL,
|
NULL,
|
||||||
@updatedAt
|
@updatedAt
|
||||||
)
|
)
|
||||||
ON CONFLICT(run_id, stage) DO UPDATE SET
|
ON CONFLICT(connection_id, stage, input_hash) DO UPDATE SET
|
||||||
input_hash = excluded.input_hash,
|
run_id = excluded.run_id,
|
||||||
connection_id = excluded.connection_id,
|
|
||||||
sync_id = excluded.sync_id,
|
sync_id = excluded.sync_id,
|
||||||
mode = excluded.mode,
|
mode = excluded.mode,
|
||||||
status = excluded.status,
|
status = excluded.status,
|
||||||
|
|
@ -195,9 +247,8 @@ export class SqliteLocalScanEnrichmentStateStore implements KtxScanEnrichmentSta
|
||||||
@errorMessage,
|
@errorMessage,
|
||||||
@updatedAt
|
@updatedAt
|
||||||
)
|
)
|
||||||
ON CONFLICT(run_id, stage) DO UPDATE SET
|
ON CONFLICT(connection_id, stage, input_hash) DO UPDATE SET
|
||||||
input_hash = excluded.input_hash,
|
run_id = excluded.run_id,
|
||||||
connection_id = excluded.connection_id,
|
|
||||||
sync_id = excluded.sync_id,
|
sync_id = excluded.sync_id,
|
||||||
mode = excluded.mode,
|
mode = excluded.mode,
|
||||||
status = excluded.status,
|
status = excluded.status,
|
||||||
|
|
|
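`dropStagesTableIfPrimaryKeyDiffers` above relies on `PRAGMA table_info`, where `pk` is `0` for columns outside the primary key and otherwise the 1-based position of the column within it. The comparison can be isolated as a pure function, sketched here with hand-written rows in place of a real database. `primaryKeyColumns` and `primaryKeyMatches` are hypothetical names, and the sample rows list only the columns needed for the illustration.

```typescript
// Shape of the PRAGMA table_info fields that the check reads.
interface TableInfoRow {
  name: string;
  pk: number; // 0 = not in the primary key, otherwise 1-based key position
}

// Primary-key column names in key order.
function primaryKeyColumns(rows: readonly TableInfoRow[]): string[] {
  return rows
    .filter((row) => row.pk > 0)
    .sort((left, right) => left.pk - right.pk)
    .map((row) => row.name);
}

// True when the table's primary key is exactly `expected`, in order.
function primaryKeyMatches(rows: readonly TableInfoRow[], expected: readonly string[]): boolean {
  const actual = primaryKeyColumns(rows);
  return actual.length === expected.length && actual.every((name, index) => name === expected[index]);
}

const EXPECTED_KEY = ['connection_id', 'stage', 'input_hash'] as const;

// Illustrative rows for the old layout: PRIMARY KEY (run_id, stage).
const oldLayout: TableInfoRow[] = [
  { name: 'run_id', pk: 1 },
  { name: 'connection_id', pk: 0 },
  { name: 'stage', pk: 2 },
  { name: 'input_hash', pk: 0 },
];

// Illustrative rows for the new layout: PRIMARY KEY (connection_id, stage, input_hash).
const newLayout: TableInfoRow[] = [
  { name: 'run_id', pk: 0 },
  { name: 'connection_id', pk: 1 },
  { name: 'stage', pk: 2 },
  { name: 'input_hash', pk: 3 },
];

const shouldDropOld = !primaryKeyMatches(oldLayout, EXPECTED_KEY);
const shouldDropNew = !primaryKeyMatches(newLayout, EXPECTED_KEY);
```

The method in the diff returns early when `table_info` yields no rows (the table does not exist yet), so this comparison only runs against an existing table.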
||||||
|
|
@ -385,12 +385,17 @@ type KtxScanWarningCode =
|
||||||
| 'embedding_unavailable'
|
| 'embedding_unavailable'
|
||||||
| 'scan_enrichment_backend_not_configured'
|
| 'scan_enrichment_backend_not_configured'
|
||||||
| 'relationship_validation_failed'
|
| 'relationship_validation_failed'
|
||||||
|
| 'relationship_detection_partial'
|
||||||
|
| 'enrichment_stage_skipped'
|
||||||
|
| 'enrichment_stage_stale'
|
||||||
| 'relationship_llm_invalid_reference'
|
| 'relationship_llm_invalid_reference'
|
||||||
| 'relationship_llm_proposal_failed'
|
| 'relationship_llm_proposal_failed'
|
||||||
| 'credential_redacted'
|
| 'credential_redacted'
|
||||||
| 'enrichment_failed'
|
| 'enrichment_failed'
|
||||||
|
| 'enrichment_timeout'
|
||||||
| 'description_fallback_used'
|
| 'description_fallback_used'
|
||||||
| 'constraint_discovery_unauthorized';
|
| 'constraint_discovery_unauthorized'
|
||||||
|
| 'object_introspection_failed';
|
||||||
|
|
||||||
export interface KtxScanWarning {
|
export interface KtxScanWarning {
|
||||||
code: KtxScanWarningCode;
|
code: KtxScanWarningCode;
|
||||||
|
|
|
||||||
|
|
@ -93,7 +93,7 @@ async function loadCandidates(
|
||||||
listed.files
|
listed.files
|
||||||
.map((path) => path.split('/')[1])
|
.map((path) => path.split('/')[1])
|
||||||
.filter((connectionId): connectionId is string =>
|
.filter((connectionId): connectionId is string =>
|
||||||
typeof connectionId === 'string' && /^[a-zA-Z0-9][a-zA-Z0-9_-]*$/.test(connectionId),
|
typeof connectionId === 'string' && /^[a-zA-Z0-9_][a-zA-Z0-9_-]*$/.test(connectionId),
|
||||||
),
|
),
|
||||||
),
|
),
|
||||||
].sort();
|
].sort();
|
||||||
|
|
|
||||||
|
|
@ -20,7 +20,7 @@ interface WriteSourceOptions {
|
||||||
}
|
}
|
||||||
|
|
||||||
const SL_DIR_PREFIX = 'semantic-layer';
|
const SL_DIR_PREFIX = 'semantic-layer';
|
||||||
const CONNECTION_ID_PATTERN = /^[a-zA-Z0-9][a-zA-Z0-9_-]*$/;
|
const CONNECTION_ID_PATTERN = /^[a-zA-Z0-9_][a-zA-Z0-9_-]*$/;
|
||||||
|
|
||||||
export interface LoadAllSourcesResult {
|
export interface LoadAllSourcesResult {
|
||||||
sources: SemanticLayerSource[];
|
sources: SemanticLayerSource[];
|
||||||
|
|
|
||||||
|
|
@ -39,7 +39,7 @@ export function assertSafeConnectionId(connectionId: string): string {
|
||||||
}
|
}
|
||||||
|
|
||||||
export function isSafeConnectionId(connectionId: string | undefined): connectionId is string {
|
export function isSafeConnectionId(connectionId: string | undefined): connectionId is string {
|
||||||
return typeof connectionId === 'string' && /^[a-zA-Z0-9][a-zA-Z0-9_-]*$/.test(connectionId);
|
return typeof connectionId === 'string' && /^[a-zA-Z0-9_][a-zA-Z0-9_-]*$/.test(connectionId);
|
||||||
}
|
}
|
||||||
|
|
||||||
export function sourceNameFromPath(path: string): string {
|
export function sourceNameFromPath(path: string): string {
|
||||||
|
|
|
||||||
|
|
@ -3,4 +3,4 @@ import { z } from 'zod';
|
||||||
export const slToolConnectionIdSchema = z
|
export const slToolConnectionIdSchema = z
|
||||||
.string()
|
.string()
|
||||||
.min(1)
|
.min(1)
|
||||||
.regex(/^[a-zA-Z0-9][a-zA-Z0-9_-]*$/, 'Connection id must be alphanumeric and may contain _ or -');
|
.regex(/^[a-zA-Z0-9_][a-zA-Z0-9_-]*$/, 'Connection id must be alphanumeric and may contain _ or -');
|
||||||
|
|
|
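The four hunks above make the same one-character change to the connection-id pattern: the first character class gains `_`. The sketch below compares the two patterns exactly as they appear in the diff; the constant names and sample ids are made up for illustration.

```typescript
// Patterns copied from the removed and added lines above.
const OLD_CONNECTION_ID_PATTERN = /^[a-zA-Z0-9][a-zA-Z0-9_-]*$/;
const NEW_CONNECTION_ID_PATTERN = /^[a-zA-Z0-9_][a-zA-Z0-9_-]*$/;

// Hypothetical sample ids.
const sampleIds = ['warehouse-1', '_staging', '-leading-dash', 'a/b', ''];

// One row per sample: which pattern accepts it.
const comparison = sampleIds.map((id) => ({
  id,
  acceptedBefore: OLD_CONNECTION_ID_PATTERN.test(id),
  acceptedAfter: NEW_CONNECTION_ID_PATTERN.test(id),
}));
```

Only ids that start with an underscore change status. A leading `-`, a path separator, or an empty string is still rejected, so the directory-name safety that the pattern provides for `semantic-layer/<connectionId>` paths is unchanged.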
||||||
49
packages/cli/src/context/sql-analysis/dialect-notes.ts
Normal file
@ -0,0 +1,49 @@
import { readFileSync } from 'node:fs';
import { fileURLToPath } from 'node:url';
import type { SqlAnalysisDialect } from './ports.js';

// Per-engine SQL syntax notes live as markdown files under ./dialects (one per
// dialect), served by the sql_dialect_notes MCP tool. They are package-internal:
// copy-runtime-assets.mjs ships them to dist, and they are never installed onto an
// agent target. The set covers every dialect reachable from a configured warehouse
// driver; duckdb/databricks are intentionally absent because no connector produces
// them.

/** @internal Dialects with an authored ./dialects/<dialect>.md file. */
export const DIALECTS_WITH_NOTES = [
  'postgres',
  'mysql',
  'snowflake',
  'bigquery',
  'sqlite',
  'clickhouse',
  'tsql',
] as const;

type DialectWithNotes = (typeof DIALECTS_WITH_NOTES)[number];

const notesCache = new Map<DialectWithNotes, string>();

function readDialectNotes(dialect: DialectWithNotes): string {
  const cached = notesCache.get(dialect);
  if (cached !== undefined) {
    return cached;
  }
  const path = fileURLToPath(new URL(`./dialects/${dialect}.md`, import.meta.url));
  const content = readFileSync(path, 'utf-8').trimEnd();
  notesCache.set(dialect, content);
  return content;
}

function hasNotes(dialect: SqlAnalysisDialect): dialect is DialectWithNotes {
  return (DIALECTS_WITH_NOTES as readonly string[]).includes(dialect);
}

/**
 * SQL syntax notes for a resolved dialect. Falls back to `postgres` — the
 * resolver's own default for unrecognized drivers — so any SQL connection yields
 * usable guidance rather than an empty string.
 */
export function sqlDialectNotes(dialect: SqlAnalysisDialect): string {
  return readDialectNotes(hasNotes(dialect) ? dialect : 'postgres');
}
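The lookup in `dialect-notes.ts` combines a type guard, a fallback to `postgres`, and a per-dialect cache. The sketch below reproduces that shape with an in-memory table instead of files on disk so it can run standalone. `NOTES_BY_DIALECT`, `readNotes`, `notesFor`, and the `loads` counter are hypothetical names introduced only for illustration, and the note strings are placeholders.

```typescript
// Hypothetical stand-in for the markdown files under ./dialects.
const NOTES_BY_DIALECT = {
  postgres: '**postgres** SQL conventions: (placeholder)\n',
  sqlite: '**sqlite** SQL conventions: (placeholder)\n',
} as const;

type DialectWithNotes = keyof typeof NOTES_BY_DIALECT;

const cache = new Map<DialectWithNotes, string>();
let loads = 0; // counts simulated reads so the caching is observable

function readNotes(dialect: DialectWithNotes): string {
  const cached = cache.get(dialect);
  if (cached !== undefined) {
    return cached;
  }
  loads += 1;
  const content = NOTES_BY_DIALECT[dialect].trimEnd();
  cache.set(dialect, content);
  return content;
}

// Type guard: narrows an arbitrary dialect name to one that has notes.
function hasNotes(dialect: string): dialect is DialectWithNotes {
  return Object.prototype.hasOwnProperty.call(NOTES_BY_DIALECT, dialect);
}

// Unknown dialects fall back to the postgres notes instead of an empty string.
function notesFor(dialect: string): string {
  return readNotes(hasNotes(dialect) ? dialect : 'postgres');
}

console.log(notesFor('duckdb') === notesFor('postgres'));
console.log(loads);
```

Because the fallback resolves to the `postgres` key before the cache is consulted, an unrecognized dialect shares the cached `postgres` entry rather than creating its own.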
13
packages/cli/src/context/sql-analysis/dialects/bigquery.md
Normal file
@ -0,0 +1,13 @@
**bigquery** SQL conventions:
- **FQTN:** backtick-quoted `` `project.dataset.table` `` (e.g. `` `my-proj.analytics.orders` ``); backticks are required when a name contains a dash.
- **Identifiers:** backtick to quote; column and field names are case-insensitive, dataset and table names are case-sensitive.
- **Date/time:** `DATE_TRUNC(d, MONTH)`, `EXTRACT(YEAR FROM ts)`, `PARSE_DATE('%Y-%m-%d', s)`, `FORMAT_DATE('%Y-%m', d)`, `CURRENT_DATE()`.
- **Series:** build a spine with `UNNEST(GENERATE_DATE_ARRAY('2023-01-01', '2023-12-01', INTERVAL 1 MONTH))` for dates (or `GENERATE_ARRAY(1, n)` for integers), then `LEFT JOIN` the aggregated facts onto it so empty periods still appear.
- **Rolling window over time:** `RANGE` frames are numeric, so range over an integer day key — `AVG(amount) OVER (ORDER BY UNIX_DATE(day) RANGE BETWEEN 29 PRECEDING AND CURRENT ROW)` is a trailing 30-day average that tolerates gaps; or build a spine (see **Series**) and use a `ROWS` frame.
- **Safe cast:** `SAFE_CAST(x AS FLOAT64)` (or `SAFE_CAST(x AS NUMERIC)`) returns `NULL` instead of erroring on a value that does not parse, so counting residual `NULL`s among non-sentinel rows catches an encoding the sample missed.
- **Safe divide:** `SAFE_DIVIDE(num, den)` returns `NULL` instead of erroring when the denominator is `0`, so a rate/ratio/share is one expression with no `CASE den = 0` guard; multiply by `100` for a percentage. Prefer it over `num / den` for any computed measure whose denominator can be zero.
- **Top-N / windows:** `QUALIFY` filters on a window result, e.g. `QUALIFY ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...) = 1`.
- **JSON:** `JSON_VALUE(col, '$.k')` returns a scalar STRING, `JSON_QUERY(col, '$.k')` returns a subtree.
- **Nested & repeated data (ARRAY / STRUCT):** the defining BigQuery shape (e.g. GA360 `ga_sessions.hits`, GA4 `event_params`/`user_properties`). Flatten a repeated column by cross-joining `UNNEST` correlated to its row — `FROM t, UNNEST(t.hits) AS h, UNNEST(h.product) AS p` — and read STRUCT fields with dot notation (`h.page.pagePath`, `p.productRevenue`). Pull one value out of a key-value parameter array with a scalar subquery: `(SELECT ep.value.int_value FROM UNNEST(event_params) AS ep WHERE ep.key = 'page_view')`. An `UNNEST` multiplies the parent row by the array's length, so a `COUNT(*)`/`SUM` after it double-counts the parent — count the parent key with `COUNT(DISTINCT visitId)` (or aggregate *inside* the unnest); use `LEFT JOIN UNNEST(arr)` to keep rows whose array is empty.
- **Geospatial (GEOGRAPHY):** build a point with `ST_GEOGPOINT(longitude, latitude)` — **longitude first** — or parse text with `ST_GEOGFROMTEXT(wkt)` / `ST_GEOGFROMGEOJSON(s)`. Predicates: containment `ST_CONTAINS(area, pt)` / `ST_WITHIN(pt, area)` (`ST_WITHIN(a,b)=ST_CONTAINS(b,a)`); proximity `ST_DWITHIN(g1, g2, meters)` (geodesic); distance `ST_DISTANCE(g1, g2)` (meters); overlap `ST_INTERSECTS`. For areal allocation use `ST_AREA(g)` (m²) and `ST_AREA(ST_INTERSECTION(a, b))` for the overlapping area. Prefer these predicates over hand-rolled lat/lon `BETWEEN` boxes.
- **Sharded tables:** query a wildcard table `` `dataset.events_*` `` and filter the shard with the `_TABLE_SUFFIX` pseudo-column, e.g. `WHERE _TABLE_SUFFIX BETWEEN '20240101' AND '20240131'`. The wildcard spans only the shards that exist — before a measure that pins specific dates/periods, confirm the matching shards are actually present (an absent endpoint silently yields no rows, not an error).
9
packages/cli/src/context/sql-analysis/dialects/clickhouse.md
Normal file
@ -0,0 +1,9 @@
**clickhouse** SQL conventions:
- **FQTN:** `database.table` (e.g. `analytics.orders`).
- **Identifiers:** quote with backticks (`` `Order` ``) or double quotes; identifiers are case-sensitive.
- **Date/time:** native `Date`/`DateTime` types. Bucket with `toStartOfMonth(ts)`, `toStartOfDay(ts)`, `toYYYYMM(ts)`; parse with `toDate(s)` / `parseDateTimeBestEffort(s)`; format with `formatDateTime(ts, '%Y-%m')`.
- **Series:** `numbers(n)` / `range(n)` generate an integer sequence; offset a start date with `addMonths(toDate('2023-01-01'), number)` (or `arrayJoin`) to form a spine, then `LEFT JOIN` the aggregated facts onto it so empty periods still appear.
- **Rolling window over time:** a numeric range frame over a `Date` column counts in days and tolerates gaps — `AVG(amount) OVER (ORDER BY day RANGE BETWEEN 29 PRECEDING AND CURRENT ROW)` is a trailing 30-day average (use seconds for a `DateTime` key; the `INTERVAL` form is unsupported); or build a spine (see **Series**) and use a `ROWS` frame.
- **Safe cast:** `toFloat64OrNull(x)` / `toDecimal64OrNull(x, s)` returns `NULL` on a value that does not parse (the `...OrZero` variants return `0` instead), so counting residual `NULL`s among non-sentinel rows catches an encoding the sample missed.
- **Top-N / windows:** use the `LIMIT n BY key` clause for n rows per key, or rank in a CTE with `ROW_NUMBER() OVER (...)` and filter outside it.
- **JSON:** extract from a String column with `JSONExtractString(col, 'k')`, `JSONExtractInt(col, 'k')`, etc.; a native `JSON`-typed column is traversed by dot path `col.k`.
9
packages/cli/src/context/sql-analysis/dialects/mysql.md
Normal file
@ -0,0 +1,9 @@
**mysql** SQL conventions:
- **FQTN:** `database.table` (MySQL has no separate schema layer — a schema is a database).
- **Identifiers:** quote with backticks (`` `order` ``); table-name case-sensitivity follows the server filesystem, while column names are case-insensitive.
- **Date/time:** `DATE_FORMAT(ts, '%Y-%m')`, `STR_TO_DATE(s, fmt)`, `YEAR(ts)`/`MONTH(ts)`, `CURDATE()`, `NOW()`.
- **Series:** no series function — build a spine with a recursive CTE, e.g. `WITH RECURSIVE months(d) AS (SELECT '2023-01-01' UNION ALL SELECT DATE_ADD(d, INTERVAL 1 MONTH) FROM months WHERE d < '2023-12-01')`, then `LEFT JOIN` the aggregated facts onto it so empty periods still appear.
- **Rolling window over time:** a native interval range frame over a temporal order key tolerates gaps — `AVG(amount) OVER (ORDER BY day RANGE BETWEEN INTERVAL 29 DAY PRECEDING AND CURRENT ROW)` is a trailing 30-day average without a spine; guard minimum periods with `COUNT(*) OVER (<same frame>)`.
- **Safe cast:** MySQL has no `TRY_CAST`, and `CAST('abc' AS DECIMAL)` returns `0` with a warning rather than erroring — guard with a pattern test first: `CASE WHEN x REGEXP '^-?[0-9.]+$' THEN CAST(x AS DECIMAL(18,4)) END` makes a value that does not parse `NULL`, so a residual-`NULL` count catches an encoding the sample missed (`REGEXP_REPLACE` can strip symbols).
- **Top-N / windows:** rank in a CTE with `ROW_NUMBER() OVER (...)` and filter outside it; use `ORDER BY ... LIMIT n` for a global top-N.
- **JSON:** `JSON_EXTRACT(col, '$.k')`, or the `col->'$.k'` / `col->>'$.k'` shortcuts (`->>` unquotes to text).
10
packages/cli/src/context/sql-analysis/dialects/postgres.md
Normal file
@ -0,0 +1,10 @@
**postgres** SQL conventions:
- **FQTN:** `schema.table` (e.g. `public.orders`); one query targets a single database, so qualify by schema, not by database.
- **Identifiers:** unquoted names fold to lower-case; double-quote (`"Name"`) only to keep case or use a reserved word.
- **Date/time:** `date_trunc('month', ts)`, `EXTRACT(YEAR FROM ts)`, `to_char(ts, 'YYYY-MM')`, `CURRENT_DATE`; cast text to a date with `col::date`.
- **Series:** build a date/number spine with `generate_series('2023-01-01'::date, '2023-12-01'::date, interval '1 month')` (or `generate_series(1, n)` for integers), then `LEFT JOIN` the aggregated facts onto it so empty periods still appear.
- **Rolling window over time:** a native calendar-range frame spans real dates and tolerates gaps — `AVG(amount) OVER (ORDER BY day RANGE BETWEEN INTERVAL '29 days' PRECEDING AND CURRENT ROW)` is a trailing 30-day average without a spine; guard minimum periods with `COUNT(*) OVER (<same frame>)`.
- **Integer division:** `/` between two integers truncates (`5 / 2` → `2`), so a rate or `SUM(a) / COUNT(*)` silently floors to an integer; cast one operand first — `a::numeric / b` or `a * 1.0 / b` — and round only in the final projection.
- **Safe cast:** postgres has no `TRY_CAST`; guard a text-encoded number with a pattern test before casting — `CASE WHEN x ~ '^-?[0-9.]+$' THEN x::numeric END` yields `NULL` for a value that does not parse, so counting residual `NULL`s among non-sentinel rows catches an encoding the sample missed (`regexp_replace` can strip symbols, but chained `REPLACE` is the portable default).
- **Top-N / windows:** rank in a CTE with `ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...)` and filter in the outer query, or use `DISTINCT ON (key) ... ORDER BY key, ...` for one row per key.
- **JSON:** `col->'k'` returns json, `col->>'k'` returns text, deep path `col#>>'{a,b}'`; prefer `jsonb` operators on `jsonb` columns.
10
packages/cli/src/context/sql-analysis/dialects/snowflake.md
Normal file
@ -0,0 +1,10 @@
**snowflake** SQL conventions:
- **FQTN:** three-part `DATABASE.SCHEMA.TABLE` (e.g. `analytics.public.orders`).
- **Identifiers:** unquoted names fold to UPPER-case; double-quote for a case-sensitive or reserved name — `orders` resolves to `"ORDERS"`, which is a different object from `"orders"`.
- **Date/time:** `DATE_TRUNC('month', ts)`, `TO_DATE(s[, fmt])`, `DATEADD(day, -7, CURRENT_DATE)`, `CURRENT_DATE`.
- **Series:** generate rows with `TABLE(GENERATOR(ROWCOUNT => n))` and offset a start date via `DATEADD('month', SEQ4(), '2023-01-01')` (or a recursive CTE) to form a spine, then `LEFT JOIN` the aggregated facts onto it so empty periods still appear.
- **Rolling window over time:** a native interval range frame over a date/timestamp order key tolerates gaps — `AVG(amount) OVER (ORDER BY day RANGE BETWEEN INTERVAL '29 days' PRECEDING AND CURRENT ROW)` is a trailing 30-day average without a spine; guard minimum periods with `COUNT(*) OVER (<same frame>)`.
- **Safe cast:** `TRY_TO_NUMBER(x)` / `TRY_TO_DECIMAL(x, p, s)` (or `TRY_CAST(x AS NUMBER)`) returns `NULL` instead of erroring on a value that does not parse, so a residual-`NULL` count among non-sentinel rows catches an encoding the sample missed.
- **Top-N / windows:** `QUALIFY` filters on a window result without a subquery, e.g. `QUALIFY ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...) = 1`.
- **Semi-structured (VARIANT):** traverse with a colon path and cast with `::`, e.g. `src:vehicle[0].make::string`, `payload:events.date::date`; expand arrays with `LATERAL FLATTEN`.
- **Geospatial (GEOGRAPHY):** build a point with `ST_MAKEPOINT(longitude, latitude)` — **longitude first** — or `TO_GEOGRAPHY(wkt_or_geojson)`; an area polygon from a closed ring of corner points with `ST_MAKEPOLYGON(ST_MAKELINE(ARRAY_CONSTRUCT(p1, p2, …, p1)))` (repeat the first point last to close). Predicates: proximity `ST_DWITHIN(g1, g2, meters)` (geodesic) and distance `ST_DISTANCE(g1, g2)` (meters); containment `ST_CONTAINS(area, pt)` / `ST_WITHIN(pt, area)` where `ST_WITHIN(a,b)=ST_CONTAINS(b,a)`; overlap `ST_INTERSECTS`. Prefer these predicates over hand-rolled lat/lon `BETWEEN` boxes.
11
packages/cli/src/context/sql-analysis/dialects/sqlite.md
Normal file
@ -0,0 +1,11 @@
**sqlite** SQL conventions:
- **FQTN:** usually the bare `table`; `main.table` to be explicit, `attached.table` for an attached database.
- **Identifiers:** case-insensitive; double-quote (`"Name"`) to preserve a name with spaces or a keyword.
- **Date/time:** there is no native date type — values are TEXT, INTEGER, or REAL. Format and bucket with `strftime('%Y-%m', col)`, `date(col)`, `datetime(col)`, and take day differences with `julianday(a) - julianday(b)`. Confirm the stored encoding (ISO text vs Unix epoch) before comparing.
- **Series:** no series function — build a spine with a recursive CTE, e.g. `WITH RECURSIVE months(d) AS (SELECT '2023-01-01' UNION ALL SELECT date(d, '+1 month') FROM months WHERE d < '2023-12-01')`, then `LEFT JOIN` the aggregated facts onto it so empty periods still appear.
- **Rolling window over time:** there is no date-interval range frame (a `RANGE` offset needs a single numeric order key, and dates are TEXT), so build a gap-free date spine (see **Series**) and use a row frame — `AVG(amount) OVER (ORDER BY day ROWS BETWEEN 29 PRECEDING AND CURRENT ROW)` then equals a trailing 30-day average; guard minimum periods with `COUNT(*) OVER (<same frame>)`.
- **Integer division:** `/` between two integers truncates (`5 / 2` → `2`), so a rate or `SUM(a) / COUNT(*)` silently floors to an integer; force real division with `a * 1.0 / b` (or `CAST(a AS REAL) / b`) and round only in the final projection.
- **Safe cast:** sqlite has no failure-signaling cast — `CAST('abc' AS REAL)` returns `0.0` and `CAST('12abc' AS REAL)` returns `12.0` (no error, no `NULL`), so an `IS NULL` coverage check silently passes. Detect a value that did not parse with a pattern guard before casting, e.g. `CASE WHEN cleaned NOT GLOB '*[^0-9.]*' THEN CAST(cleaned AS REAL) END` (strip any leading sign first), then count the residual `NULL`s.
- **Rounding (exact half-up at `.5` boundaries):** `ROUND(x, n)` rounds half-away-from-zero, but binary floating-point stores an exact half-way value just *below* it, so the round goes the wrong way — `ROUND(6.475, 2)` returns `6.47`, not `6.48`. When a rounded measure must match exact half-up (a displayed average, rate, or price), nudge by a tiny epsilon below display precision before rounding: `ROUND(x + 1e-9, n)` lifts `6.4749999…` back to `6.475` so it rounds to `6.48` (it leaves non-boundary values unchanged). Round once, at full precision, in the final projection — never in intermediate CTEs.
- **Top-N / windows:** rank in a CTE with `ROW_NUMBER() OVER (...)` and filter in the outer query; use `ORDER BY ... LIMIT n` for a global top-N.
- **JSON:** `json_extract(col, '$.k')`, or the `col->'$.k'` / `col->>'$.k'` operators (`->>` returns text).
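The binary floating-point effect behind the **Rounding** note above is not specific to SQLite and can be reproduced in TypeScript. The sketch below uses `Number.prototype.toFixed` as a stand-in for SQL `ROUND`, so it only illustrates the representation issue and the epsilon nudge, not SQLite's own rounding routine. The helper names `roundNaive` and `roundHalfUp` are hypothetical, and the nudge as written is meant for non-negative values.

```typescript
// Format to a fixed number of decimals; toFixed works from the exact binary
// value of the number, so 6.475 (stored slightly below 6.475) goes down.
function roundNaive(x: number, digits: number): string {
  return x.toFixed(digits);
}

// Nudge by a small epsilon (1e-9, as in the note above) before rounding so a
// stored "just below the half" value is lifted back over the boundary.
function roundHalfUp(x: number, digits: number, epsilon = 1e-9): string {
  return (x + epsilon).toFixed(digits);
}

console.log(roundNaive(6.475, 2));
console.log(roundHalfUp(6.475, 2));
console.log(roundHalfUp(6.471, 2)); // a non-boundary value is unaffected
```

The epsilon has to stay well below the displayed precision, otherwise it would start moving values that are not on a half-way boundary.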
10
packages/cli/src/context/sql-analysis/dialects/tsql.md
Normal file
@ -0,0 +1,10 @@
**tsql** (SQL Server) SQL conventions:
- **FQTN:** `schema.table` (e.g. `dbo.orders`), or `database.schema.table` across databases.
- **Identifiers:** quote with square brackets (`[Order]`), or double quotes when `QUOTED_IDENTIFIER` is on; case-sensitivity is set by the database collation (commonly case-insensitive).
- **Date/time:** `DATEPART(year, ts)`, `DATEADD(day, -7, ts)`, `DATEDIFF(day, a, b)`, `CONVERT(date, ts)`, `FORMAT(ts, 'yyyy-MM')`, `GETDATE()`.
- **Series:** no series function — build a spine with a recursive CTE, e.g. `WITH months AS (SELECT CAST('2023-01-01' AS date) AS d UNION ALL SELECT DATEADD(month, 1, d) FROM months WHERE d < '2023-12-01')` (recursion stops at 100 levels by default; append `OPTION (MAXRECURSION 0)` to lift that limit for longer spines), or a numbers/tally table, then `LEFT JOIN` the aggregated facts onto it so empty periods still appear.
- **Rolling window over time:** `RANGE` supports only `UNBOUNDED`/`CURRENT ROW` (no offset frame), so build a gap-free date spine (see **Series**) and use a row frame — `AVG(amount) OVER (ORDER BY day ROWS BETWEEN 29 PRECEDING AND CURRENT ROW)` — or a date-keyed self-join on `f.day BETWEEN DATEADD(day, -29, d.day) AND d.day`; guard minimum periods with `COUNT(*) OVER (<same frame>)`.
- **Integer division:** `/` between two `int`s truncates (`5 / 2` → `2`), so a rate or `SUM(a) / COUNT(*)` silently floors to an integer; cast one operand first — `CAST(a AS decimal(18,4)) / b` or `a * 1.0 / b` — and round only in the final projection.
- **Safe cast:** `TRY_CAST(x AS DECIMAL(18,4))` (or `TRY_CONVERT(decimal(18,4), x)`) returns `NULL` instead of erroring on a value that does not parse, so a residual-`NULL` count among non-sentinel rows catches an encoding the sample missed.
- **Top-N / windows:** `SELECT TOP (n) ... ORDER BY ...` for a global top-N; for per-group, rank in a CTE with `ROW_NUMBER() OVER (...)` and filter in the outer query.
- **JSON:** `JSON_VALUE(col, '$.k')` returns a scalar, `JSON_QUERY(col, '$.k')` returns an object/array, and `OPENJSON(col)` shreds JSON into rows.
@ -1,4 +1,4 @@
-const FLAT_WIKI_KEY_PATTERN = /^[a-zA-Z0-9][a-zA-Z0-9_-]*$/;
+const FLAT_WIKI_KEY_PATTERN = /^[a-zA-Z0-9_][a-zA-Z0-9_-]*$/;

 export function suggestFlatWikiKey(key: string): string {
   const suggested = key
@ -1,7 +1,7 @@
 import { createHash } from 'node:crypto';
 import YAML from 'yaml';
 import type { KtxEmbeddingPort } from '../../context/core/embedding.js';
-import type { KtxFileStorePort } from '../../context/core/file-store.js';
+import type { KtxFileStorePort, KtxFileWriteResult } from '../../context/core/file-store.js';
 import type { KtxLogger } from '../../context/core/config.js';
 import { noopLogger } from '../../context/core/config.js';
 import type { ReindexWorkResult } from '../index-sync/types.js';
@ -232,11 +232,21 @@ export class KnowledgeWikiService {
     author: string,
     authorEmail: string,
     commitMessage?: string,
-  ): Promise<void> {
-    await this.writePage(scope, scopeId, pageKey, frontmatter, content, author, authorEmail, commitMessage);
+  ): Promise<KtxFileWriteResult> {
+    const writeResult = await this.writePage(
+      scope,
+      scopeId,
+      pageKey,
+      frontmatter,
+      content,
+      author,
+      authorEmail,
+      commitMessage,
+    );
     const serialized = this.serializePage(frontmatter, content);
     const contentHash = createHash('sha256').update(serialized).digest('hex');
     await this.syncSinglePage(scope, scopeId, pageKey, frontmatter, content, contentHash);
+    return writeResult;
   }

   // ── Index sync (files → DB) ───────────────────────────────────
@ -21,6 +21,7 @@ export interface LocalKnowledgePage {
   tags: string[];
   refs: string[];
   slRefs: string[];
+  connections: string[];
 }

 export interface LocalKnowledgeSummary {
@ -52,6 +53,7 @@ export interface WriteLocalKnowledgePageInput {
   representativeSql?: string;
   usage?: HistoricSqlWikiUsageFrontmatter;
   fingerprints?: string[];
+  connections?: string[];
 }

 const LOCAL_AUTHOR = 'ktx';
@ -75,6 +77,19 @@ function stringArray(value: unknown): string[] {
   return Array.isArray(value) ? value.filter((item): item is string => typeof item === 'string') : [];
 }

+/** Coerce a YAML scalar or list into a string list — `connections` accepts a single id or a list. */
+function stringList(value: unknown): string[] {
+  if (typeof value === 'string') {
+    return value.trim().length > 0 ? [value] : [];
+  }
+  return stringArray(value);
+}
+
+/** A page applies to `connectionId` when it is unscoped (empty) or lists that id. */
+function pageMatchesConnection(connections: string[], connectionId: string | undefined): boolean {
+  return connectionId === undefined || connections.length === 0 || connections.includes(connectionId);
+}
+
 function knowledgePath(scope: LocalKnowledgeScope, userId: string | undefined, key: string): string {
   const safeKey = assertFlatWikiKey(key);
   if (scope === 'GLOBAL') {
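The scoping rule introduced by `stringList` and `pageMatchesConnection` (unscoped pages apply everywhere; scoped pages apply only to listed connections; no requested connection means no filtering) can be checked with a small standalone sketch. The three helpers are re-declared from the hunk above so the block runs on its own; the `SketchPage` type, the sample pages, and `visibleKeys` are hypothetical.

```typescript
// Re-declared from the diff so this sketch is self-contained.
function stringArray(value: unknown): string[] {
  return Array.isArray(value) ? value.filter((item): item is string => typeof item === 'string') : [];
}

function stringList(value: unknown): string[] {
  if (typeof value === 'string') {
    return value.trim().length > 0 ? [value] : [];
  }
  return stringArray(value);
}

function pageMatchesConnection(connections: string[], connectionId: string | undefined): boolean {
  return connectionId === undefined || connections.length === 0 || connections.includes(connectionId);
}

// Hypothetical pages: `connections` is what a parsed frontmatter value would yield.
interface SketchPage {
  key: string;
  connections: string[];
}

const pages: SketchPage[] = [
  { key: 'glossary', connections: stringList(undefined) }, // field absent: unscoped
  { key: 'orders-notes', connections: stringList('warehouse') }, // single string
  { key: 'events-notes', connections: stringList(['analytics', 'warehouse']) }, // list
  { key: 'billing-notes', connections: stringList(['billing']) },
];

function visibleKeys(connectionId?: string): string[] {
  return pages.filter((page) => pageMatchesConnection(page.connections, connectionId)).map((page) => page.key);
}

console.log(visibleKeys('warehouse'));
console.log(visibleKeys('billing'));
console.log(visibleKeys());
```

A request for `warehouse` sees the unscoped page plus both pages that list `warehouse`, while a request with no connection sees every page.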
@ -104,6 +119,7 @@ function parseKnowledgePage(key: string, path: string, scope: LocalKnowledgeScop
       tags: [],
       refs: [],
       slRefs: [],
+      connections: [],
     };
   }

@ -117,6 +133,7 @@ function parseKnowledgePage(key: string, path: string, scope: LocalKnowledgeScop
     tags: stringArray(frontmatter.tags),
     refs: stringArray(frontmatter.refs),
     slRefs: stringArray(frontmatter.sl_refs),
+    connections: stringList(frontmatter.connections),
   };
 }

@ -133,6 +150,7 @@ function serializeKnowledgePage(input: WriteLocalKnowledgePageInput): string {
     ...(input.representativeSql === undefined ? {} : { representative_sql: input.representativeSql }),
     ...(input.usage === undefined ? {} : { usage: input.usage }),
     ...(input.fingerprints === undefined ? {} : { fingerprints: input.fingerprints }),
+    ...(input.connections === undefined ? {} : { connections: input.connections }),
   };
   return `---\n${YAML.stringify(frontmatter, { indent: 2, lineWidth: 0 }).trimEnd()}\n---\n\n${input.content.trim()}\n`;
 }
@ -180,7 +198,7 @@ export async function readLocalKnowledgePage(

 export async function listLocalKnowledgePages(
   project: KtxLocalProject,
-  input: { userId?: string } = {},
+  input: { userId?: string; connectionId?: string } = {},
 ): Promise<LocalKnowledgeSummary[]> {
   const userId = input.userId ?? 'local';
   const pages: LocalKnowledgeSummary[] = [];
@ -193,7 +211,7 @@ export async function listLocalKnowledgePages(
       continue;
     }
     const page = await readPageAtPath(project, key, path, scope);
-    if (page) {
+    if (page && pageMatchesConnection(page.connections, input.connectionId)) {
       pages.push({ key, path, scope, summary: page.summary });
     }
   }
@ -227,6 +245,26 @@ export async function listLocalKnowledgePageKeys(
   return [...keys].sort();
 }

+/**
+ * Connection ids referenced by any stored page's `connections` frontmatter,
+ * sorted and deduped. Derived from files; an id here that is not configured in
+ * `ktx.yaml` is a warn-only condition (config and content evolve independently)
+ * and never blocks loading, searching, or reading.
+ */
+export async function listReferencedConnectionIds(
+  project: KtxLocalProject,
+  input: { userId?: string } = {},
+): Promise<string[]> {
+  const pages = await loadAllKnowledgePages(project, { userId: input.userId });
+  const ids = new Set<string>();
+  for (const page of pages) {
+    for (const id of page.connections) {
+      ids.add(id);
+    }
+  }
+  return [...ids].sort();
+}
+
 function scorePage(page: LocalKnowledgePage, terms: string[]): number {
   const haystack = buildKnowledgeSearchText(page.key, page.summary, page.content, page.tags).toLowerCase();
   return terms.some((term) => haystack.includes(term)) ? 3 : 0;
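The doc comment on `listReferencedConnectionIds` calls a referenced-but-unconfigured id a warn-only condition. How `ktx status` performs that comparison is not shown in this part of the diff, so the sketch below is only one plausible shape: a set difference between referenced and configured ids. The function name `unconfiguredConnectionIds` and the sample ids are hypothetical.

```typescript
// Hypothetical helper: ids that pages reference but the config does not define.
function unconfiguredConnectionIds(referenced: string[], configured: string[]): string[] {
  const known = new Set(configured);
  const unique = Array.from(new Set(referenced)); // dedupe before filtering
  return unique.filter((id) => !known.has(id)).sort();
}

// Made-up inputs: `legacy_dw` is referenced by a page but missing from the config.
const referencedIds = ['warehouse', 'legacy_dw', 'analytics', 'legacy_dw'];
const configuredIds = ['analytics', 'warehouse'];

const missing = unconfiguredConnectionIds(referencedIds, configuredIds);
for (const id of missing) {
  // Warn rather than throw: config and content evolve independently.
  console.warn(`wiki pages reference connection "${id}", which is not configured`);
}
```

Returning a list instead of throwing keeps the check informational, which matches the warn-only behavior the comment describes.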
@ -266,9 +304,12 @@ function tokenLaneCandidates(pages: LocalKnowledgePage[], terms: string[]) {

 async function loadAllKnowledgePages(
   project: KtxLocalProject,
-  input: { userId?: string } = {},
+  input: { userId?: string; connectionId?: string } = {},
 ): Promise<LocalKnowledgePage[]> {
-  const summaries = await listLocalKnowledgePages(project, { userId: input.userId });
+  const summaries = await listLocalKnowledgePages(project, {
+    userId: input.userId,
+    connectionId: input.connectionId,
+  });
   const pages: LocalKnowledgePage[] = [];
   for (const summary of summaries) {
     const page = await readPageAtPath(project, summary.key, summary.path, summary.scope);
@ -281,10 +322,27 @@ async function loadAllKnowledgePages(

 async function searchLocalKnowledgePagesWithSqlite(
   project: KtxLocalProject,
-  input: { query: string; userId?: string; embeddingService?: KtxEmbeddingPort | null; limit?: number },
+  input: {
+    query: string;
+    userId?: string;
+    connectionId?: string;
+    embeddingService?: KtxEmbeddingPort | null;
+    limit?: number;
+  },
 ): Promise<LocalKnowledgeSearchResult[]> {
+  // The sqlite index is shared across connections and `index.sync` deletes any
+  // page not in its input, so sync the FULL corpus and apply the connection
+  // filter only to the candidate/result set (`allowedPaths`), never to sync.
   const pages = await loadAllKnowledgePages(project, { userId: input.userId });
-  const byPath = new Map(pages.map((page) => [page.path, page]));
+  const allowedPaths = new Set(
+    pages.filter((page) => pageMatchesConnection(page.connections, input.connectionId)).map((page) => page.path),
+  );
+  const allowedPages = pages.filter((page) => allowedPaths.has(page.path));
+  // Scope the lexical/semantic lanes inside the query so their LIMIT applies to
+  // in-scope rows; only narrow when a connection is requested (otherwise every
+  // path is allowed and the filter is a no-op).
+  const scopedPaths = input.connectionId === undefined ? undefined : [...allowedPaths];
+  const byPath = new Map(allowedPages.map((page) => [page.path, page]));
   const embeddingService = input.embeddingService ?? null;
   const index = new SqliteKnowledgeIndex({ dbPath: sqliteKnowledgeDbPath(project) });
   const existingPages = index.getExistingPages();
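The comment in the hunk above explains why the connection filter is applied to the candidate set and never to the sync input: the index is shared, and syncing a filtered list would delete every out-of-scope page. A toy model makes the difference visible. `ToyIndex`, `ToyPage`, and the sample paths are hypothetical and far simpler than `SqliteKnowledgeIndex`; the only behavior assumed from the source is that `sync` removes pages missing from its input.

```typescript
interface ToyPage {
  path: string;
  connections: string[];
}

// Hypothetical in-memory index whose sync() drops every page not in its input.
class ToyIndex {
  private readonly rows = new Map<string, ToyPage>();

  sync(pages: ToyPage[]): void {
    const incoming = new Set(pages.map((page) => page.path));
    for (const path of Array.from(this.rows.keys())) {
      if (!incoming.has(path)) {
        this.rows.delete(path);
      }
    }
    for (const page of pages) {
      this.rows.set(page.path, page);
    }
  }

  // Restrict results to allowedPaths when given; undefined means no restriction.
  search(allowedPaths?: string[]): string[] {
    const allowed = allowedPaths === undefined ? undefined : new Set(allowedPaths);
    return Array.from(this.rows.keys())
      .filter((path) => allowed === undefined || allowed.has(path))
      .sort();
  }

  size(): number {
    return this.rows.size;
  }
}

const corpus: ToyPage[] = [
  { path: 'wiki/glossary.md', connections: [] },
  { path: 'wiki/orders.md', connections: ['warehouse'] },
  { path: 'wiki/billing.md', connections: ['billing'] },
];
const inScope = (page: ToyPage, id: string): boolean =>
  page.connections.length === 0 || page.connections.includes(id);

// Approach from the diff: sync everything, filter only the candidates.
const shared = new ToyIndex();
shared.sync(corpus);
const scopedPaths = corpus.filter((page) => inScope(page, 'warehouse')).map((page) => page.path);
const results = shared.search(scopedPaths);

// Counter-example: syncing the filtered list evicts the billing page.
const clobbered = new ToyIndex();
clobbered.sync(corpus);
clobbered.sync(corpus.filter((page) => inScope(page, 'warehouse')));

console.log(results, shared.size(), clobbered.size());
```

In the first approach a later search for `billing` still finds its page; in the counter-example that page would have to be re-indexed first.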
@ -309,7 +367,7 @@ async function searchLocalKnowledgePagesWithSqlite(

   index.sync(indexPages);

-  const finalLimit = input.limit ?? Math.max(1, indexPages.length);
+  const finalLimit = input.limit ?? Math.max(1, allowedPages.length);
   const core = new HybridSearchCore();
   const generators: SearchCandidateGenerator[] = [
     {
@ -318,6 +376,7 @@ async function searchLocalKnowledgePagesWithSqlite(
|
||||||
const rows = index.searchLexicalCandidates({
|
const rows = index.searchLexicalCandidates({
|
||||||
queryText: args.queryText,
|
queryText: args.queryText,
|
||||||
limit: args.laneCandidatePoolLimit,
|
limit: args.laneCandidatePoolLimit,
|
||||||
|
allowedPaths: scopedPaths,
|
||||||
});
|
});
|
||||||
return {
|
return {
|
||||||
candidates: rows.map((row) => ({ id: row.id, rank: row.rank, rawScore: row.rawScore })),
|
candidates: rows.map((row) => ({ id: row.id, rank: row.rank, rawScore: row.rawScore })),
|
||||||
|
|
@ -327,7 +386,10 @@ async function searchLocalKnowledgePagesWithSqlite(
|
||||||
{
|
{
|
||||||
lane: 'token',
|
lane: 'token',
|
||||||
async generate(args) {
|
async generate(args) {
|
||||||
const rows = tokenLaneCandidates(pages, args.normalizedQuery.terms).slice(0, args.laneCandidatePoolLimit);
|
const rows = tokenLaneCandidates(allowedPages, args.normalizedQuery.terms).slice(
|
||||||
|
0,
|
||||||
|
args.laneCandidatePoolLimit,
|
||||||
|
);
|
||||||
return {
|
return {
|
||||||
candidates: rows.map((row, index) => ({
|
candidates: rows.map((row, index) => ({
|
||||||
id: row.page.path,
|
id: row.page.path,
|
||||||
|
|
@ -349,6 +411,7 @@ async function searchLocalKnowledgePagesWithSqlite(
|
||||||
const rows = index.searchSemanticCandidates({
|
const rows = index.searchSemanticCandidates({
|
||||||
queryEmbedding,
|
queryEmbedding,
|
||||||
limit: args.laneCandidatePoolLimit,
|
limit: args.laneCandidatePoolLimit,
|
||||||
|
allowedPaths: scopedPaths,
|
||||||
});
|
});
|
||||||
return {
|
return {
|
||||||
candidates: rows
|
candidates: rows
|
||||||
|
|
@ -387,14 +450,14 @@ async function searchLocalKnowledgePagesWithSqlite(
|
||||||
|
|
||||||
async function searchLocalKnowledgePagesWithScan(
|
async function searchLocalKnowledgePagesWithScan(
|
||||||
project: KtxLocalProject,
|
project: KtxLocalProject,
|
||||||
input: { query: string; userId?: string; limit?: number },
|
input: { query: string; userId?: string; connectionId?: string; limit?: number },
|
||||||
): Promise<LocalKnowledgeSearchResult[]> {
|
): Promise<LocalKnowledgeSearchResult[]> {
|
||||||
const terms = input.query
|
const terms = input.query
|
||||||
.toLowerCase()
|
.toLowerCase()
|
||||||
.split(/\s+/)
|
.split(/\s+/)
|
||||||
.map((term) => term.trim())
|
.map((term) => term.trim())
|
||||||
.filter(Boolean);
|
.filter(Boolean);
|
||||||
const pages = await loadAllKnowledgePages(project, { userId: input.userId });
|
const pages = await loadAllKnowledgePages(project, { userId: input.userId, connectionId: input.connectionId });
|
||||||
const results: LocalKnowledgeSearchResult[] = [];
|
const results: LocalKnowledgeSearchResult[] = [];
|
||||||
for (const page of pages) {
|
for (const page of pages) {
|
||||||
const score = scorePage(page, terms);
|
const score = scorePage(page, terms);
|
||||||
|
|
@ -416,7 +479,13 @@ async function searchLocalKnowledgePagesWithScan(
|
||||||
|
|
||||||
export async function searchLocalKnowledgePages(
|
export async function searchLocalKnowledgePages(
|
||||||
project: KtxLocalProject,
|
project: KtxLocalProject,
|
||||||
input: { query: string; userId?: string; embeddingService?: KtxEmbeddingPort | null; limit?: number },
|
input: {
|
||||||
|
query: string;
|
||||||
|
userId?: string;
|
||||||
|
connectionId?: string;
|
||||||
|
embeddingService?: KtxEmbeddingPort | null;
|
||||||
|
limit?: number;
|
||||||
|
},
|
||||||
): Promise<LocalKnowledgeSearchResult[]> {
|
): Promise<LocalKnowledgeSearchResult[]> {
|
||||||
if (project.config.storage.search === 'sqlite-fts5') {
|
if (project.config.storage.search === 'sqlite-fts5') {
|
||||||
return searchLocalKnowledgePagesWithSqlite(project, input);
|
return searchLocalKnowledgePagesWithSqlite(project, input);
|
||||||
|
|
|
||||||
|
|
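Note on the scoping rule used above: the body of `pageMatchesConnection` is not part of this diff. The following is a minimal sketch of the behaviour the commit message describes (absent or empty `connections` means unscoped, and a scoped search returns unscoped pages plus matching pages). The names `PageSketch`, `pageMatchesConnectionSketch`, and `allowedPathsFor` are hypothetical and exist only for illustration.

```typescript
// Hypothetical sketch; the real pageMatchesConnection is not shown in this diff.
interface PageSketch {
  path: string;
  connections?: string[];
}

function pageMatchesConnectionSketch(
  connections: readonly string[] | undefined,
  connectionId: string | undefined,
): boolean {
  // No connection requested: nothing is filtered out.
  if (connectionId === undefined) return true;
  // Absent or empty `connections` marks the page as unscoped, so it applies everywhere.
  if (connections === undefined || connections.length === 0) return true;
  // Otherwise the page must list the requested connection.
  return connections.includes(connectionId);
}

// Mirrors the `allowedPaths` construction in the hunk above: the full corpus is
// still loaded (and synced), and only the candidate set is narrowed.
function allowedPathsFor(pages: readonly PageSketch[], connectionId: string | undefined): Set<string> {
  return new Set(
    pages
      .filter((page) => pageMatchesConnectionSketch(page.connections, connectionId))
      .map((page) => page.path),
  );
}

const samplePages: PageSketch[] = [
  { path: 'glossary.md' },
  { path: 'orders_sales_db.md', connections: ['sales_db'] },
  { path: 'orders_hr_db.md', connections: ['hr_db'] },
];

console.log([...allowedPathsFor(samplePages, 'sales_db')]);
```

Under these assumptions a search scoped to `sales_db` sees the unscoped glossary page and the `sales_db` page, but not the `hr_db` page.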
@ -85,6 +85,22 @@ function parseEmbedding(raw: string | null): number[] | null {
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
|
/** A provided-but-empty allowlist means "no page is in scope", distinct from an absent (unfiltered) one. */
|
||||||
|
function isEmptyAllowlist(allowedPaths: readonly string[] | undefined): boolean {
|
||||||
|
return allowedPaths !== undefined && allowedPaths.length === 0;
|
||||||
|
}
|
||||||
|
|
||||||
|
/** Build a `<keyword> path IN (?, …)` fragment so the scope filter applies inside the query, before any LIMIT. */
|
||||||
|
function pathInClause(
|
||||||
|
keyword: 'AND' | 'WHERE',
|
||||||
|
allowedPaths: readonly string[] | undefined,
|
||||||
|
): { sql: string; params: string[] } {
|
||||||
|
if (allowedPaths === undefined || allowedPaths.length === 0) {
|
||||||
|
return { sql: '', params: [] };
|
||||||
|
}
|
||||||
|
return { sql: ` ${keyword} path IN (${allowedPaths.map(() => '?').join(', ')})`, params: [...allowedPaths] };
|
||||||
|
}
|
||||||
|
|
||||||
function normalizeFtsQuery(query: string): string {
|
function normalizeFtsQuery(query: string): string {
|
||||||
const terms = query
|
const terms = query
|
||||||
.toLowerCase()
|
.toLowerCase()
|
||||||
|
|
@ -217,23 +233,28 @@ export class SqliteKnowledgeIndex {
|
||||||
);
|
);
|
||||||
}
|
}
|
||||||
|
|
||||||
searchLexicalCandidates(input: { queryText: string; limit: number }): WikiSqliteLaneCandidate[] {
|
searchLexicalCandidates(input: {
|
||||||
|
queryText: string;
|
||||||
|
limit: number;
|
||||||
|
allowedPaths?: readonly string[];
|
||||||
|
}): WikiSqliteLaneCandidate[] {
|
||||||
const ftsQuery = normalizeFtsQuery(input.queryText);
|
const ftsQuery = normalizeFtsQuery(input.queryText);
|
||||||
if (!ftsQuery) {
|
if (!ftsQuery || isEmptyAllowlist(input.allowedPaths)) {
|
||||||
return [];
|
return [];
|
||||||
}
|
}
|
||||||
|
|
||||||
|
const pathFilter = pathInClause('AND', input.allowedPaths);
|
||||||
const rows = this.db
|
const rows = this.db
|
||||||
.prepare(
|
.prepare(
|
||||||
`
|
`
|
||||||
SELECT path, bm25(knowledge_pages_fts) AS rank
|
SELECT path, bm25(knowledge_pages_fts) AS rank
|
||||||
FROM knowledge_pages_fts
|
FROM knowledge_pages_fts
|
||||||
WHERE knowledge_pages_fts MATCH ?
|
WHERE knowledge_pages_fts MATCH ?${pathFilter.sql}
|
||||||
ORDER BY rank ASC, path ASC
|
ORDER BY rank ASC, path ASC
|
||||||
LIMIT ?
|
LIMIT ?
|
||||||
`,
|
`,
|
||||||
)
|
)
|
||||||
.all(ftsQuery, Math.max(1, input.limit)) as SearchRow[];
|
.all(ftsQuery, ...pathFilter.params, Math.max(1, input.limit)) as SearchRow[];
|
||||||
|
|
||||||
return rows.map((row, index) => ({
|
return rows.map((row, index) => ({
|
||||||
id: row.path,
|
id: row.path,
|
||||||
|
|
@ -243,16 +264,25 @@ export class SqliteKnowledgeIndex {
|
||||||
}));
|
}));
|
||||||
}
|
}
|
||||||
|
|
||||||
searchSemanticCandidates(input: { queryEmbedding: number[]; limit: number }): WikiSqliteLaneCandidate[] {
|
searchSemanticCandidates(input: {
|
||||||
|
queryEmbedding: number[];
|
||||||
|
limit: number;
|
||||||
|
allowedPaths?: readonly string[];
|
||||||
|
}): WikiSqliteLaneCandidate[] {
|
||||||
|
if (isEmptyAllowlist(input.allowedPaths)) {
|
||||||
|
return [];
|
||||||
|
}
|
||||||
|
|
||||||
|
const pathFilter = pathInClause('WHERE', input.allowedPaths);
|
||||||
const rows = this.db
|
const rows = this.db
|
||||||
.prepare(
|
.prepare(
|
||||||
`
|
`
|
||||||
SELECT path, embedding_json
|
SELECT path, embedding_json
|
||||||
FROM knowledge_pages
|
FROM knowledge_pages${pathFilter.sql}
|
||||||
ORDER BY path ASC
|
ORDER BY path ASC
|
||||||
`,
|
`,
|
||||||
)
|
)
|
||||||
.all() as IndexedPageRow[];
|
.all(...pathFilter.params) as IndexedPageRow[];
|
||||||
|
|
||||||
return rows
|
return rows
|
||||||
.flatMap((row) => {
|
.flatMap((row) => {
|
||||||
|
|
|
||||||
|
|
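Note on the in-query scope filter: the sketch below re-declares `isEmptyAllowlist` and `pathInClause` as they appear in the hunk above and shows the SQL text and parameter order they produce for the lexical lane. `buildLexicalQuery` is a hypothetical wrapper added only so the result can be inspected without a database; it is not part of the change.

```typescript
function isEmptyAllowlist(allowedPaths: readonly string[] | undefined): boolean {
  return allowedPaths !== undefined && allowedPaths.length === 0;
}

function pathInClause(
  keyword: 'AND' | 'WHERE',
  allowedPaths: readonly string[] | undefined,
): { sql: string; params: string[] } {
  if (allowedPaths === undefined || allowedPaths.length === 0) {
    return { sql: '', params: [] };
  }
  return { sql: ` ${keyword} path IN (${allowedPaths.map(() => '?').join(', ')})`, params: [...allowedPaths] };
}

// Hypothetical wrapper: returns the statement text and bound parameters instead of running them.
function buildLexicalQuery(
  ftsQuery: string,
  limit: number,
  allowedPaths?: readonly string[],
): { sql: string; params: (string | number)[] } | null {
  // An empty allowlist means "nothing is in scope", so no query is issued at all.
  if (!ftsQuery || isEmptyAllowlist(allowedPaths)) return null;
  const pathFilter = pathInClause('AND', allowedPaths);
  const sql =
    'SELECT path, bm25(knowledge_pages_fts) AS rank FROM knowledge_pages_fts ' +
    `WHERE knowledge_pages_fts MATCH ?${pathFilter.sql} ORDER BY rank ASC, path ASC LIMIT ?`;
  // Parameter order follows placeholder order: MATCH term, then paths, then LIMIT.
  return { sql, params: [ftsQuery, ...pathFilter.params, Math.max(1, limit)] };
}

console.log(buildLexicalQuery('orders', 5, ['a.md', 'b.md']));
```

Because the `IN (...)` fragment sits before `LIMIT`, the limit counts in-scope rows only. SQLite caps the number of bound parameters per statement (`SQLITE_MAX_VARIABLE_NUMBER`), so a very large allowlist might need batching; whether that matters here depends on corpus size.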
@ -35,6 +35,12 @@ const wikiWriteInputSchema = z.object({
|
||||||
tags: z.array(z.string()).optional(),
|
tags: z.array(z.string()).optional(),
|
||||||
refs: z.array(z.string()).optional(),
|
refs: z.array(z.string()).optional(),
|
||||||
sl_refs: z.array(z.string()).optional(),
|
sl_refs: z.array(z.string()).optional(),
|
||||||
|
connections: z
|
||||||
|
.union([z.string(), z.array(z.string())])
|
||||||
|
.optional()
|
||||||
|
.describe(
|
||||||
|
'Connection ids this page applies to. Set [connectionId] on database-specific pages (with a connection-distinctive key); omit or leave empty for org-wide content. REPLACE semantics like tags.',
|
||||||
|
),
|
||||||
source: z.string().optional(),
|
source: z.string().optional(),
|
||||||
intent: z.string().optional(),
|
intent: z.string().optional(),
|
||||||
tables: z.array(z.string()).optional(),
|
tables: z.array(z.string()).optional(),
|
||||||
|
|
@ -150,6 +156,33 @@ Keys must be flat file names, not directory paths. Use tags/source frontmatter f
|
||||||
const resolvedTags = input.tags === undefined ? existingFm?.tags : input.tags;
|
const resolvedTags = input.tags === undefined ? existingFm?.tags : input.tags;
|
||||||
const resolvedRefs = input.refs === undefined ? existingFm?.refs : input.refs;
|
const resolvedRefs = input.refs === undefined ? existingFm?.refs : input.refs;
|
||||||
const resolvedSlRefs = input.sl_refs === undefined ? existingFm?.sl_refs : input.sl_refs;
|
const resolvedSlRefs = input.sl_refs === undefined ? existingFm?.sl_refs : input.sl_refs;
|
||||||
|
const incomingConnections =
|
||||||
|
input.connections === undefined
|
||||||
|
? undefined
|
||||||
|
: typeof input.connections === 'string'
|
||||||
|
? [input.connections]
|
||||||
|
: input.connections;
|
||||||
|
const resolvedConnections = incomingConnections === undefined ? existingFm?.connections : incomingConnections;
|
||||||
|
|
||||||
|
// Data-loss guard: page keys are a flat global namespace, so a write whose
|
||||||
|
// incoming connection scope is disjoint from an existing same-key page would
|
||||||
|
// silently overwrite a different connection's page. Surface it instead.
|
||||||
|
const existingConnections = existingFm?.connections ?? [];
|
||||||
|
if (
|
||||||
|
existing &&
|
||||||
|
incomingConnections !== undefined &&
|
||||||
|
incomingConnections.length > 0 &&
|
||||||
|
existingConnections.length > 0 &&
|
||||||
|
!incomingConnections.some((id) => existingConnections.includes(id))
|
||||||
|
) {
|
||||||
|
return {
|
||||||
|
markdown:
|
||||||
|
`Error: page "${input.key}" already exists scoped to a different connection ` +
|
||||||
|
`(connections: ${existingConnections.join(', ')}); writing it for ${incomingConnections.join(', ')} ` +
|
||||||
|
`would overwrite that page. Use a connection-distinctive key (e.g. "${input.key}_${incomingConnections[0]}").`,
|
||||||
|
structured: { success: false, key: input.key },
|
||||||
|
};
|
||||||
|
}
|
||||||
|
|
||||||
let finalContent: string;
|
let finalContent: string;
|
||||||
const finalFm: WikiFrontmatter = {
|
const finalFm: WikiFrontmatter = {
|
||||||
|
|
@ -159,6 +192,7 @@ Keys must be flat file names, not directory paths. Use tags/source frontmatter f
|
||||||
tags: resolvedTags,
|
tags: resolvedTags,
|
||||||
refs: resolvedRefs,
|
refs: resolvedRefs,
|
||||||
sl_refs: resolvedSlRefs,
|
sl_refs: resolvedSlRefs,
|
||||||
|
connections: resolvedConnections,
|
||||||
source: input.source === undefined ? existingFm?.source : input.source,
|
source: input.source === undefined ? existingFm?.source : input.source,
|
||||||
intent: input.intent === undefined ? existingFm?.intent : input.intent,
|
intent: input.intent === undefined ? existingFm?.intent : input.intent,
|
||||||
tables: input.tables === undefined ? existingFm?.tables : input.tables,
|
tables: input.tables === undefined ? existingFm?.tables : input.tables,
|
||||||
|
|
|
||||||
|
|
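Note on the data-loss guard: the condition added to `wiki_write` above can be read as a small predicate over two connection lists. The sketch below isolates it under that assumption; `normalizeConnections` and `wouldClobberOtherConnection` are hypothetical names, and the caller is assumed to pass the frontmatter of an already existing page (or `undefined` when there is none).

```typescript
// Hypothetical helper: accept the single-string or list form of `connections`.
function normalizeConnections(input: string | string[] | undefined): string[] | undefined {
  if (input === undefined) return undefined;
  return typeof input === 'string' ? [input] : input;
}

// Hypothetical predicate mirroring the guard in the hunk above.
function wouldClobberOtherConnection(
  existing: readonly string[] | undefined,
  incoming: readonly string[] | undefined,
): boolean {
  const existingConnections = existing ?? [];
  return (
    incoming !== undefined && // connections omitted: existing scope is kept, no conflict
    incoming.length > 0 && // writing an unscoped page is allowed
    existingConnections.length > 0 && // scoping a previously unscoped page is allowed
    !incoming.some((id) => existingConnections.includes(id)) // no shared connection id
  );
}

console.log(wouldClobberOtherConnection(['sales_db'], normalizeConnections('hr_db')));
```

Only the fully disjoint case is rejected; a write that shares at least one connection id with the existing page goes through and, given the REPLACE semantics described in the schema, replaces the stored list.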
@ -16,6 +16,12 @@ export interface WikiFrontmatter {
|
||||||
tags?: string[];
|
tags?: string[];
|
||||||
refs?: string[];
|
refs?: string[];
|
||||||
sl_refs?: string[];
|
sl_refs?: string[];
|
||||||
|
/**
|
||||||
|
* Connection ids this page applies to. Absent or empty ⇒ unscoped: the page
|
||||||
|
* applies to all connections. Additive metadata, orthogonal to GLOBAL/USER
|
||||||
|
* scope; it does not namespace page keys.
|
||||||
|
*/
|
||||||
|
connections?: string[];
|
||||||
usage_mode: 'always' | 'auto' | 'never';
|
usage_mode: 'always' | 'auto' | 'never';
|
||||||
sort_order?: number;
|
sort_order?: number;
|
||||||
source?: string;
|
source?: string;
|
||||||
|
|
|
||||||
|
|
@ -1,6 +1,7 @@
|
||||||
import { KtxIngestEmbeddingPortAdapter } from './context/llm/embedding-port.js';
|
import { KtxIngestEmbeddingPortAdapter } from './context/llm/embedding-port.js';
|
||||||
import type { KtxEmbeddingPort } from './context/core/embedding.js';
|
import type { KtxEmbeddingPort } from './context/core/embedding.js';
|
||||||
import { loadKtxProject } from './context/project/project.js';
|
import { loadKtxProject } from './context/project/project.js';
|
||||||
|
import { assertConfiguredConnectionId } from './context/connections/configured-connections.js';
|
||||||
import {
|
import {
|
||||||
type LocalKnowledgeSearchResult,
|
type LocalKnowledgeSearchResult,
|
||||||
type LocalKnowledgeSummary,
|
type LocalKnowledgeSummary,
|
||||||
|
|
@ -17,12 +18,21 @@ import { createRankBadgeFormatter, printList, type PrintListColumn } from './io/
|
||||||
import { emitTelemetryEvent } from './telemetry/index.js';
|
import { emitTelemetryEvent } from './telemetry/index.js';
|
||||||
|
|
||||||
export type KtxKnowledgeArgs =
|
export type KtxKnowledgeArgs =
|
||||||
| { command: 'list'; projectDir: string; userId: string; output?: string; json?: boolean; cliVersion: string }
|
| {
|
||||||
|
command: 'list';
|
||||||
|
projectDir: string;
|
||||||
|
userId: string;
|
||||||
|
connectionId?: string;
|
||||||
|
output?: string;
|
||||||
|
json?: boolean;
|
||||||
|
cliVersion: string;
|
||||||
|
}
|
||||||
| {
|
| {
|
||||||
command: 'search';
|
command: 'search';
|
||||||
projectDir: string;
|
projectDir: string;
|
||||||
query: string;
|
query: string;
|
||||||
userId: string;
|
userId: string;
|
||||||
|
connectionId?: string;
|
||||||
output?: string;
|
output?: string;
|
||||||
json?: boolean;
|
json?: boolean;
|
||||||
limit?: number;
|
limit?: number;
|
||||||
|
|
@ -120,7 +130,14 @@ export async function runKtxKnowledge(
|
||||||
try {
|
try {
|
||||||
const project = await loadKtxProject({ projectDir: args.projectDir });
|
const project = await loadKtxProject({ projectDir: args.projectDir });
|
||||||
if (args.command === 'list') {
|
if (args.command === 'list') {
|
||||||
const pages = await listLocalKnowledgePages(project, { userId: args.userId });
|
const connectionId =
|
||||||
|
args.connectionId === undefined
|
||||||
|
? undefined
|
||||||
|
: assertConfiguredConnectionId(project.config.connections, args.connectionId);
|
||||||
|
const pages = await listLocalKnowledgePages(project, {
|
||||||
|
userId: args.userId,
|
||||||
|
...(connectionId !== undefined ? { connectionId } : {}),
|
||||||
|
});
|
||||||
const mode = resolveOutputMode({ explicit: args.output, json: args.json, io });
|
const mode = resolveOutputMode({ explicit: args.output, json: args.json, io });
|
||||||
printList<LocalKnowledgeSummary>({
|
printList<LocalKnowledgeSummary>({
|
||||||
rows: pages,
|
rows: pages,
|
||||||
|
|
@ -145,6 +162,10 @@ export async function runKtxKnowledge(
|
||||||
return 0;
|
return 0;
|
||||||
}
|
}
|
||||||
if (args.command === 'search') {
|
if (args.command === 'search') {
|
||||||
|
const connectionId =
|
||||||
|
args.connectionId === undefined
|
||||||
|
? undefined
|
||||||
|
: assertConfiguredConnectionId(project.config.connections, args.connectionId);
|
||||||
const embeddingService = await wikiSearchEmbeddingService(project, deps, { cliVersion: args.cliVersion }, io);
|
const embeddingService = await wikiSearchEmbeddingService(project, deps, { cliVersion: args.cliVersion }, io);
|
||||||
const search = deps.searchLocalKnowledgePages ?? defaultSearchLocalKnowledgePages;
|
const search = deps.searchLocalKnowledgePages ?? defaultSearchLocalKnowledgePages;
|
||||||
const results = await search(project, {
|
const results = await search(project, {
|
||||||
|
|
@ -152,6 +173,7 @@ export async function runKtxKnowledge(
|
||||||
userId: args.userId,
|
userId: args.userId,
|
||||||
embeddingService,
|
embeddingService,
|
||||||
limit: args.limit,
|
limit: args.limit,
|
||||||
|
...(connectionId !== undefined ? { connectionId } : {}),
|
||||||
});
|
});
|
||||||
await emitTelemetryEvent({
|
await emitTelemetryEvent({
|
||||||
name: 'wiki_query_completed',
|
name: 'wiki_query_completed',
|
||||||
|
|
|
||||||
|
|
@ -5,6 +5,7 @@ import type { McpServer } from '@modelcontextprotocol/sdk/server/mcp.js';
|
||||||
import { StreamableHTTPServerTransport } from '@modelcontextprotocol/sdk/server/streamableHttp.js';
|
import { StreamableHTTPServerTransport } from '@modelcontextprotocol/sdk/server/streamableHttp.js';
|
||||||
import { isInitializeRequest } from '@modelcontextprotocol/sdk/types.js';
|
import { isInitializeRequest } from '@modelcontextprotocol/sdk/types.js';
|
||||||
import { getKtxCliPackageInfo, type KtxCliIo } from './cli-runtime.js';
|
import { getKtxCliPackageInfo, type KtxCliIo } from './cli-runtime.js';
|
||||||
|
import { createMcpLogger, serializeMcpError } from './context/mcp/logger.js';
|
||||||
import { createKtxMcpServerFactory } from './mcp-server-factory.js';
|
import { createKtxMcpServerFactory } from './mcp-server-factory.js';
|
||||||
|
|
||||||
const DEFAULT_ALLOWED_HOSTS = ['localhost', '127.0.0.1', '::1'] as const;
|
const DEFAULT_ALLOWED_HOSTS = ['localhost', '127.0.0.1', '::1'] as const;
|
||||||
|
|
@ -173,6 +174,9 @@ export async function runKtxMcpHttpServer(options: RunKtxMcpHttpServerOptions):
|
||||||
options.createMcpServer === undefined
|
options.createMcpServer === undefined
|
||||||
? await (options.loadProject ?? loadKtxProject)({ projectDir: options.projectDir })
|
? await (options.loadProject ?? loadKtxProject)({ projectDir: options.projectDir })
|
||||||
: undefined;
|
: undefined;
|
||||||
|
// One logger per process, shared by the tool layer (via the factory) and the
|
||||||
|
// transport lifecycle below. Falls back to a no-op sink for programmatic callers.
|
||||||
|
const logger = createMcpLogger(options.io ?? { stdout: { write() {} }, stderr: { write() {} } });
|
||||||
const createMcpServer =
|
const createMcpServer =
|
||||||
options.createMcpServer ??
|
options.createMcpServer ??
|
||||||
(await createKtxMcpServerFactory({
|
(await createKtxMcpServerFactory({
|
||||||
|
|
@ -180,6 +184,7 @@ export async function runKtxMcpHttpServer(options: RunKtxMcpHttpServerOptions):
|
||||||
projectDir: options.projectDir,
|
projectDir: options.projectDir,
|
||||||
cliVersion: options.cliVersion ?? getKtxCliPackageInfo().version,
|
cliVersion: options.cliVersion ?? getKtxCliPackageInfo().version,
|
||||||
io: options.io,
|
io: options.io,
|
||||||
|
logger,
|
||||||
}));
|
}));
|
||||||
const sessions = new Map<string, StreamableHTTPServerTransport>();
|
const sessions = new Map<string, StreamableHTTPServerTransport>();
|
||||||
|
|
||||||
|
|
@ -189,6 +194,7 @@ export async function runKtxMcpHttpServer(options: RunKtxMcpHttpServerOptions):
|
||||||
sessionIdGenerator: () => randomUUID(),
|
sessionIdGenerator: () => randomUUID(),
|
||||||
onsessioninitialized: (sessionId) => {
|
onsessioninitialized: (sessionId) => {
|
||||||
sessions.set(sessionId, transport);
|
sessions.set(sessionId, transport);
|
||||||
|
logger.info({ sessionId }, 'session.open');
|
||||||
},
|
},
|
||||||
onsessionclosed: (sessionId) => {
|
onsessionclosed: (sessionId) => {
|
||||||
sessions.delete(sessionId);
|
sessions.delete(sessionId);
|
||||||
|
|
@ -197,15 +203,25 @@ export async function runKtxMcpHttpServer(options: RunKtxMcpHttpServerOptions):
|
||||||
allowedOrigins: config.allowedOrigins,
|
allowedOrigins: config.allowedOrigins,
|
||||||
enableDnsRebindingProtection: true,
|
enableDnsRebindingProtection: true,
|
||||||
});
|
});
|
||||||
|
// onclose is the universal session-end signal (clean DELETE and dropped connection both
|
||||||
|
// close the transport), so session.close is logged here rather than in onsessionclosed.
|
||||||
transport.onclose = () => {
|
transport.onclose = () => {
|
||||||
if (transport.sessionId) {
|
if (transport.sessionId) {
|
||||||
sessions.delete(transport.sessionId);
|
sessions.delete(transport.sessionId);
|
||||||
|
logger.info({ sessionId: transport.sessionId }, 'session.close');
|
||||||
}
|
}
|
||||||
};
|
};
|
||||||
|
transport.onerror = (error) => {
|
||||||
|
logger.error(
|
||||||
|
{ ...(transport.sessionId ? { sessionId: transport.sessionId } : {}), err: serializeMcpError(error) },
|
||||||
|
'transport.error',
|
||||||
|
);
|
||||||
|
};
|
||||||
await createMcpServer().connect(transport);
|
await createMcpServer().connect(transport);
|
||||||
return transport;
|
return transport;
|
||||||
}
|
}
|
||||||
|
|
||||||
|
const startedAt = performance.now();
|
||||||
const server = createServer(async (req, res) => {
|
const server = createServer(async (req, res) => {
|
||||||
const path = requestPath(req);
|
const path = requestPath(req);
|
||||||
const auth = isMcpRequestAuthorized({ path, headers: req.headers }, config);
|
const auth = isMcpRequestAuthorized({ path, headers: req.headers }, config);
|
||||||
|
|
@ -216,7 +232,8 @@ export async function runKtxMcpHttpServer(options: RunKtxMcpHttpServerOptions):
|
||||||
|
|
||||||
if (path === '/health' && req.method === 'GET') {
|
if (path === '/health' && req.method === 'GET') {
|
||||||
const port = listenerPort(server, config.port);
|
const port = listenerPort(server, config.port);
|
||||||
writeJson(res, 200, { status: 'ok', projectDir: options.projectDir, port });
|
const uptimeMs = Math.round(performance.now() - startedAt);
|
||||||
|
writeJson(res, 200, { status: 'ok', projectDir: options.projectDir, port, uptimeMs });
|
||||||
return;
|
return;
|
||||||
}
|
}
|
||||||
|
|
||||||
|
|
|
||||||
|
|
@ -2,6 +2,9 @@ import { KtxIngestEmbeddingPortAdapter } from './context/llm/embedding-port.js';
|
||||||
import { createDefaultKtxMcpServer } from './context/mcp/server.js';
|
import { createDefaultKtxMcpServer } from './context/mcp/server.js';
|
||||||
import { createLocalProjectMcpContextPorts } from './context/mcp/local-project-ports.js';
|
import { createLocalProjectMcpContextPorts } from './context/mcp/local-project-ports.js';
|
||||||
import { createLocalProjectMemoryIngest } from './context/memory/local-memory.js';
|
import { createLocalProjectMemoryIngest } from './context/memory/local-memory.js';
|
||||||
|
import { assertConfiguredConnectionId } from './context/connections/configured-connections.js';
|
||||||
|
import type { KtxMcpLogger } from './context/mcp/logger.js';
|
||||||
|
import type { MemoryIngestPort } from './context/mcp/types.js';
|
||||||
import type { KtxLocalProject } from './context/project/project.js';
|
import type { KtxLocalProject } from './context/project/project.js';
|
||||||
import type { McpServer } from '@modelcontextprotocol/sdk/server/mcp.js';
|
import type { McpServer } from '@modelcontextprotocol/sdk/server/mcp.js';
|
||||||
import type { KtxCliIo } from './cli-runtime.js';
|
import type { KtxCliIo } from './cli-runtime.js';
|
||||||
|
|
@ -23,6 +26,7 @@ export async function createKtxMcpServerFactory(input: {
|
||||||
projectDir: string;
|
projectDir: string;
|
||||||
cliVersion: string;
|
cliVersion: string;
|
||||||
io?: KtxCliIo;
|
io?: KtxCliIo;
|
||||||
|
logger?: KtxMcpLogger;
|
||||||
}): Promise<() => McpServer> {
|
}): Promise<() => McpServer> {
|
||||||
const io = input.io ?? noopMcpIo();
|
const io = input.io ?? noopMcpIo();
|
||||||
const queryExecutor = createKtxCliIngestQueryExecutor(input.project);
|
const queryExecutor = createKtxCliIngestQueryExecutor(input.project);
|
||||||
|
|
@ -57,13 +61,25 @@ export async function createKtxMcpServerFactory(input: {
|
||||||
},
|
},
|
||||||
});
|
});
|
||||||
|
|
||||||
let memoryIngest: ReturnType<typeof createLocalProjectMemoryIngest> | undefined;
|
let memoryIngest: MemoryIngestPort | undefined;
|
||||||
try {
|
try {
|
||||||
memoryIngest = createLocalProjectMemoryIngest(input.project, {
|
const baseMemoryIngest = createLocalProjectMemoryIngest(input.project, {
|
||||||
semanticLayerCompute,
|
semanticLayerCompute,
|
||||||
queryExecutor,
|
queryExecutor,
|
||||||
embeddingProvider,
|
embeddingProvider,
|
||||||
});
|
});
|
||||||
|
// Validate the explicit connectionId argument here so a typo is rejected with the
|
||||||
|
// configured ids before the ingest run starts; persisted page scope is validated
|
||||||
|
// separately (warn-only) and must not fail.
|
||||||
|
memoryIngest = {
|
||||||
|
ingest: (ingestInput) => {
|
||||||
|
if (ingestInput.connectionId !== undefined) {
|
||||||
|
assertConfiguredConnectionId(input.project.config.connections, ingestInput.connectionId);
|
||||||
|
}
|
||||||
|
return baseMemoryIngest.ingest(ingestInput);
|
||||||
|
},
|
||||||
|
status: (runId) => baseMemoryIngest.status(runId),
|
||||||
|
};
|
||||||
} catch (error) {
|
} catch (error) {
|
||||||
io.stderr.write(`ktx MCP memory_ingest disabled: ${error instanceof Error ? error.message : String(error)}\n`);
|
io.stderr.write(`ktx MCP memory_ingest disabled: ${error instanceof Error ? error.message : String(error)}\n`);
|
||||||
}
|
}
|
||||||
|
|
@ -75,6 +91,7 @@ export async function createKtxMcpServerFactory(input: {
|
||||||
userContext: { userId: 'local' },
|
userContext: { userId: 'local' },
|
||||||
projectDir: input.projectDir,
|
projectDir: input.projectDir,
|
||||||
io,
|
io,
|
||||||
|
...(input.logger ? { logger: input.logger } : {}),
|
||||||
contextTools: {
|
contextTools: {
|
||||||
...contextTools,
|
...contextTools,
|
||||||
...(memoryIngest ? { memoryIngest } : {}),
|
...(memoryIngest ? { memoryIngest } : {}),
|
||||||
|
|
|
||||||
|
|
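Note on the `memory_ingest` wrapper: the change above wraps the base ingest port so that an explicit `connectionId` is checked before the run starts. The sketch below shows that shape in isolation. `assertConfiguredConnectionId` itself is not shown in this diff, so `assertConfiguredConnectionIdSketch`, the record-keyed-by-id config shape, the error wording, and the `*Sketch` interfaces are all assumptions made for illustration.

```typescript
// Hypothetical types; the real MemoryIngestPort is not shown in this diff.
interface IngestInputSketch {
  connectionId?: string;
  text: string;
}

interface MemoryIngestPortSketch {
  ingest(input: IngestInputSketch): Promise<{ runId: string }>;
  status(runId: string): Promise<string>;
}

// Assumption: configured connections are a record keyed by connection id.
function assertConfiguredConnectionIdSketch(configured: Record<string, unknown>, connectionId: string): string {
  if (!Object.prototype.hasOwnProperty.call(configured, connectionId)) {
    const known = Object.keys(configured).sort().join(', ') || '(none)';
    throw new Error(`Unknown connection id "${connectionId}". Configured: ${known}`);
  }
  return connectionId;
}

// Validate the explicit argument first, then delegate unchanged to the base port.
function withConnectionValidation(
  base: MemoryIngestPortSketch,
  configured: Record<string, unknown>,
): MemoryIngestPortSketch {
  return {
    ingest: (input) => {
      if (input.connectionId !== undefined) {
        assertConfiguredConnectionIdSketch(configured, input.connectionId);
      }
      return base.ingest(input);
    },
    status: (runId) => base.status(runId),
  };
}
```

Keeping the check in a wrapper means the base ingest implementation needs no knowledge of the project config, and a typo fails before any ingest work is queued.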
@ -4,6 +4,7 @@ import { loadKtxProject } from './context/project/project.js';
|
||||||
import type { McpServer } from '@modelcontextprotocol/sdk/server/mcp.js';
|
import type { McpServer } from '@modelcontextprotocol/sdk/server/mcp.js';
|
||||||
import { StdioServerTransport } from '@modelcontextprotocol/sdk/server/stdio.js';
|
import { StdioServerTransport } from '@modelcontextprotocol/sdk/server/stdio.js';
|
||||||
import { getKtxCliPackageInfo, type KtxCliIo } from './cli-runtime.js';
|
import { getKtxCliPackageInfo, type KtxCliIo } from './cli-runtime.js';
|
||||||
|
import { createMcpLogger, serializeMcpError } from './context/mcp/logger.js';
|
||||||
import { createKtxMcpServerFactory } from './mcp-server-factory.js';
|
import { createKtxMcpServerFactory } from './mcp-server-factory.js';
|
||||||
|
|
||||||
export interface RunKtxMcpStdioServerOptions {
|
export interface RunKtxMcpStdioServerOptions {
|
||||||
|
|
@ -25,6 +26,8 @@ export async function runKtxMcpStdioServer(options: RunKtxMcpStdioServerOptions)
|
||||||
stdout: { write() {} },
|
stdout: { write() {} },
|
||||||
stderr: options.io?.stderr ?? process.stderr,
|
stderr: options.io?.stderr ?? process.stderr,
|
||||||
};
|
};
|
||||||
|
// stdout is reserved for JSON-RPC, so the logger writes to stderr only.
|
||||||
|
const logger = createMcpLogger(protocolIo);
|
||||||
const createMcpServer =
|
const createMcpServer =
|
||||||
options.createMcpServer ??
|
options.createMcpServer ??
|
||||||
(await createKtxMcpServerFactory({
|
(await createKtxMcpServerFactory({
|
||||||
|
|
@ -32,6 +35,7 @@ export async function runKtxMcpStdioServer(options: RunKtxMcpStdioServerOptions)
|
||||||
projectDir: options.projectDir,
|
projectDir: options.projectDir,
|
||||||
cliVersion: options.cliVersion ?? getKtxCliPackageInfo().version,
|
cliVersion: options.cliVersion ?? getKtxCliPackageInfo().version,
|
||||||
io: protocolIo,
|
io: protocolIo,
|
||||||
|
logger,
|
||||||
}));
|
}));
|
||||||
const stdin = options.stdin ?? process.stdin;
|
const stdin = options.stdin ?? process.stdin;
|
||||||
const transport = new StdioServerTransport(stdin, options.stdout);
|
const transport = new StdioServerTransport(stdin, options.stdout);
|
||||||
|
|
@ -50,13 +54,17 @@ export async function runKtxMcpStdioServer(options: RunKtxMcpStdioServerOptions)
|
||||||
settle(() => reject(error instanceof Error ? error : new Error(String(error))));
|
settle(() => reject(error instanceof Error ? error : new Error(String(error))));
|
||||||
});
|
});
|
||||||
};
|
};
|
||||||
transport.onclose = () => settle(resolve);
|
transport.onclose = () => {
|
||||||
|
logger.info({}, 'session.close');
|
||||||
|
settle(resolve);
|
||||||
|
};
|
||||||
transport.onerror = (error) => {
|
transport.onerror = (error) => {
|
||||||
options.io?.stderr.write(`ktx MCP stdio transport error: ${error.message}\n`);
|
logger.error({ err: serializeMcpError(error) }, 'transport.error');
|
||||||
settle(() => reject(error));
|
settle(() => reject(error));
|
||||||
};
|
};
|
||||||
stdin.once('end', closeTransport);
|
stdin.once('end', closeTransport);
|
||||||
stdin.once('close', closeTransport);
|
stdin.once('close', closeTransport);
|
||||||
|
logger.info({}, 'session.open');
|
||||||
createMcpServer().connect(transport).catch((error: unknown) => {
|
createMcpServer().connect(transport).catch((error: unknown) => {
|
||||||
settle(() => reject(error instanceof Error ? error : new Error(String(error))));
|
settle(() => reject(error instanceof Error ? error : new Error(String(error))));
|
||||||
});
|
});
|
||||||
|
|
|
||||||
|
|
@ -46,7 +46,7 @@ const NOTION_SCRIPTED_MODE_HINT =
|
||||||
'Notion picker requires a TTY. Use --no-input --notion-root-page-id <UUID> for scripted mode.';
|
'Notion picker requires a TTY. Use --no-input --notion-root-page-id <UUID> for scripted mode.';
|
||||||
|
|
||||||
function assertSafeNotionPickerConnectionId(connectionId: string): void {
|
function assertSafeNotionPickerConnectionId(connectionId: string): void {
|
||||||
if (!/^[a-zA-Z0-9][a-zA-Z0-9_-]*$/.test(connectionId)) {
|
if (!/^[a-zA-Z0-9_][a-zA-Z0-9_-]*$/.test(connectionId)) {
|
||||||
throw new Error(`Unsafe connection id: ${connectionId}`);
|
throw new Error(`Unsafe connection id: ${connectionId}`);
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
|
||||||
|
|
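Note on the connection id pattern: the only difference between the two expressions in the hunk above is that the first character may now be an underscore. The sketch below compares them on a few sample ids; the sample values and the `isSafeIdOld`/`isSafeIdNew` names are illustrative only.

```typescript
const OLD_PATTERN = /^[a-zA-Z0-9][a-zA-Z0-9_-]*$/;
const NEW_PATTERN = /^[a-zA-Z0-9_][a-zA-Z0-9_-]*$/;

// Hypothetical helpers wrapping the two patterns from the diff.
function isSafeIdOld(connectionId: string): boolean {
  return OLD_PATTERN.test(connectionId);
}

function isSafeIdNew(connectionId: string): boolean {
  return NEW_PATTERN.test(connectionId);
}

// Illustrative sample ids, including a leading underscore and a path-like value.
for (const id of ['sales_db', '_staging', '-prod', '../etc']) {
  console.log(id, isSafeIdOld(id), isSafeIdNew(id));
}
```

A leading hyphen, path separators, dots, and the empty string are still rejected by both patterns, so the relaxation does not admit path-like ids.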
@ -19,6 +19,8 @@ A single artifact typically produces multiple actions: one SL source per table/v
|
||||||
|
|
||||||
<scope>
|
<scope>
|
||||||
All wiki writes go to the GLOBAL scope - they will be visible to every user of this ktx project. Phrase wiki pages as objective business knowledge, not personal preference. The `wiki_write` tool handles scope selection automatically for external ingest.
|
All wiki writes go to the GLOBAL scope - they will be visible to every user of this ktx project. Phrase wiki pages as objective business knowledge, not personal preference. The `wiki_write` tool handles scope selection automatically for external ingest.
|
||||||
|
|
||||||
|
When a `connectionId` is shown in the prompt context, tag database-specific pages with `connections: [<that id>]` and give them connection-distinctive keys (`orders_sales_db`, not `orders`) so same-concept pages from other databases do not collide or pollute each other's searches. Leave `connections` empty for org-wide knowledge that applies across every database. See the `wiki_capture` skill's "Connection scoping" section.
|
||||||
</scope>
|
</scope>
|
||||||
|
|
||||||
<do_not>
|
<do_not>
|
||||||
|
|
|
||||||
|
|
@ -20,7 +20,7 @@ import {
|
||||||
import { createAggregateProgressPort } from './progress-port-adapter.js';
|
import { createAggregateProgressPort } from './progress-port-adapter.js';
|
||||||
import { resolvePublicIngestRuntimeRequirements } from './runtime-requirements.js';
|
import { resolvePublicIngestRuntimeRequirements } from './runtime-requirements.js';
|
||||||
import type { KtxScanArgs, KtxScanDeps } from './scan.js';
|
import type { KtxScanArgs, KtxScanDeps } from './scan.js';
|
||||||
import type { KtxTableRef } from './context/scan/types.js';
|
import type { KtxScanEnrichmentStage, KtxTableRef } from './context/scan/types.js';
|
||||||
import { profileMark } from './startup-profile.js';
|
import { profileMark } from './startup-profile.js';
|
||||||
import { isDemoConnection } from './telemetry/demo-detect.js';
|
import { isDemoConnection } from './telemetry/demo-detect.js';
|
||||||
import { emitProjectStackSnapshot, emitTelemetryEvent, reportException } from './telemetry/index.js';
|
import { emitProjectStackSnapshot, emitTelemetryEvent, reportException } from './telemetry/index.js';
|
||||||
|
|
@ -46,6 +46,7 @@ export type KtxPublicIngestArgs =
|
||||||
queryHistory?: KtxPublicIngestQueryHistoryFlag;
|
queryHistory?: KtxPublicIngestQueryHistoryFlag;
|
||||||
queryHistoryWindowDays?: number;
|
queryHistoryWindowDays?: number;
|
||||||
scanMode?: Extract<KtxScanArgs, { command: 'run' }>['mode'];
|
scanMode?: Extract<KtxScanArgs, { command: 'run' }>['mode'];
|
||||||
|
stages?: KtxScanEnrichmentStage[];
|
||||||
detectRelationships?: boolean;
|
detectRelationships?: boolean;
|
||||||
cliVersion?: string;
|
cliVersion?: string;
|
||||||
runtimeInstallPolicy?: KtxManagedPythonInstallPolicy;
|
runtimeInstallPolicy?: KtxManagedPythonInstallPolicy;
|
||||||
|
|
@ -123,6 +124,7 @@ interface KtxPublicContextBuildArgs {
|
||||||
queryHistory?: KtxPublicIngestQueryHistoryFlag;
|
queryHistory?: KtxPublicIngestQueryHistoryFlag;
|
||||||
queryHistoryWindowDays?: number;
|
queryHistoryWindowDays?: number;
|
||||||
scanMode?: Extract<KtxScanArgs, { command: 'run' }>['mode'];
|
scanMode?: Extract<KtxScanArgs, { command: 'run' }>['mode'];
|
||||||
|
stages?: KtxScanEnrichmentStage[];
|
||||||
detectRelationships?: boolean;
|
detectRelationships?: boolean;
|
||||||
cliVersion?: string;
|
cliVersion?: string;
|
||||||
runtimeInstallPolicy?: KtxManagedPythonInstallPolicy;
|
runtimeInstallPolicy?: KtxManagedPythonInstallPolicy;
|
||||||
|
|
@ -974,6 +976,7 @@ async function runIngestTargetSteps(
|
||||||
mode: 'enriched',
|
mode: 'enriched',
|
||||||
detectRelationships: target.detectRelationships === true,
|
detectRelationships: target.detectRelationships === true,
|
||||||
dryRun: false,
|
dryRun: false,
|
||||||
|
...(args.stages ? { stages: args.stages } : {}),
|
||||||
...(args.cliVersion ? { cliVersion: args.cliVersion } : {}),
|
...(args.cliVersion ? { cliVersion: args.cliVersion } : {}),
|
||||||
...(args.runtimeInstallPolicy ? { runtimeInstallPolicy: args.runtimeInstallPolicy } : {}),
|
...(args.runtimeInstallPolicy ? { runtimeInstallPolicy: args.runtimeInstallPolicy } : {}),
|
||||||
};
|
};
|
||||||
|
|
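Note on the hunk above: `stages` is forwarded with a conditional spread, so the key is left out entirely when the caller did not pass it, rather than being set to `undefined`. A minimal sketch of that pattern follows. The `Stage` alias, the `ScanRequest` shape, `buildScanRequest`, and the stage name `'descriptions'` are all hypothetical stand-ins; the real `KtxScanEnrichmentStage` union is not shown in this diff.

```typescript
// Stand-in for KtxScanEnrichmentStage, whose members are not visible in this diff.
type Stage = string;

// Hypothetical, trimmed-down request shape for illustration only.
interface ScanRequest {
  mode: 'enriched';
  dryRun: boolean;
  stages?: Stage[];
}

// Mirrors the `...(args.stages ? { stages: args.stages } : {})` idiom from the hunk.
function buildScanRequest(args: { stages?: Stage[] }): ScanRequest {
  return {
    mode: 'enriched',
    dryRun: false,
    // Spread an empty object when `stages` is absent so the key never appears.
    ...(args.stages ? { stages: args.stages } : {}),
  };
}

const allStages = buildScanRequest({});
const selective = buildScanRequest({ stages: ['descriptions'] }); // hypothetical stage name

console.log('stages' in allStages); // false: the key is omitted, not set to undefined
console.log(selective.stages);
```

Omitting the key keeps "run all eligible stages" (no `stages` at all) distinguishable from an explicit selection, which matches the doc comment added on `KtxScanArgs.stages` later in this diff.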
@ -1153,6 +1156,7 @@ export async function runKtxPublicIngest(
|
||||||
...(args.queryHistory ? { queryHistory: args.queryHistory } : {}),
|
...(args.queryHistory ? { queryHistory: args.queryHistory } : {}),
|
||||||
...(args.queryHistoryWindowDays !== undefined ? { queryHistoryWindowDays: args.queryHistoryWindowDays } : {}),
|
...(args.queryHistoryWindowDays !== undefined ? { queryHistoryWindowDays: args.queryHistoryWindowDays } : {}),
|
||||||
...(args.scanMode ? { scanMode: args.scanMode } : {}),
|
...(args.scanMode ? { scanMode: args.scanMode } : {}),
|
||||||
|
...(args.stages ? { stages: args.stages } : {}),
|
||||||
...(args.detectRelationships !== undefined ? { detectRelationships: args.detectRelationships } : {}),
|
...(args.detectRelationships !== undefined ? { detectRelationships: args.detectRelationships } : {}),
|
||||||
...(args.cliVersion ? { cliVersion: args.cliVersion } : {}),
|
...(args.cliVersion ? { cliVersion: args.cliVersion } : {}),
|
||||||
...(args.runtimeInstallPolicy ? { runtimeInstallPolicy: args.runtimeInstallPolicy } : {}),
|
...(args.runtimeInstallPolicy ? { runtimeInstallPolicy: args.runtimeInstallPolicy } : {}),
|
||||||
|
|
|
||||||
|
|
@ -1,4 +1,10 @@
|
||||||
import type { KtxProgressPort, KtxScanMode, KtxScanReport, KtxScanWarning } from './context/scan/types.js';
|
import type {
|
||||||
|
KtxProgressPort,
|
||||||
|
KtxScanEnrichmentStage,
|
||||||
|
KtxScanMode,
|
||||||
|
KtxScanReport,
|
||||||
|
KtxScanWarning,
|
||||||
|
} from './context/scan/types.js';
|
||||||
import { runLocalScan } from './context/scan/local-scan.js';
|
import { runLocalScan } from './context/scan/local-scan.js';
|
||||||
import { loadKtxProject, type KtxLocalProject } from './context/project/project.js';
|
import { loadKtxProject, type KtxLocalProject } from './context/project/project.js';
|
||||||
import { getKtxCliPackageInfo } from './cli-runtime.js';
|
import { getKtxCliPackageInfo } from './cli-runtime.js';
|
||||||
|
|
@ -21,6 +27,8 @@ export interface KtxScanArgs {
|
||||||
mode: KtxScanMode;
|
mode: KtxScanMode;
|
||||||
detectRelationships: boolean;
|
detectRelationships: boolean;
|
||||||
dryRun: boolean;
|
dryRun: boolean;
|
||||||
|
/** Enrichment stages to (re)run; omit to run all eligible stages. */
|
||||||
|
stages?: KtxScanEnrichmentStage[];
|
||||||
databaseIntrospectionUrl?: string;
|
databaseIntrospectionUrl?: string;
|
||||||
cliVersion?: string;
|
cliVersion?: string;
|
||||||
runtimeInstallPolicy?: KtxManagedPythonInstallPolicy;
|
runtimeInstallPolicy?: KtxManagedPythonInstallPolicy;
|
||||||
|
|
@ -180,8 +188,14 @@ function describeWarningGroup(code: string, count: number): string {
|
||||||
return `${count} LLM relationship ${plural(count, 'proposal')} failed.`;
|
return `${count} LLM relationship ${plural(count, 'proposal')} failed.`;
|
||||||
case 'scan_enrichment_backend_not_configured':
|
case 'scan_enrichment_backend_not_configured':
|
||||||
return 'Scan enrichment backend is not configured; AI stages were skipped.';
|
return 'Scan enrichment backend is not configured; AI stages were skipped.';
|
||||||
|
case 'enrichment_stage_skipped':
|
||||||
|
return `${count} requested ${plural(count, 'enrichment stage')} could not run (prerequisite missing).`;
|
||||||
|
case 'enrichment_stage_stale':
|
||||||
|
return `${count} enrichment ${plural(count, 'stage')} are stale after a selective run; re-run them to refresh.`;
|
||||||
case 'credential_redacted':
|
case 'credential_redacted':
|
||||||
return `${count} ${plural(count, 'credential')} were redacted from scan output.`;
|
return `${count} ${plural(count, 'credential')} were redacted from scan output.`;
|
||||||
|
case 'object_introspection_failed':
|
||||||
|
return `${count} ${plural(count, 'object')} skipped during introspection (broken or inaccessible objects were excluded; the rest were ingested).`;
|
||||||
default:
|
default:
|
||||||
return `${count} ${plural(count, 'warning')} (${code})`;
|
return `${count} ${plural(count, 'warning')} (${code})`;
|
||||||
}
|
}
|
||||||
|
|
@ -348,6 +362,7 @@ export async function runKtxScan(args: KtxScanArgs, io: KtxCliIo = process, deps
|
||||||
connectionId: args.connectionId,
|
connectionId: args.connectionId,
|
||||||
mode: args.mode,
|
mode: args.mode,
|
||||||
detectRelationships: args.detectRelationships,
|
detectRelationships: args.detectRelationships,
|
||||||
|
...(args.stages ? { stages: args.stages } : {}),
|
||||||
dryRun: args.dryRun,
|
dryRun: args.dryRun,
|
||||||
trigger: 'cli',
|
trigger: 'cli',
|
||||||
databaseIntrospectionUrl: args.databaseIntrospectionUrl,
|
databaseIntrospectionUrl: args.databaseIntrospectionUrl,
|
||||||
|
|
|
||||||
|
|
@ -320,7 +320,7 @@ function unique(values: string[]): string[] {
|
||||||
}
|
}
|
||||||
|
|
||||||
function assertSafeDatabaseConnectionId(connectionId: string): void {
|
function assertSafeDatabaseConnectionId(connectionId: string): void {
|
||||||
if (!/^[a-zA-Z0-9][a-zA-Z0-9_-]*$/.test(connectionId)) {
|
if (!/^[a-zA-Z0-9_][a-zA-Z0-9_-]*$/.test(connectionId)) {
|
||||||
throw new Error(`Unsafe connection id: ${connectionId}`);
|
throw new Error(`Unsafe connection id: ${connectionId}`);
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
|
||||||
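Note on the hunk above: the only change to `assertSafeDatabaseConnectionId` is that the first character class now also admits `_`. The sketch below copies both patterns from the diff into a standalone file to show the difference; the constant names `OLD_PATTERN` and `NEW_PATTERN` and the sample ids are illustrative assumptions, not names from the repository.

```typescript
// Both regular expressions are copied from the diff; the constant names are hypothetical.
const OLD_PATTERN = /^[a-zA-Z0-9][a-zA-Z0-9_-]*$/;
const NEW_PATTERN = /^[a-zA-Z0-9_][a-zA-Z0-9_-]*$/;

// Standalone copy of the check as it reads after the change.
function assertSafeDatabaseConnectionId(connectionId: string): void {
  if (!NEW_PATTERN.test(connectionId)) {
    throw new Error(`Unsafe connection id: ${connectionId}`);
  }
}

// An id with a leading underscore was rejected before and is accepted now.
console.log(OLD_PATTERN.test('_staging')); // false
console.log(NEW_PATTERN.test('_staging')); // true

// A leading hyphen and path-like ids are still rejected by both patterns.
console.log(NEW_PATTERN.test('-staging')); // false
console.log(NEW_PATTERN.test('../etc'));   // false

assertSafeDatabaseConnectionId('sales_db'); // does not throw
```

The pattern stays anchored at both ends, so characters such as `/` and `.` remain excluded and the relaxation does not reopen path traversal through connection ids.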
Some files were not shown because too many files have changed in this diff.