ktx/packages/cli/src/context/scan/description-generation.test.ts
Andrey Avtomonov 2366b00301
chore(workspace): gate dead-code with knip production mode (#196)
* refactor(workspace): relocate @ktx/llm source into packages/cli/src/llm

* refactor(workspace): rewrite @ktx/llm imports to relative paths

* refactor(workspace): fold internal packages into cli

* chore(workspace): gate dead-code with knip production mode

Turn on production-mode knip plus an autofix run in pre-commit and the
`pnpm dead-code` script, document the `/** @internal */` convention for
test-only exports in AGENTS.md, annotate test-only exports across the
CLI with that JSDoc, and drop dead exports/wrappers the new gate
surfaced (e.g. `cli-project.ts`, `lookerRuntimeSourceToFileAdapterSource`,
`createLocalScanEnrichmentProvidersFromConfig`,
`PGLITE_OWNER_PROCESS_BACKEND_CAPABILITIES`, stale type re-exports).
Replace the loose `ignoreIssues` allowlist in `knip.json` with explicit
production entries so cross-package barrel leaks are caught.

* refactor(cli): delete internal barrel index.ts files

The 34 `index.ts` re-export barrels inside `packages/cli/src/` were
holdovers from the pre-fold multi-workspace structure. Post-fold-in they
served no production purpose: external consumers go through the single
package main entry, and in-repo callers mostly imported through them
only because the path was short. Internally, knip flagged most barrel
re-exports as production-dead (only reached via tests).

This change:
- Deletes every internal barrel except `packages/cli/src/index.ts`
  (the published package entry).
- Rewrites ~270 source/test files to import each name directly from
  the file that defines it.
- Moves `tools/warehouse-verification/index.ts` to
  `create-warehouse-verification-tools.ts` (the function it defined
  locally) and updates its single consumer.
- Renames `search/backend-conformance.ts` → `.test-utils.ts` to match
  the existing test-helper file convention.
- Deletes 13 dead test-only chains (dbt-descriptions/*,
  live-database/extracted-schema, live-database/structural-sync,
  relationship-* feedback/review chain) plus their tests and a
  cascading orphan integration test.
- Updates test mocks that pointed at deleted barrel paths
  (notion-client, connector barrels in scan/local-scan-connectors
  tests) to mock the source files instead.
- Points the maintainer benchmark script
  (`scripts/relationship-benchmark-report.mjs`) at source files
  instead of `dist/context/scan/index.js`.
- Drops the barrel `!` entries from `knip.json`; adds explicit
  production entries only for the benchmark code reached via dist by
  the maintainer script.

Net: 413 files changed, ~1.2k insertions, ~9.4k deletions.

`pnpm run dead-code` (Biome + knip default + knip production) and
`pnpm run type-check` are clean; 2277 tests pass.

* refactor(workspace): rename @ktx/cli to @kaelio/ktx and pack it directly

Promote the CLI workspace package to the public name `@kaelio/ktx` and
drop the separate `scripts/build-public-npm-package.mjs` wrapper. The
CLI package is now publishable in place (`publishConfig.access: public`,
`provenance: true`), so artifact packing uses `pnpm pack` against
`packages/cli/` instead of assembling a parallel package tree.

Updates all workspace filter invocations, docs, tests, and release
readiness checks to reference the new package name, and folds the
tarball-name helper into `scripts/public-npm-release-metadata.mjs`.

* docs: align "agent clients" and "data agents" terminology

Replace "client agents" with "agent clients" and "database agents" with
"data agents" across AGENTS.md, README.md, the docs-site copy, and the
matching setup-agents test description, matching the canonical
vocabulary in docs/terminology.md.

Also moves packages/cli/tsconfig.json's tsBuildInfoFile from
node_modules/.cache/ to dist/.tsbuildinfo so incremental builds survive
node_modules reinstalls.

* refactor(release): single source of truth for package version

Make packages/cli/package.json the single source of truth for the
@kaelio/ktx version. publicNpmPackageVersion() now reads it directly,
so artifact filenames, release-readiness checks, and the Python wheel
version all derive from one field. The duplicate
release-policy.json.publicNpmPackageVersion is removed.

Previously the two fields could drift: tarballs were named
kaelio-ktx-0.4.1.tgz while internally containing
@kaelio/ktx@0.0.0-private.

- update-public-release-version.mjs rewrites both Python pyproject.toml
  files (ktx-daemon, ktx-sl) alongside the npm package.jsons,
  normalizing the version for PEP 440 (e.g. 0.1.0-rc.2 -> 0.1.0rc2).
- semantic-release-config.cjs adds the two pyproject.toml files to
  @semantic-release/git assets so the release commit back to main
  carries every version source in lockstep.
- The six "?? '0.0.0-private'" fallback literals across the CLI are
  replaced with "?? getKtxCliPackageInfo().version", and
  createDefaultKtxMcpServer makes its version arg required.
- docs/release.md describes the actual commit-back model: the dev tree
  always reflects the most recent release; no sentinel pin to
  maintain.
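
The PEP 440 normalization described in the first bullet can be sketched as below. This is only an illustration of the mapping (`0.1.0-rc.2` -> `0.1.0rc2`), not the code in `update-public-release-version.mjs`. The name `toPep440Version` and the set of handled pre-release labels are assumptions.

```typescript
// Hypothetical helper; not taken from update-public-release-version.mjs.
// Maps semver pre-release labels to their PEP 440 spellings.
const PRERELEASE_LABELS: Record<string, string> = { alpha: 'a', beta: 'b', rc: 'rc' };

function toPep440Version(semver: string): string {
  // Accepts "X.Y.Z" or "X.Y.Z-<label>.<n>" and rejects anything else.
  const match = /^(\d+\.\d+\.\d+)(?:-(alpha|beta|rc)\.(\d+))?$/.exec(semver);
  if (!match) {
    throw new Error(`Unsupported version format: ${semver}`);
  }
  const release = match[1];
  const label = match[2];
  const serial = match[3];
  // A plain release needs no rewriting.
  if (label === undefined || serial === undefined) {
    return release;
  }
  // PEP 440 drops the "-" and "." separators around the pre-release segment.
  return `${release}${PRERELEASE_LABELS[label]}${serial}`;
}
```

Rejecting unknown formats, rather than passing them through, would keep an unexpected version string from reaching `pyproject.toml` silently.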

Verified: pnpm run artifacts:build now produces
kaelio-ktx-0.4.1.tgz and kaelio_ktx-0.4.1-py3-none-any.whl with
@kaelio/ktx@0.4.1 inside. Full type-check, dead-code, and
2287 vitests + 173 script tests pass.

* refactor(cli): inject embedding provider resolution and detect sentence-transformers runtime

Make resolveProjectEmbeddingProvider and runtimeIo injectable in ingest and
scan command entrypoints so tests can stub them, and teach
resolvePublicIngestRuntimeRequirements to flag the local-embeddings runtime
feature when ktx.yaml selects sentence-transformers.

* chore(cli): mark buildLocalStatsStatus and LocalStatsStatus as @internal

Both symbols are consumed only by status-project.test.ts. Annotating with
/** @internal */ keeps knip's production-mode check clean without changing
runtime behavior.
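
As a rough illustration of the annotation, a test-only export might look like the sketch below. All names here (`ExampleStatus`, `buildExampleStatus`, `renderExampleStatus`) are hypothetical and do not reflect the real `buildLocalStatsStatus` signature.

```typescript
// Hypothetical module illustrating the convention; names are not from the repo.

/** @internal */
export interface ExampleStatus {
  label: string;
  count: number;
}

// Exported only so a test file can call it directly. The JSDoc tag marks the
// export as intentionally test-only for the dead-code check.
/** @internal */
export function buildExampleStatus(count: number): ExampleStatus {
  return { label: 'pages', count };
}

// The production entry point that the rest of the code base would use.
export function renderExampleStatus(count: number): string {
  const status = buildExampleStatus(count);
  return `${status.label}: ${status.count}`;
}
```

The tag is a comment, so it has no effect on the emitted JavaScript.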

* fix(cli): use real package metadata in print-command-tree

The stubbed package name embedded a forbidden product identifier that
tripped the boundary check in CI. Read the metadata from package.json
instead, which keeps the rendered tree unchanged and removes a duplicate
source of truth.

* feat(cli): show embedding coverage in `ktx status`, drop duplicate disk counts

Inline `(N embedded)` next to the Wiki scope counts and Semantic-layer
source counts, computed with `SUM(embedding_json IS NOT NULL)` over
`knowledge_pages` and `local_sl_sources`. Rename the "Knowledge" label to
"Wiki" (canonical per `docs/terminology.md`) and rename the matching
`localStats.knowledgePages` field to `localStats.wikiPages`.

Drop `wiki=N md` and `semantic-layer=N yaml` from the Disk row — those
duplicated the per-surface rows above. Disk now reports only actual byte
usage (db, cache, raw-sources). The unused `wikiGlobalMarkdownCount` /
`semanticLayerYamlCount` fields, the `isMarkdownEntry` / `isYamlEntry`
helpers, and the `filter` arg on `summarizeDir` are removed.
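
An in-memory equivalent of the `SUM(embedding_json IS NOT NULL)` aggregate is sketched below: each row contributes 1 when its embedding is present and 0 otherwise. The row shape, the function names, and the exact `total (N embedded)` layout are assumptions for illustration. The real query runs in SQL against `knowledge_pages` and `local_sl_sources`.

```typescript
// Hypothetical row shape; only the nullable embedding column matters here.
interface EmbeddableRow {
  embedding_json: string | null;
}

// Mirrors SUM(embedding_json IS NOT NULL): add 1 per row with an embedding.
function countEmbedded(rows: readonly EmbeddableRow[]): number {
  return rows.reduce((sum, row) => sum + (row.embedding_json !== null ? 1 : 0), 0);
}

// Assumed display layout: the total followed by the inline embedded count.
function formatCoverage(rows: readonly EmbeddableRow[]): string {
  return `${rows.length} (${countEmbedded(rows)} embedded)`;
}
```

Computing the count in the same pass as the total avoids a second scan and keeps both numbers consistent with each other.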
2026-05-21 15:28:58 +02:00


import { describe, expect, it, vi } from 'vitest';

// Stub only `generateText`; every other export of `ai` keeps its real implementation.
vi.mock('ai', async (importOriginal) => {
  const actual = await importOriginal<typeof import('ai')>();
  return { ...actual, generateText: vi.fn() };
});

import { generateText } from 'ai';
import {
  buildKtxColumnDescriptionPrompt,
  buildKtxDataSourceDescriptionPrompt,
  buildKtxTableDescriptionPrompt,
  type KtxDescriptionCachePort,
  KtxDescriptionGenerator,
} from './description-generation.js';
import { createKtxConnectorCapabilities, type KtxScanConnector } from './types.js';

// In-memory cache port. Keys join the non-empty parts of the table path with dots.
function createCache(initial: Record<string, string> = {}): KtxDescriptionCachePort {
  const data = new Map(Object.entries(initial));
  return {
    buildTableKey: (table) => [table.catalog, table.db, table.name].filter(Boolean).join('.'),
    buildColumnKey: (table, columnName) => [table.catalog, table.db, table.name, columnName].filter(Boolean).join('.'),
    buildConnectionKey: (connectionName) => `__connection:${connectionName}`,
    get: vi.fn(async (key: string) => data.get(key) ?? null),
    set: vi.fn(async (key: string, value: string) => {
      data.set(key, value);
    }),
  };
}

// Builds a provider whose `generateText` forwards to the mocked `ai` export,
// so tests can assert on the exact request shape that reaches the SDK.
function createMockBackedLlmProvider() {
  return {
    generateText: vi.fn(async (input) => {
      const result = await generateText({
        system: input.system ? { role: 'system', content: input.system } : undefined,
        messages: [{ role: 'user', content: input.prompt }],
        temperature: input.temperature,
      } as never);
      return result.text;
    }),
    generateObject: vi.fn(),
    runAgentLoop: vi.fn(),
  } as any;
}

// Provider whose underlying SDK call resolves with `text`.
function createLlmProvider(text = 'generated description') {
  vi.mocked(generateText).mockResolvedValue({ text } as never);
  return createMockBackedLlmProvider();
}

// Provider whose underlying SDK call rejects with `message`.
function createFailingLlmProvider(message = 'timeout exceeded when trying to connect') {
  vi.mocked(generateText).mockRejectedValue(new Error(message) as never);
  return createMockBackedLlmProvider();
}
function createConnector(): KtxScanConnector {
  return {
    id: 'test-connector',
    driver: 'postgres',
    capabilities: createKtxConnectorCapabilities({
      tableSampling: true,
      columnSampling: true,
      nestedAnalysis: true,
    }),
    introspect: vi.fn(async () => {
      throw new Error('introspection is not used by description generation');
    }),
    sampleColumn: vi.fn(async () => ({
      values: ['paid', 'refunded', null],
      nullCount: 1,
      distinctCount: 2,
    })),
    sampleTable: vi.fn(async () => ({
      headers: ['id', 'status', 'amount'],
      rows: [
        [1, 'paid', 20],
        [2, 'refunded', 10],
      ],
      totalRows: 2,
    })),
  };
}
describe('KTX description prompt builders', () => {
it('builds column prompts with sample values, source descriptions, and nested BigQuery guidance', () => {
const { system, user } = buildKtxColumnDescriptionPrompt({
columnName: 'payload',
columnValues: [{ nested: true }, '[1,2]'],
tableContext: 'Table: events | Columns: payload | Data source: BIGQUERY',
dataSourceType: 'BIGQUERY',
supportsNestedAnalysis: true,
rawDescriptions: { db: 'Raw event payload', ai: 'Old AI text', user: 'User text' },
maxWords: 12,
});
expect(user).toContain(
'<table_context> Table: events | Columns: payload | Data source: BIGQUERY </table_context>',
);
expect(user).toContain('<column_name> payload </column_name>');
expect(user).toContain('<sample_values> [object Object], [1,2] </sample_values>');
expect(user).toContain('<db_documentation> Raw event payload </db_documentation>');
expect(user).not.toContain('Old AI text');
expect(user).not.toContain('User text');
expect(system).toContain('nested/structured data');
expect(system).toContain('12 words or less');
expect(user).not.toContain('12 words or less');
});
it('builds table and data-source prompts from sampled rows', () => {
const sample = {
headers: ['id', 'status'],
rows: [
[1, 'paid'],
[2, 'refunded'],
],
totalRows: 2,
};
const table = buildKtxTableDescriptionPrompt({
tableName: 'orders',
sampleData: sample,
dataSourceType: 'POSTGRESQL',
rawDescriptions: { dbt: 'Fact table for commerce orders' },
});
expect(table.user).toContain('status: paid, refunded');
expect(table.system).toContain('Analyze database tables');
const datasource = buildKtxDataSourceDescriptionPrompt({
tableSamples: [['orders', sample]],
dataSourceType: 'POSTGRESQL',
});
expect(datasource.user).toContain('orders (2 columns, 2 sample rows)');
expect(datasource.system).toContain('Analyze databases');
});
});
describe('KtxDescriptionGenerator', () => {
it('generates column descriptions with pre-fetched values, cache hits, and word-limit metadata', async () => {
const cache = createCache({ 'warehouse.public.orders.cached_status': 'Cached status description' });
const llmRuntime = createLlmProvider('Payment state');
const connector = createConnector();
const generator = new KtxDescriptionGenerator({
llmRuntime,
cache,
settings: {
columnMaxWords: 12,
tableMaxWords: 18,
dataSourceMaxWords: 24,
temperature: 0.2,
concurrencyLimit: 2,
},
});
const result = await generator.generateColumnDescriptions({
connectionId: 'conn-1',
connector,
context: { runId: 'run-1' },
dataSourceType: 'POSTGRESQL',
supportsNestedAnalysis: false,
table: {
catalog: 'warehouse',
db: 'public',
name: 'orders',
columns: [
{ name: 'status', sampleValues: ['paid', 'refunded'], rawDescriptions: { db: 'Payment lifecycle' } },
{ name: 'cached_status', sampleValues: ['open'] },
],
},
skipExisting: false,
existingDescriptions: {},
});
expect(result).toEqual({
columnDescriptions: [
['status', 'Payment state'],
['cached_status', 'Cached status description'],
],
processedColumns: ['status'],
skippedColumns: ['cached_status'],
});
expect(connector.sampleColumn).not.toHaveBeenCalled();
expect(generateText).toHaveBeenCalledWith(
expect.objectContaining({
temperature: 0.2,
system: expect.objectContaining({
role: 'system',
content: expect.stringContaining('Please provide a concise description in 12 words or less.'),
}),
messages: expect.arrayContaining([
expect.objectContaining({
role: 'user',
content: expect.stringContaining('<column_name> status </column_name>'),
}),
]),
}),
);
const lastCall = vi.mocked(generateText).mock.calls.at(-1)?.[0];
expect(lastCall?.messages?.some((message) => message.role === 'system')).toBe(false);
});
it('samples through the connector when column values are not pre-fetched', async () => {
const connector = createConnector();
const generator = new KtxDescriptionGenerator({
llmRuntime: createLlmProvider('Current order state'),
settings: {
columnMaxWords: 12,
tableMaxWords: 18,
dataSourceMaxWords: 24,
},
});
const result = await generator.generateColumnDescriptions({
connectionId: 'conn-1',
connector,
context: { runId: 'run-1' },
dataSourceType: 'POSTGRESQL',
supportsNestedAnalysis: false,
table: {
catalog: null,
db: 'public',
name: 'orders',
columns: [{ name: 'status' }],
},
});
expect(connector.sampleColumn).toHaveBeenCalledWith(
{
connectionId: 'conn-1',
table: { catalog: null, db: 'public', name: 'orders' },
column: 'status',
limit: 50,
},
{ runId: 'run-1' },
);
expect(result.columnDescriptions).toEqual([['status', 'Current order state']]);
});
it('samples through a description sampling port without requiring structural introspection', async () => {
const sampler = {
id: 'description-sampler:conn-1',
sampleColumn: vi.fn(async () => ({
values: ['paid', 'refunded'],
nullCount: null,
distinctCount: null,
})),
sampleTable: vi.fn(async () => ({
headers: ['id', 'status'],
rows: [[1, 'paid']],
totalRows: 1,
})),
};
const generator = new KtxDescriptionGenerator({
llmRuntime: createLlmProvider('Generated through sampler'),
settings: {
columnMaxWords: 12,
tableMaxWords: 18,
dataSourceMaxWords: 24,
},
});
const result = await generator.generateColumnDescriptions({
connectionId: 'conn-1',
connector: sampler,
context: { runId: 'run-1' },
dataSourceType: 'POSTGRESQL',
supportsNestedAnalysis: false,
table: {
catalog: null,
db: 'public',
name: 'orders',
columns: [{ name: 'status' }],
},
});
expect(result.columnDescriptions).toEqual([['status', 'Generated through sampler']]);
expect(sampler.sampleColumn).toHaveBeenCalledWith(
{
connectionId: 'conn-1',
table: { catalog: null, db: 'public', name: 'orders' },
column: 'status',
limit: 50,
},
{ runId: 'run-1' },
);
expect('introspect' in sampler).toBe(false);
});
it('does not turn LLM failures into generated descriptions', async () => {
const cache = createCache();
const connector = createConnector();
const generator = new KtxDescriptionGenerator({
llmRuntime: createFailingLlmProvider(),
cache,
settings: {
columnMaxWords: 12,
tableMaxWords: 18,
dataSourceMaxWords: 24,
},
});
const columnResult = await generator.generateColumnDescriptions({
connectionId: 'conn-1',
connector,
context: { runId: 'run-1' },
dataSourceType: 'POSTGRESQL',
supportsNestedAnalysis: false,
table: {
catalog: null,
db: 'public',
name: 'orders',
columns: [{ name: 'status' }],
},
});
await expect(
generator.generateTableDescription({
connectionId: 'conn-1',
connector,
context: { runId: 'run-1' },
dataSourceType: 'POSTGRESQL',
table: { catalog: null, db: 'public', name: 'orders' },
}),
).resolves.toBeNull();
expect(columnResult).toEqual({
columnDescriptions: [['status', null]],
processedColumns: [],
skippedColumns: [],
});
expect(cache.set).not.toHaveBeenCalled();
});
it('generates and caches table and data-source descriptions', async () => {
const cache = createCache();
const connector = createConnector();
const generator = new KtxDescriptionGenerator({
llmRuntime: createLlmProvider('Commerce orders'),
cache,
settings: {
columnMaxWords: 12,
tableMaxWords: 18,
dataSourceMaxWords: 24,
concurrencyLimit: 2,
},
});
await expect(
generator.generateTableDescription({
connectionId: 'conn-1',
connector,
context: { runId: 'run-1' },
dataSourceType: 'POSTGRESQL',
table: { catalog: 'warehouse', db: 'public', name: 'orders', rawDescriptions: { db: 'Raw orders' } },
}),
).resolves.toBe('Commerce orders');
await expect(
generator.generateDataSourceDescription({
connectionId: 'conn-1',
connector,
context: { runId: 'run-1' },
dataSourceType: 'POSTGRESQL',
tables: [
{ catalog: 'warehouse', db: 'public', name: 'orders' },
{ catalog: 'warehouse', db: 'public', name: 'customers' },
],
connectionName: 'Warehouse',
}),
).resolves.toBe('Commerce orders');
expect(cache.set).toHaveBeenCalledWith('warehouse.public.orders', 'Commerce orders');
expect(cache.set).toHaveBeenCalledWith('__connection:Warehouse', 'Commerce orders');
});
});
describe('KtxDescriptionGenerator resilience', () => {
function createLogger() {
return {
debug: vi.fn(),
info: vi.fn(),
warn: vi.fn(),
error: vi.fn(),
};
}
it('retries sampleTable on transient failure and uses sampled rows when it eventually succeeds', async () => {
const sampleTable = vi
.fn<NonNullable<KtxScanConnector['sampleTable']>>()
.mockRejectedValueOnce(new Error('pool: transient ECONNRESET'))
.mockRejectedValueOnce(new Error('pool: transient ECONNRESET'))
.mockResolvedValue({
headers: ['id', 'status'],
rows: [
[1, 'paid'],
[2, 'refunded'],
],
totalRows: 2,
});
const connector: KtxScanConnector = {
...createConnector(),
sampleTable,
};
const logger = createLogger();
const warnings: Array<{ code: string; table?: string }> = [];
const generator = new KtxDescriptionGenerator({
llmRuntime: createLlmProvider('Commerce orders'),
logger,
onWarning: (warning) => warnings.push({ code: warning.code, ...(warning.table ? { table: warning.table } : {}) }),
settings: { columnMaxWords: 12, tableMaxWords: 18, dataSourceMaxWords: 24, concurrencyLimit: 2 },
});
const description = await generator.generateTableDescription({
connectionId: 'conn-1',
connector,
context: { runId: 'run-1' },
dataSourceType: 'POSTGRESQL',
table: { catalog: null, db: 'public', name: 'orders' },
});
expect(description).toBe('Commerce orders');
expect(sampleTable).toHaveBeenCalledTimes(3);
expect(logger.warn).toHaveBeenCalledTimes(2);
expect(warnings).toEqual([]);
});
it('falls back to metadata-only prompt when sampleTable retries exhaust', async () => {
const sampleTable = vi
.fn<NonNullable<KtxScanConnector['sampleTable']>>()
.mockRejectedValue(new Error('pool: connection refused'));
const connector: KtxScanConnector = {
...createConnector(),
sampleTable,
};
const logger = createLogger();
const warnings: Array<{ code: string; table?: string; metadata?: Record<string, unknown> }> = [];
const generator = new KtxDescriptionGenerator({
llmRuntime: createLlmProvider('Customer reference data'),
logger,
onWarning: (warning) =>
warnings.push({
code: warning.code,
...(warning.table ? { table: warning.table } : {}),
...(warning.metadata ? { metadata: warning.metadata } : {}),
}),
settings: { columnMaxWords: 12, tableMaxWords: 18, dataSourceMaxWords: 24, concurrencyLimit: 2 },
});
const description = await generator.generateTableDescription({
connectionId: 'conn-1',
connector,
context: { runId: 'run-1' },
dataSourceType: 'POSTGRESQL',
table: {
catalog: null,
db: 'public',
name: 'customers',
columns: [
{ name: 'id', nativeType: 'uuid' },
{ name: 'email', nativeType: 'text', comment: 'Primary contact email' },
],
},
});
expect(description).toBe('Customer reference data');
expect(sampleTable).toHaveBeenCalledTimes(3);
expect(warnings.map((warning) => warning.code)).toEqual(['sampling_failed', 'description_fallback_used']);
expect(warnings[1]?.metadata?.reason).toBe('sampling_failed');
const userPrompt = (vi.mocked(generateText).mock.calls.at(-1)?.[0] as { messages: Array<{ role: string; content: string }> })
.messages.find((message) => message.role === 'user')?.content;
expect(userPrompt).toContain('Columns (metadata only, no sample rows)');
expect(userPrompt).toContain('email (text)');
expect(userPrompt).toContain('Primary contact email');
});
it('emits enrichment_failed and returns null when both sampling and metadata-only LLM fail', async () => {
const sampleTable = vi
.fn<NonNullable<KtxScanConnector['sampleTable']>>()
.mockRejectedValue(new Error('pool: connection refused'));
const connector: KtxScanConnector = {
...createConnector(),
sampleTable,
};
const warnings: string[] = [];
const generator = new KtxDescriptionGenerator({
llmRuntime: createFailingLlmProvider(),
onWarning: (warning) => warnings.push(warning.code),
settings: { columnMaxWords: 12, tableMaxWords: 18, dataSourceMaxWords: 24 },
});
const description = await generator.generateTableDescription({
connectionId: 'conn-1',
connector,
context: { runId: 'run-1' },
dataSourceType: 'POSTGRESQL',
table: { catalog: null, db: 'public', name: 'orphan', columns: [{ name: 'id' }] },
});
expect(description).toBeNull();
expect(warnings).toEqual(['sampling_failed', 'enrichment_failed']);
});
it('uses metadata-only fallback when connector has no sampleTable', async () => {
const connector = createConnector();
const samplerWithoutTable: KtxScanConnector = {
...connector,
sampleTable: undefined,
};
const warnings: string[] = [];
const generator = new KtxDescriptionGenerator({
llmRuntime: createLlmProvider('Orders mart'),
onWarning: (warning) => warnings.push(warning.code),
settings: { columnMaxWords: 12, tableMaxWords: 18, dataSourceMaxWords: 24 },
});
const description = await generator.generateTableDescription({
connectionId: 'conn-1',
connector: samplerWithoutTable,
context: { runId: 'run-1' },
dataSourceType: 'POSTGRESQL',
table: {
catalog: null,
db: 'public',
name: 'mart_orders',
columns: [{ name: 'order_id', nativeType: 'uuid' }],
},
});
expect(description).toBe('Orders mart');
expect(warnings).toEqual(['connector_capability_missing', 'description_fallback_used']);
});
it('aborts retry loop when the scan context signal fires', async () => {
const controller = new AbortController();
const sampleTable = vi.fn<NonNullable<KtxScanConnector['sampleTable']>>().mockImplementation(async () => {
controller.abort();
throw new Error('first attempt blew up');
});
const connector: KtxScanConnector = {
...createConnector(),
sampleTable,
};
const warnings: string[] = [];
const generator = new KtxDescriptionGenerator({
llmRuntime: createLlmProvider('should not be called'),
onWarning: (warning) => warnings.push(warning.code),
settings: { columnMaxWords: 12, tableMaxWords: 18, dataSourceMaxWords: 24 },
});
await expect(
generator.generateTableDescription({
connectionId: 'conn-1',
connector,
context: { runId: 'run-1', signal: controller.signal },
dataSourceType: 'POSTGRESQL',
table: { catalog: null, db: 'public', name: 'orders' },
}),
).rejects.toThrow('aborted');
expect(sampleTable).toHaveBeenCalledTimes(1);
expect(warnings).toEqual([]);
});
it('generates column descriptions from rawDescriptions when sampleColumn is unavailable', async () => {
const samplerWithoutColumn: KtxScanConnector = {
...createConnector(),
sampleColumn: undefined,
};
const logger = createLogger();
const generator = new KtxDescriptionGenerator({
llmRuntime: createLlmProvider('Payment lifecycle state'),
logger,
settings: { columnMaxWords: 12, tableMaxWords: 18, dataSourceMaxWords: 24 },
});
const result = await generator.generateColumnDescriptions({
connectionId: 'conn-1',
connector: samplerWithoutColumn,
context: { runId: 'run-1' },
dataSourceType: 'POSTGRESQL',
supportsNestedAnalysis: false,
table: {
catalog: null,
db: 'public',
name: 'orders',
columns: [{ name: 'status', rawDescriptions: { db: 'order lifecycle state' } }],
},
});
expect(result.columnDescriptions).toEqual([['status', 'Payment lifecycle state']]);
expect(logger.warn).toHaveBeenCalled();
const userPrompt = (
vi.mocked(generateText).mock.calls.at(-1)?.[0] as { messages: Array<{ role: string; content: string }> }
).messages.find((message) => message.role === 'user')?.content;
expect(userPrompt).toContain('<sample_values> unavailable </sample_values>');
expect(userPrompt).toContain('<db_documentation> order lifecycle state </db_documentation>');
});
it('generates column descriptions from rawDescriptions when sampleColumn retries exhaust', async () => {
const sampleColumn = vi
.fn<NonNullable<KtxScanConnector['sampleColumn']>>()
.mockRejectedValue(new Error('pool: connection refused'));
const flakyConnector: KtxScanConnector = {
...createConnector(),
sampleColumn,
};
const generator = new KtxDescriptionGenerator({
llmRuntime: createLlmProvider('Customer reference identifier'),
settings: { columnMaxWords: 12, tableMaxWords: 18, dataSourceMaxWords: 24 },
});
const result = await generator.generateColumnDescriptions({
connectionId: 'conn-1',
connector: flakyConnector,
context: { runId: 'run-1' },
dataSourceType: 'POSTGRESQL',
supportsNestedAnalysis: false,
table: {
catalog: null,
db: 'public',
name: 'orders',
columns: [{ name: 'customer_id', rawDescriptions: { db: 'FK to customers.id' } }],
},
});
expect(sampleColumn).toHaveBeenCalledTimes(3);
expect(result.columnDescriptions).toEqual([['customer_id', 'Customer reference identifier']]);
});
it('skips column LLM call only when neither samples nor rawDescriptions are available', async () => {
const sampleColumn = vi
.fn<NonNullable<KtxScanConnector['sampleColumn']>>()
.mockResolvedValue({ values: [null, null], nullCount: 2, distinctCount: 0 });
const connector: KtxScanConnector = {
...createConnector(),
sampleColumn,
};
vi.mocked(generateText).mockClear();
const generator = new KtxDescriptionGenerator({
llmRuntime: createLlmProvider('should not be called'),
settings: { columnMaxWords: 12, tableMaxWords: 18, dataSourceMaxWords: 24 },
});
const result = await generator.generateColumnDescriptions({
connectionId: 'conn-1',
connector,
context: { runId: 'run-1' },
dataSourceType: 'POSTGRESQL',
supportsNestedAnalysis: false,
table: {
catalog: null,
db: 'public',
name: 'orders',
columns: [{ name: 'opaque_blob' }],
},
});
expect(result.columnDescriptions).toEqual([['opaque_blob', null]]);
expect(generateText).not.toHaveBeenCalled();
});
});
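
The resilience tests above imply a bounded retry around sampling: up to three attempts, one warning per failed attempt that will be retried, and an early exit once the scan's abort signal has fired. A minimal sketch of such a loop follows. The names `retryWithAbort`, `maxAttempts`, and `onRetry` are hypothetical, no backoff delay is modeled, and the actual logic in `description-generation.ts` is not shown in this file and may differ.

```typescript
// Hypothetical retry helper; not the implementation used by KtxDescriptionGenerator.
interface RetryOptions {
  maxAttempts: number; // assumed to be at least 1
  signal?: AbortSignal;
  onRetry?: (error: unknown, attempt: number) => void;
}

async function retryWithAbort<T>(operation: () => Promise<T>, options: RetryOptions): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= options.maxAttempts; attempt += 1) {
    // Never start an attempt after cancellation.
    if (options.signal?.aborted) {
      throw new Error('aborted');
    }
    try {
      return await operation();
    } catch (error) {
      lastError = error;
      // Cancellation that happened during the attempt wins over retrying.
      if (options.signal?.aborted) {
        throw new Error('aborted');
      }
      // Report only failures that will be followed by another attempt.
      if (attempt < options.maxAttempts) {
        options.onRetry?.(error, attempt);
      }
    }
  }
  // Every attempt failed; surface the most recent error to the caller.
  throw lastError;
}
```

With `maxAttempts: 3`, an operation that fails twice and then succeeds is called three times and triggers two `onRetry` notifications, which matches the call counts the first resilience test asserts.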