mirror of
https://github.com/Kaelio/ktx.git
synced 2026-07-01 08:59:39 +02:00
* feat(setup): drop redundant Snowflake schema prompt; fall back to free-text on listSchemas failure

  Snowflake setup previously asked for a single schema as free text, then ran a multiselect against the discovered schemas: two schema questions back-to-back, with the first being only a session bootstrap. The SDK's `schema` is optional, so the bootstrap step is unnecessary.

  - Remove the free-text Snowflake schema prompt; only pass `schema` to snowflake-sdk when one is configured.
  - When `listSchemas()` fails (e.g. role lacks SHOW SCHEMAS), prompt the user for a comma-separated list, persist it as `schema_names`, and use it as both the table-list filter and the multiselect default. Applies to every driver with a scope-discovery spec, not just Snowflake.
  - Update docs to lead with `schema_names`; keep `schema_name` as a documented single-schema shorthand.

* fix(snowflake): keep introspecting when primary-key discovery is denied

  The PK query joins INFORMATION_SCHEMA.TABLE_CONSTRAINTS and INFORMATION_SCHEMA.KEY_COLUMN_USAGE, which require grants the connection role may not have. Previously a 'SQL compilation error: Object ANALYTICS.INFORMATION_SCHEMA.KEY_COLUMN_USAGE does not exist or not authorized' aborted the entire introspect; schemas, columns, and row counts were all discarded over a missing nice-to-have.

  Wrap the constraint query in try/catch, log a one-line warning per schema, and return an empty PK map. Columns end up with primaryKey=false; relationship inference still has FK and profiling to fall back on.

* fix(scan): unblock relationship discovery on Snowflake

  Two adjacent bugs prevented the scan's relationship pipeline from producing any joins on a Snowflake warehouse:

  - relationship-profiling.ts fell through to a default `GROUP_CONCAT` branch for unknown drivers. Snowflake has no GROUP_CONCAT, so every per-table profile query failed with "Unknown function GROUP_CONCAT". Add an explicit Snowflake branch that uses LISTAGG with a literal '\x1f' delimiter (Snowflake requires the delimiter to be a constant, so CHR(31) is rejected).
  - description-generation.ts destructured `connector.sampleTable` and `connector.sampleColumn` into bare locals, losing the `this` binding when the class-method connectors (Snowflake, Postgres, MySQL) were invoked. Every sample call threw "Cannot read properties of undefined (reading 'assertConnection')" and degraded LLM descriptions to metadata-only prompts. Call the methods through the connector instead.

  Without these, even after the primary-key probe is allowed to fail softly, the scan ends up with 0 validated relationships and an empty `joins:` block in every shard YAML.

* test(scan): cover table-ref helpers
* feat(scan): plumb tableScope through live-database introspection port
* feat(scan): apply tableScope during metadata fetch
* feat(scan): enforce table scope at fetch boundary
* feat(scan): pool Snowflake sessions and batch enrichment for faster ingest (#206)
* feat(cli): add RSA key-pair auth option to Snowflake setup wizard

  Extends the interactive Snowflake setup flow with an authentication-method prompt (password vs RSA/JWT key-pair). The RSA branch collects a private-key path (env/file/absolute) and an optional passphrase; the resulting connection config records `authMethod: 'rsa'` with `privateKey` and `passphrase` instead of `password`.

* feat(scan): pool Snowflake sessions
* fix(scan): reuse structural snapshots and cleanup connectors
* feat(scan): parallelize relationship profiling
* feat(scan): batch table description generation
* docs: document Snowflake ingest concurrency knobs
* fix(scan): close Snowflake ingest perf verification gaps
* fix(scan): keep batched description failure bounded
* feat(scan): dispatch query-history probes by connection driver

  Extract historic-sql dialect resolution into a shared helper so the status-project readiness check and the local ingest factory agree on which connections enable query history and which probe to run. The status command now picks the postgres/snowflake/bigquery probe based on the connection's driver instead of always reporting against postgres, which previously caused snowflake connections with queryHistory.enabled to surface a misleading "driver is snowflake" failure.

  Also drops a noisy console.warn from Snowflake primary-key discovery: INFORMATION_SCHEMA.KEY_COLUMN_USAGE is commonly ungranted for read-only roles and the FK + profiling paths handle the empty PK map already.

* fix(llm): allow StructuredOutput tool and raise maxTurns for generateObject

  The Claude Code agent SDK announces an internal pseudo-tool named StructuredOutput in the system/init message whenever outputFormat is set to { type: 'json_schema' }. The runtime's isolation check built its allowedToolIds set only from MCP tool ids and treated StructuredOutput as an unexpected host-injected tool, so every generateObject call threw "Claude Code runtime isolation failed: tools=StructuredOutput ..." and the table-descriptions and relationship-LLM-proposal enrichment stages recorded null output across the board.

  Whitelist StructuredOutput specifically in generateObject's allowedToolIds. The check also enforces missing_tools symmetry, so generateText and runAgentLoop, which do not see StructuredOutput, must not require it.

  generateObject also ran with maxTurns: 1, which the model intermittently breached when it emitted thinking text before the structured response. Raised to 5 to give the schema-bound call enough headroom without allowing unbounded loops. The existing tests now exercise the path with an init message that announces StructuredOutput so the regression cannot slip back in.

* chore(scripts): add ktx-reset.sh project-cleanup helper

  Convenience script for repeatable ingest testing: takes a project directory and prunes everything except ktx.yaml and .ktx/secrets/, so the next ktx setup or ktx ingest run starts from a known-clean state.
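The lost `this` binding described in the fix(scan) entry above can be reproduced in isolation. The following is a minimal sketch, not the project's code: `ExampleConnector`, `sampleUnbound`, and `sampleThroughConnector` are hypothetical names introduced for illustration, and only the `sampleTable` / `assertConnection` method names echo the commit message.

```typescript
// Hypothetical class-method connector (an assumption for illustration only).
class ExampleConnector {
  private connected = true;

  private assertConnection(): void {
    if (!this.connected) {
      throw new Error('not connected');
    }
  }

  sampleTable(table: string): string[] {
    // Relies on `this`; fails if the method is called as a bare function.
    this.assertConnection();
    return [`row from ${table}`];
  }
}

// Buggy pattern: destructuring detaches the method from its instance.
// Class bodies run in strict mode, so `this` is undefined inside the call.
function sampleUnbound(connector: ExampleConnector, table: string): string[] | Error {
  const { sampleTable } = connector;
  try {
    return sampleTable(table);
  } catch (error) {
    return error instanceof Error ? error : new Error(String(error));
  }
}

// Fixed pattern: call the method through the connector so `this` stays bound.
function sampleThroughConnector(connector: ExampleConnector, table: string): string[] {
  return connector.sampleTable(table);
}
```

Calling `sampleUnbound(new ExampleConnector(), 'orders')` yields a `TypeError`, while `sampleThroughConnector` returns the sample rows. Binding with `connector.sampleTable.bind(connector)` would presumably work as well; calling through the instance is simply the smaller change.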
628 lines
22 KiB
TypeScript
import pLimit from 'p-limit';
import type { KtxLlmRuntimePort } from '../../context/llm/runtime-port.js';
import { buildDefaultKtxProjectConfig, type KtxScanRelationshipConfig } from '../project/config.js';
import { KtxDescriptionGenerator } from './description-generation.js';
import { buildKtxColumnEmbeddingText } from './embedding-text.js';
import {
  completedKtxScanEnrichmentStateSummary,
  computeKtxScanEnrichmentInputHash,
  type KtxScanEnrichmentStateStore,
  summarizeKtxScanEnrichmentState,
} from './enrichment-state.js';
import { skippedKtxScanEnrichmentSummary } from './enrichment-summary.js';
import type {
  KtxEmbeddingUpdate,
  KtxEnrichedColumn,
  KtxEnrichedRelationship,
  KtxEnrichedSchema,
  KtxEnrichedTable,
  KtxRelationshipEndpoint,
  KtxRelationshipUpdate,
} from './enrichment-types.js';
import type { KtxCompositeRelationshipCandidate } from './relationship-composite-candidates.js';
import type { KtxResolvedRelationshipDiscoveryCandidate } from './relationship-graph-resolver.js';
import { discoverKtxRelationships } from './relationship-discovery.js';
import type { KtxRelationshipProfileArtifact } from './relationship-profiling.js';
import type {
  KtxEmbeddingPort,
  KtxProgressPort,
  KtxScanConnector,
  KtxScanContext,
  KtxScanEnrichmentStage,
  KtxScanEnrichmentStateSummary,
  KtxScanEnrichmentSummary,
  KtxScanMode,
  KtxScanRelationshipSummary,
  KtxScanWarning,
  KtxSchemaColumn,
  KtxSchemaForeignKey,
  KtxSchemaSnapshot,
  KtxSchemaTable,
  KtxTableRef,
} from './types.js';

const DESCRIPTION_TABLE_CONCURRENCY = 4;

export interface KtxLocalScanEnrichmentProviders {
  llmRuntime: KtxLlmRuntimePort;
  embedding?: KtxEmbeddingPort | null;
}

export interface KtxLocalScanEnrichmentInput {
  connectionId: string;
  mode: KtxScanMode;
  detectRelationships?: boolean;
  connector: KtxScanConnector;
  snapshot?: KtxSchemaSnapshot;
  context: KtxScanContext;
  providers: KtxLocalScanEnrichmentProviders | null;
  stateStore?: KtxScanEnrichmentStateStore | null;
  syncId?: string;
  providerIdentity?: Record<string, unknown>;
  relationshipSettings?: KtxScanRelationshipConfig;
  now?: () => Date;
}

export interface KtxLocalScanEnrichmentResult {
  snapshot: KtxSchemaSnapshot;
  summary: KtxScanEnrichmentSummary;
  relationships: KtxScanRelationshipSummary;
  state: KtxScanEnrichmentStateSummary;
  warnings: KtxScanWarning[];
  descriptionUpdates: Array<{
    table: KtxTableRef;
    tableDescription: string | null;
    columnDescriptions: Record<string, string | null>;
  }>;
  embeddingUpdates: KtxEmbeddingUpdate[];
  relationshipUpdate: KtxRelationshipUpdate | null;
  relationshipProfile: KtxRelationshipProfileArtifact | null;
  resolvedRelationships: KtxResolvedRelationshipDiscoveryCandidate[] | null;
  compositeRelationships: KtxCompositeRelationshipCandidate[] | null;
}

function tableId(table: KtxSchemaTable): string {
  return [table.catalog, table.db, table.name].filter((value): value is string => Boolean(value)).join('.');
}

function columnId(table: KtxSchemaTable, column: KtxSchemaColumn): string {
  return `${tableId(table)}.${column.name}`;
}

function tableRef(table: KtxSchemaTable): KtxTableRef {
  return {
    catalog: table.catalog,
    db: table.db,
    name: table.name,
  };
}

function endpoint(table: KtxEnrichedTable, column: KtxEnrichedColumn): KtxRelationshipEndpoint {
  return {
    tableId: table.id,
    columnIds: [column.id],
    table: table.ref,
    columns: [column.name],
  };
}

function relationshipId(from: KtxRelationshipEndpoint, to: KtxRelationshipEndpoint): string {
  return `${from.tableId}:(${from.columnIds.join(',')})->${to.tableId}:(${to.columnIds.join(',')})`;
}

function targetMatchesForeignKey(table: KtxEnrichedTable, foreignKey: KtxSchemaForeignKey): boolean {
  return (
    table.ref.name === foreignKey.toTable &&
    (foreignKey.toCatalog === null || table.ref.catalog === foreignKey.toCatalog) &&
    (foreignKey.toDb === null || table.ref.db === foreignKey.toDb)
  );
}

function formalRelationshipsFromSnapshot(
  snapshot: KtxSchemaSnapshot,
  tables: readonly KtxEnrichedTable[],
): KtxEnrichedRelationship[] {
  const tableById = new Map(tables.map((table) => [table.id, table]));
  const relationships: KtxEnrichedRelationship[] = [];

  for (const sourceTableSnapshot of snapshot.tables) {
    const sourceTable = tableById.get(tableId(sourceTableSnapshot));
    if (!sourceTable) {
      continue;
    }

    for (const foreignKey of sourceTableSnapshot.foreignKeys) {
      const sourceColumn = sourceTable.columns.find((column) => column.name === foreignKey.fromColumn);
      const targetTable = tables.find((table) => targetMatchesForeignKey(table, foreignKey));
      const targetColumn = targetTable?.columns.find((column) => column.name === foreignKey.toColumn);
      if (!sourceColumn || !targetTable || !targetColumn) {
        continue;
      }

      const from = endpoint(sourceTable, sourceColumn);
      const to = endpoint(targetTable, targetColumn);
      relationships.push({
        id: relationshipId(from, to),
        source: 'formal',
        from,
        to,
        relationshipType: 'many_to_one',
        confidence: 1,
        isPrimaryKeyReference: true,
      });
    }
  }

  return relationships.sort((left, right) => left.id.localeCompare(right.id));
}

function providerlessEnrichedWarning(relationshipDetection: boolean): KtxScanWarning {
  return {
    code: 'scan_enrichment_backend_not_configured',
    message:
      'Skipping description and embedding enrichment because scan.enrichment.mode is not configured; relationship discovery still ran.',
    recoverable: true,
    metadata: {
      skippedStages: ['descriptions', 'embeddings'],
      relationshipDetection,
    },
  };
}

export function createDeterministicLocalScanEnrichmentProviders(): KtxLocalScanEnrichmentProviders {
  return {
    llmRuntime: deterministicLlmRuntime(),
  };
}

function deterministicLlmRuntime(): KtxLlmRuntimePort {
  return {
    async generateText(input) {
      return `Deterministic description for ${input.prompt.slice(0, 64).trim() || 'data source'}`;
    },
    async generateObject(input) {
      if (input.prompt.includes('Sample rows:')) {
        const columns = Array.from(input.prompt.matchAll(/^- ([^\s(]+)/gm), (match) => ({
          name: match[1] ?? 'column',
          description: `Deterministic description for ${match[1] ?? 'column'}`,
        }));
        return {
          tableDescription: `Deterministic description for ${input.prompt.slice(0, 64).trim() || 'table'}`,
          columns,
        } as never;
      }
      return { pkCandidates: [], fkCandidates: [] } as never;
    },
    async runAgentLoop() {
      return { stopReason: 'natural' };
    },
  };
}

export function snapshotToKtxEnrichedSchema(
  snapshot: KtxSchemaSnapshot,
  embeddingsByColumnId: ReadonlyMap<string, number[]> = new Map(),
): KtxEnrichedSchema {
  const tables: KtxEnrichedTable[] = snapshot.tables.map((table) => {
    const id = tableId(table);
    const ref = tableRef(table);
    const columns: KtxEnrichedColumn[] = table.columns.map((column) => {
      const idForColumn = columnId(table, column);
      return {
        id: idForColumn,
        tableId: id,
        tableRef: ref,
        name: column.name,
        nativeType: column.nativeType,
        normalizedType: column.normalizedType,
        dimensionType: column.dimensionType,
        nullable: column.nullable,
        primaryKey: column.primaryKey,
        parentColumnId: null,
        descriptions: {
          ...(column.comment ? { db: column.comment } : {}),
        },
        embedding: embeddingsByColumnId.get(idForColumn) ?? null,
        sampleValues: null,
        cardinality: null,
      };
    });
    return {
      id,
      ref,
      enabled: true,
      descriptions: {
        ...(table.comment ? { db: table.comment } : {}),
      },
      columns,
    };
  });

  return {
    connectionId: snapshot.connectionId,
    tables,
    relationships: formalRelationshipsFromSnapshot(snapshot, tables),
  };
}

function embeddingBatchSize(maxBatchSize: number): number {
  return Number.isInteger(maxBatchSize) && maxBatchSize > 0 ? maxBatchSize : 100;
}

async function generateDescriptions(input: {
  snapshot: KtxSchemaSnapshot;
  connector: KtxScanConnector;
  context: KtxScanContext;
  providers: KtxLocalScanEnrichmentProviders;
  progress?: KtxProgressPort;
  warnings?: KtxScanWarning[];
}): Promise<KtxLocalScanEnrichmentResult['descriptionUpdates']> {
  const warningSink = input.warnings;
  const generator = new KtxDescriptionGenerator({
    llmRuntime: input.providers.llmRuntime,
    ...(input.context.logger ? { logger: input.context.logger } : {}),
    ...(warningSink
      ? {
          onWarning: (warning: KtxScanWarning) => {
            warningSink.push(warning);
          },
        }
      : {}),
    settings: {
      columnMaxWords: 16,
      tableMaxWords: 24,
      dataSourceMaxWords: 32,
      concurrencyLimit: 4,
    },
  });

  const updates: KtxLocalScanEnrichmentResult['descriptionUpdates'] = [];
  const totalTables = input.snapshot.tables.length;
  if (totalTables === 0) {
    await input.progress?.update(1, 'No tables to describe');
    return updates;
  }
  const limitTable = pLimit(DESCRIPTION_TABLE_CONCURRENCY);
  const tableUpdates = await Promise.all(
    input.snapshot.tables.map((table, index) =>
      limitTable(async () => {
        await input.progress?.update(
          (index + 1) / totalTables,
          `Generating descriptions ${index + 1}/${totalTables} tables`,
          {
            transient: true,
          },
        );
        const batched = await generator.generateBatchedTableDescriptions({
          connectionId: input.snapshot.connectionId,
          connector: input.connector,
          context: input.context,
          dataSourceType: input.snapshot.driver,
          supportsNestedAnalysis: input.connector.capabilities.nestedAnalysis,
          table: {
            catalog: table.catalog,
            db: table.db,
            name: table.name,
            rawDescriptions: table.comment ? { db: table.comment } : {},
            columns: table.columns.map((column) => ({
              name: column.name,
              type: column.nativeType,
              ...(column.comment ? { rawDescriptions: { db: column.comment } } : {}),
            })),
          },
        });
        return {
          table: tableRef(table),
          tableDescription: batched.tableDescription,
          columnDescriptions: Object.fromEntries(batched.columnDescriptions),
        };
      }),
    ),
  );
  updates.push(...tableUpdates);
  await input.progress?.update(1, `Generated descriptions for ${totalTables} tables`);
  return updates;
}

async function buildEmbeddings(input: {
  snapshot: KtxSchemaSnapshot;
  embedding: KtxEmbeddingPort;
  descriptions: KtxLocalScanEnrichmentResult['descriptionUpdates'];
  progress?: KtxProgressPort;
}): Promise<{ updates: KtxEmbeddingUpdate[]; byColumnId: Map<string, number[]> }> {
  const descriptionByTable = new Map(input.descriptions.map((item) => [item.table.name, item]));
  const texts: Array<{ columnId: string; text: string }> = [];

  for (const table of input.snapshot.tables) {
    const tableDescriptions = descriptionByTable.get(table.name);
    for (const column of table.columns) {
      const id = columnId(table, column);
      const text = buildKtxColumnEmbeddingText({
        tableName: table.name,
        columnName: column.name,
        columnType: column.nativeType,
        resolvedDescription: tableDescriptions?.columnDescriptions[column.name] ?? column.comment,
        resolvedTableDescription: tableDescriptions?.tableDescription ?? table.comment,
        sampleValues: column.comment ? [column.comment] : null,
        foreignKeys: {
          outgoing: (table.foreignKeys ?? [])
            .filter((foreignKey) => foreignKey.fromColumn === column.name)
            .map((foreignKey) => ({ toTable: foreignKey.toTable, toColumn: foreignKey.toColumn })),
          incoming: [],
        },
      });
      texts.push({ columnId: id, text });
    }
  }

  const embeddings: number[][] = [];
  const maxBatchSize = embeddingBatchSize(input.embedding.maxBatchSize);
  const embeddingTexts = texts.map((item) => item.text);
  const batchCount = Math.ceil(embeddingTexts.length / maxBatchSize);
  if (batchCount === 0) {
    await input.progress?.update(1, 'No embeddings to build');
  }
  for (let offset = 0; offset < embeddingTexts.length; offset += maxBatchSize) {
    const batchIndex = Math.floor(offset / maxBatchSize) + 1;
    await input.progress?.update(batchIndex / batchCount, `Building embeddings ${batchIndex}/${batchCount} batches`, {
      transient: true,
    });
    const batch = embeddingTexts.slice(offset, offset + maxBatchSize);
    const batchEmbeddings = await input.embedding.embedBatch(batch);
    if (batchEmbeddings.length !== batch.length) {
      throw new Error(`expected ${batch.length} embeddings, received ${batchEmbeddings.length}`);
    }
    embeddings.push(...batchEmbeddings);
  }

  const byColumnId = new Map<string, number[]>();
  const updates = texts.map((item, index) => {
    const embedding = embeddings[index] ?? [];
    byColumnId.set(item.columnId, embedding);
    return {
      columnId: item.columnId,
      text: item.text,
      embedding,
    };
  });
  if (batchCount > 0) {
    await input.progress?.update(1, `Built embeddings for ${updates.length} columns`);
  }
  return { updates, byColumnId };
}

async function runEnrichmentStage<TOutput>(input: {
  stateStore: KtxScanEnrichmentStateStore | null | undefined;
  runId: string;
  connectionId: string;
  syncId: string;
  mode: KtxScanMode;
  stage: KtxScanEnrichmentStage;
  inputHash: string;
  now: () => Date;
  resumedStages: KtxScanEnrichmentStage[];
  completedStages: KtxScanEnrichmentStage[];
  failedStages: KtxScanEnrichmentStage[];
  compute: () => Promise<TOutput>;
}): Promise<TOutput> {
  const existing = await input.stateStore?.findCompletedStage<TOutput>({
    runId: input.runId,
    stage: input.stage,
    inputHash: input.inputHash,
  });
  if (existing) {
    input.resumedStages.push(input.stage);
    input.completedStages.push(input.stage);
    return existing.output;
  }

  try {
    const output = await input.compute();
    input.completedStages.push(input.stage);
    await input.stateStore?.saveCompletedStage({
      runId: input.runId,
      connectionId: input.connectionId,
      syncId: input.syncId,
      mode: input.mode,
      stage: input.stage,
      inputHash: input.inputHash,
      output,
      updatedAt: input.now().toISOString(),
    });
    return output;
  } catch (error) {
    input.failedStages.push(input.stage);
    await input.stateStore?.saveFailedStage({
      runId: input.runId,
      connectionId: input.connectionId,
      syncId: input.syncId,
      mode: input.mode,
      stage: input.stage,
      inputHash: input.inputHash,
      errorMessage: error instanceof Error ? error.message : String(error),
      updatedAt: input.now().toISOString(),
    });
    throw error;
  }
}

function embeddingsByColumnId(updates: KtxEmbeddingUpdate[]): Map<string, number[]> {
  return new Map(updates.map((update) => [update.columnId, update.embedding]));
}

export async function runLocalScanEnrichment(
  input: KtxLocalScanEnrichmentInput,
): Promise<KtxLocalScanEnrichmentResult> {
  const progress = input.context.progress;
  await progress?.update(0, 'Loading enrichment schema snapshot');
  const snapshot =
    input.snapshot ??
    (await input.connector.introspect(
      {
        connectionId: input.connectionId,
        driver: input.connector.driver,
        mode: input.mode,
        detectRelationships: input.detectRelationships,
      },
      input.context,
    ));
  await progress?.update(0.05, `Loaded schema snapshot with ${snapshot.tables.length} tables`);

  const now = input.now ?? (() => new Date());
  const state = completedKtxScanEnrichmentStateSummary();
  const syncId = input.syncId ?? input.context.runId;
  const relationshipSettings = input.relationshipSettings ?? buildDefaultKtxProjectConfig().scan.relationships;
  const inputHash = computeKtxScanEnrichmentInputHash({
    snapshot,
    mode: input.mode,
    detectRelationships: input.detectRelationships ?? false,
    providerIdentity: input.providerIdentity ?? {},
    relationshipSettings,
  });
  const warnings: KtxScanWarning[] = [];
  let descriptions: KtxLocalScanEnrichmentResult['descriptionUpdates'] = [];
  let embeddingUpdates: KtxEmbeddingUpdate[] = [];
  let schema = snapshotToKtxEnrichedSchema(snapshot);
  const summary: KtxScanEnrichmentSummary = { ...skippedKtxScanEnrichmentSummary };
  const relationshipDetectionEnabled = relationshipSettings.enabled;
  const shouldDetectRelationships =
    relationshipDetectionEnabled &&
    (input.mode === 'relationships' || input.mode === 'enriched' || (input.detectRelationships ?? false));

  if (input.mode === 'enriched' && !input.providers) {
    warnings.push(providerlessEnrichedWarning(shouldDetectRelationships));
  }

  if (input.mode === 'enriched' && input.providers) {
    const providers = input.providers;
    const descriptionProgress = progress?.startPhase(0.45);
    descriptions = await runEnrichmentStage({
      stateStore: input.stateStore,
      runId: input.context.runId,
      connectionId: input.connectionId,
      syncId,
      mode: input.mode,
      stage: 'descriptions',
      inputHash,
      now,
      resumedStages: state.resumedStages,
      completedStages: state.completedStages,
      failedStages: state.failedStages,
      compute: () =>
        generateDescriptions({
          snapshot,
          connector: input.connector,
          context: input.context,
          providers,
          progress: descriptionProgress,
          warnings,
        }),
    });
    summary.dataDictionary = input.connector.sampleColumn ? 'completed' : 'skipped';
    summary.tableDescriptions = 'completed';
    summary.columnDescriptions = 'completed';

    const embeddingProgress = progress?.startPhase(0.2);
    const embedding = providers.embedding;
    if (embedding) {
      embeddingUpdates = await runEnrichmentStage({
        stateStore: input.stateStore,
        runId: input.context.runId,
        connectionId: input.connectionId,
        syncId,
        mode: input.mode,
        stage: 'embeddings',
        inputHash,
        now,
        resumedStages: state.resumedStages,
        completedStages: state.completedStages,
        failedStages: state.failedStages,
        compute: async () => {
          const embeddings = await buildEmbeddings({
            snapshot,
            embedding,
            descriptions,
            progress: embeddingProgress,
          });
          return embeddings.updates;
        },
      });
      schema = snapshotToKtxEnrichedSchema(snapshot, embeddingsByColumnId(embeddingUpdates));
      summary.embeddings = 'completed';
    }
  }

  let relationshipUpdate: KtxRelationshipUpdate | null = null;
  let relationshipProfile: KtxRelationshipProfileArtifact | null = null;
  let resolvedRelationships: KtxResolvedRelationshipDiscoveryCandidate[] | null = null;
  let compositeRelationships: KtxCompositeRelationshipCandidate[] | null = null;
  let relationships: KtxScanRelationshipSummary = { accepted: 0, review: 0, rejected: 0, skipped: 0 };
  if (shouldDetectRelationships) {
    const relationshipProgress = progress?.startPhase(0.25);
    const relationshipStage = await runEnrichmentStage({
      stateStore: input.stateStore,
      runId: input.context.runId,
      connectionId: input.connectionId,
      syncId,
      mode: input.mode,
      stage: 'relationships',
      inputHash,
      now,
      resumedStages: state.resumedStages,
      completedStages: state.completedStages,
      failedStages: state.failedStages,
      compute: async () => {
        await relationshipProgress?.update(0, 'Detecting relationships');
        const detection = await discoverKtxRelationships({
          connectionId: input.connectionId,
          driver: snapshot.driver,
          connector: input.connector,
          schema,
          context: input.context,
          settings: relationshipSettings,
          llmRuntime: input.providers?.llmRuntime ?? null,
        });

        await relationshipProgress?.update(
          1,
          `Relationship detection found ${detection.relationships.accepted} accepted, ${detection.relationships.review} review`,
        );
        return {
          relationshipUpdate: detection.relationshipUpdate,
          relationshipProfile: detection.profile,
          resolvedRelationships: detection.resolvedRelationships,
          compositeRelationships: detection.compositeRelationships,
          relationships: detection.relationships,
          statisticalValidation: detection.statisticalValidation,
          llmRelationshipValidation: detection.llmRelationshipValidation,
          warnings: detection.warnings,
        };
      },
    });

    summary.deterministicRelationships = 'completed';
    summary.llmRelationshipValidation = relationshipStage.llmRelationshipValidation;
    summary.statisticalValidation = relationshipStage.statisticalValidation;
    relationshipUpdate = relationshipStage.relationshipUpdate;
    relationshipProfile = relationshipStage.relationshipProfile;
    resolvedRelationships = relationshipStage.resolvedRelationships;
    compositeRelationships = relationshipStage.compositeRelationships;
    relationships = relationshipStage.relationships;
    warnings.push(...relationshipStage.warnings);
  }

  await progress?.update(1, 'Enrichment complete');
  return {
    snapshot,
    summary,
    relationships,
    state: summarizeKtxScanEnrichmentState(state),
    warnings,
    descriptionUpdates: descriptions,
    embeddingUpdates,
    relationshipUpdate,
    relationshipProfile,
    resolvedRelationships,
    compositeRelationships,
  };
}