fix(snowflake): unblock multi-schema ingest and relationship discovery (#204)

* feat(setup): drop redundant Snowflake schema prompt; fall back to free-text on listSchemas failure

Snowflake setup previously asked for a single schema as free text, then
ran a multiselect against the discovered schemas — two schema questions
back-to-back, with the first being only a session bootstrap. The SDK's
`schema` is optional, so the bootstrap step is unnecessary.

- Remove the free-text Snowflake schema prompt; only pass `schema` to
  snowflake-sdk when one is configured.
- When `listSchemas()` fails (e.g. role lacks SHOW SCHEMAS), prompt the
  user for a comma-separated list, persist it as `schema_names`, and use
  it as both the table-list filter and the multiselect default. Applies
  to every driver with a scope-discovery spec, not just Snowflake.
- Update docs to lead with `schema_names`; keep `schema_name` as a
  documented single-schema shorthand.
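
A minimal sketch of the two behaviours described above, assuming hypothetical names (`SnowflakeSettings`, `parseSchemaList`, `toDriverOptions`) that do not appear in the actual change. De-duplicating the fallback answer is an assumption made for illustration, not something the commit states.

```typescript
// Hypothetical settings shape; the project's real config types are not shown here.
interface SnowflakeSettings {
  account: string;
  warehouse: string;
  database: string;
  schema?: string;
}

// Split the free-text fallback answer into schema names:
// trim whitespace, drop empty entries, and de-duplicate while keeping order.
function parseSchemaList(input: string): string[] {
  const seen = new Set<string>();
  const result: string[] = [];
  for (const part of input.split(',')) {
    const name = part.trim();
    if (name.length > 0 && !seen.has(name)) {
      seen.add(name);
      result.push(name);
    }
  }
  return result;
}

// Only forward `schema` to the driver when one is actually configured.
function toDriverOptions(settings: SnowflakeSettings): Record<string, string> {
  const options: Record<string, string> = {
    account: settings.account,
    warehouse: settings.warehouse,
    database: settings.database,
  };
  if (settings.schema !== undefined && settings.schema !== '') {
    options.schema = settings.schema;
  }
  return options;
}
```

The parsed list would then serve as both the persisted `schema_names` value and the multiselect default.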

* fix(snowflake): keep introspecting when primary-key discovery is denied

The PK query joins INFORMATION_SCHEMA.TABLE_CONSTRAINTS and
INFORMATION_SCHEMA.KEY_COLUMN_USAGE, which require grants the
connection role may not have. Previously a 'SQL compilation error:
Object ANALYTICS.INFORMATION_SCHEMA.KEY_COLUMN_USAGE does not exist
or not authorized' aborted the entire introspect — schemas, columns,
and row counts were all discarded over a missing nice-to-have.

Wrap the constraint query in try/catch, log a one-line warning per
schema, and return an empty PK map. Columns end up with
primaryKey=false; relationship inference still has FK and profiling
to fall back on.
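
The soft-fail pattern can be sketched roughly as follows. `loadPrimaryKeys`, `queryPrimaryKeys`, and `warn` are hypothetical names; the callback stands in for the INFORMATION_SCHEMA join and is not the connector's real API.

```typescript
// Table name -> set of primary-key column names.
type PrimaryKeyMap = Map<string, Set<string>>;

// Run the constraint query, but treat a failure (for example a missing grant)
// as "no primary keys known" instead of aborting the whole introspection.
async function loadPrimaryKeys(
  schema: string,
  queryPrimaryKeys: (schema: string) => Promise<PrimaryKeyMap>, // hypothetical callback
  warn: (message: string) => void,
): Promise<PrimaryKeyMap> {
  try {
    return await queryPrimaryKeys(schema);
  } catch (error) {
    const reason = error instanceof Error ? error.message : String(error);
    // One warning line per schema, then continue with an empty map.
    warn(`primary-key discovery skipped for schema ${schema}: ${reason}`);
    return new Map();
  }
}
```

With an empty map, every column simply reports `primaryKey=false` and the rest of the introspection result is kept.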

* fix(scan): unblock relationship discovery on Snowflake

Two adjacent bugs prevented the scan's relationship pipeline from producing
any joins on a Snowflake warehouse:

- relationship-profiling.ts fell through to a default `GROUP_CONCAT` branch
  for unknown drivers. Snowflake has no GROUP_CONCAT, so every per-table
  profile query failed with "Unknown function GROUP_CONCAT". Add an explicit
  Snowflake branch that uses LISTAGG with a literal '\x1f' delimiter
  (Snowflake requires the delimiter to be a constant, so CHR(31) is rejected).
- description-generation.ts destructured `connector.sampleTable` and
  `connector.sampleColumn` into bare locals, losing the `this` binding when
  the class-method connectors (Snowflake, Postgres, MySQL) were invoked.
  Every sample call threw "Cannot read properties of undefined (reading
  'assertConnection')" and degraded LLM descriptions to metadata-only
  prompts. Call the methods through the connector instead.

Without these, even after the primary-key probe is allowed to fail softly,
the scan ends up with 0 validated relationships and an empty `joins:` block
in every shard YAML.
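
Both bugs are easy to reproduce in isolation. The sketch below uses a hypothetical `SketchConnector` class and a hypothetical `snowflakeConcatSql` helper; neither is the project's real code, and emitting the delimiter as the `\x1f` escape sequence inside the SQL string constant is an assumption about how the literal is written.

```typescript
// Hypothetical connector, reduced to the parts that matter for the `this` bug.
class SketchConnector {
  private connected = true;

  private assertConnection(): void {
    if (!this.connected) throw new Error('not connected');
  }

  sampleTable(table: string): string {
    this.assertConnection(); // requires `this` to be the connector instance
    return `sample of ${table}`;
  }
}

const connector = new SketchConnector();

// Broken pattern: destructuring detaches the method from its instance,
// so the call inside fails with a TypeError at runtime.
function sampleDetached(table: string): string {
  const { sampleTable } = connector;
  return sampleTable(table);
}

// Fixed pattern: call through the connector so `this` stays bound.
function sampleBound(table: string): string {
  return connector.sampleTable(table);
}

// Snowflake branch of the profiling aggregate: LISTAGG with a constant
// string delimiter rather than a CHR(31) expression.
function snowflakeConcatSql(columnSql: string): string {
  return `LISTAGG(${columnSql}, '\\x1f')`;
}
```

Binding the methods (`connector.sampleTable.bind(connector)`) would also work; calling through the instance is the smaller change.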

* test(scan): cover table-ref helpers

* feat(scan): plumb tableScope through live-database introspection port

* feat(scan): apply tableScope during metadata fetch

* feat(scan): enforce table scope at fetch boundary

* feat(scan): pool Snowflake sessions and batch enrichment for faster ingest (#206)

* feat(cli): add RSA key-pair auth option to Snowflake setup wizard

Extends the interactive Snowflake setup flow with an authentication-method
prompt (password vs RSA/JWT key-pair). The RSA branch collects a private-key
path (env/file/absolute) and an optional passphrase; the resulting connection
config records `authMethod: 'rsa'` with `privateKey` and `passphrase` instead
of `password`.
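
One way to model the two branches is a discriminated union, sketched below. The type and function names are hypothetical, and two details are guesses rather than facts from this commit: that a bare path is stored with a `file:` prefix, and that the password branch records `authMethod: 'password'`.

```typescript
interface PasswordAuth {
  authMethod: 'password';
  password: string;
}

interface RsaAuth {
  authMethod: 'rsa';
  privateKey: string;
  passphrase?: string;
}

type SnowflakeAuth = PasswordAuth | RsaAuth;

// Hypothetical shape of the wizard answers for the auth step.
type AuthAnswers =
  | { method: 'password'; password: string }
  | { method: 'rsa'; privateKeyPath: string; passphrase: string };

// Keep explicit env:/file: references as-is; treat anything else as a file path.
function toSecretRef(value: string): string {
  return value.startsWith('env:') || value.startsWith('file:') ? value : `file:${value}`;
}

// Build the auth portion of the connection config from the wizard answers.
function buildAuth(answers: AuthAnswers): SnowflakeAuth {
  if (answers.method === 'rsa') {
    const auth: RsaAuth = { authMethod: 'rsa', privateKey: toSecretRef(answers.privateKeyPath) };
    if (answers.passphrase !== '') {
      auth.passphrase = answers.passphrase; // the passphrase is optional
    }
    return auth;
  }
  return { authMethod: 'password', password: answers.password };
}
```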

* feat(scan): pool Snowflake sessions

* fix(scan): reuse structural snapshots and cleanup connectors

* feat(scan): parallelize relationship profiling

* feat(scan): batch table description generation

* docs: document Snowflake ingest concurrency knobs

* fix(scan): close Snowflake ingest perf verification gaps

* fix(scan): keep batched description failure bounded

* feat(scan): dispatch query-history probes by connection driver

Extract historic-sql dialect resolution into a shared helper so the
status-project readiness check and the local ingest factory agree on
which connections enable query history and which probe to run. The
status command now picks the postgres/snowflake/bigquery probe based on
the connection's driver instead of always reporting against postgres,
which previously caused snowflake connections with queryHistory.enabled
to surface a misleading "driver is snowflake" failure.

Also drops a noisy console.warn from Snowflake primary-key discovery —
INFORMATION_SCHEMA.KEY_COLUMN_USAGE is commonly ungranted for read-only
roles and the FK + profiling paths handle the empty PK map already.
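
A shared resolver along these lines lets both call sites agree on which probe to run. This is a sketch under assumptions: `resolveHistoricSqlDialect` and `ConnectionLike` are hypothetical names, and the driver strings are taken from the probe list named above.

```typescript
type HistoricSqlDialect = 'postgres' | 'snowflake' | 'bigquery';

// Hypothetical, reduced view of a connection entry.
interface ConnectionLike {
  driver: string;
  queryHistory?: { enabled?: boolean };
}

// Returns the probe dialect for a connection, or null when query history is
// disabled or the driver has no query-history probe.
function resolveHistoricSqlDialect(connection: ConnectionLike): HistoricSqlDialect | null {
  if (connection.queryHistory?.enabled !== true) {
    return null;
  }
  switch (connection.driver) {
    case 'postgres':
      return 'postgres';
    case 'snowflake':
      return 'snowflake';
    case 'bigquery':
      return 'bigquery';
    default:
      return null;
  }
}
```

Because the status check and the ingest factory would both call the same function, a Snowflake connection can no longer be reported against the Postgres probe.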

* fix(llm): allow StructuredOutput tool and raise maxTurns for generateObject

The Claude Code agent SDK announces an internal pseudo-tool named
StructuredOutput in the system/init message whenever outputFormat is set
to { type: 'json_schema' }. The runtime's isolation check built its
allowedToolIds set only from MCP tool ids and treated StructuredOutput
as an unexpected host-injected tool, so every generateObject call threw
"Claude Code runtime isolation failed: tools=StructuredOutput ..." and
the table-descriptions and relationship-LLM-proposal enrichment stages
recorded null output across the board.

Whitelist StructuredOutput specifically in generateObject's
allowedToolIds — the check also enforces missing_tools symmetry, so
generateText and runAgentLoop, which do not see StructuredOutput, must
not require it.

generateObject also ran with maxTurns: 1, which the model intermittently
breached when it emitted thinking text before the structured response.
Raised to 5 to give the schema-bound call enough headroom without
allowing unbounded loops. The existing tests now exercise the path with
an init message that announces StructuredOutput so the regression cannot
slip back in.
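
The symmetry constraint is why the allowance has to be scoped to one call type. A rough sketch, with hypothetical function names and a made-up MCP tool id; the real isolation check is not shown in this commit:

```typescript
const STRUCTURED_OUTPUT_TOOL = 'StructuredOutput';

type RuntimeCall = 'generateObject' | 'generateText' | 'runAgentLoop';

// Only generateObject sets a JSON-schema output format, so only it should
// expect the SDK to announce the StructuredOutput pseudo-tool.
function allowedToolIds(call: RuntimeCall, mcpToolIds: string[]): Set<string> {
  const allowed = new Set(mcpToolIds);
  if (call === 'generateObject') {
    allowed.add(STRUCTURED_OUTPUT_TOOL);
  }
  return allowed;
}

interface IsolationResult {
  unexpectedTools: string[]; // announced by the SDK but not allowed
  missingTools: string[]; // allowed but never announced
}

// Symmetric check: the announced tool set must match the allowed set exactly.
function checkIsolation(announced: string[], allowed: Set<string>): IsolationResult {
  const announcedSet = new Set(announced);
  return {
    unexpectedTools: announced.filter((id) => !allowed.has(id)),
    missingTools: Array.from(allowed).filter((id) => !announcedSet.has(id)),
  };
}
```

Adding `StructuredOutput` to every call type would trade one failure for another: `generateText` would then report it under `missingTools`.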

* chore(scripts): add ktx-reset.sh project-cleanup helper

Convenience script for repeatable ingest testing: takes a project
directory and prunes everything except ktx.yaml and .ktx/secrets/, so
the next ktx setup or ktx ingest run starts from a known-clean state.
Andrey Avtomonov 2026-05-23 10:41:30 +02:00 committed by GitHub
parent b0dd13ce7c
commit 394a985d2a
72 changed files with 3508 additions and 655 deletions


@ -545,8 +545,8 @@ describe('setup databases step', () => {
},
{
driver: 'snowflake',
-selectValues: ['no'],
-textValues: ['', 'env:SNOWFLAKE_ACCOUNT', 'ANALYTICS_WH', 'ANALYTICS', '', 'env:SNOWFLAKE_USER', ''],
+selectValues: ['password', 'no'],
+textValues: ['', 'env:SNOWFLAKE_ACCOUNT', 'ANALYTICS_WH', 'ANALYTICS', 'env:SNOWFLAKE_USER', ''],
passwordValues: ['env:SNOWFLAKE_PASSWORD'],
expectedTextPrompts: [
{
@ -563,11 +563,6 @@ describe('setup databases step', () => {
{
message: 'Snowflake database name',
},
-{
-message: 'Snowflake schema\nPress Enter for PUBLIC, or enter a schema name.',
-placeholder: 'PUBLIC',
-initialValue: 'PUBLIC',
-},
{
message: 'Snowflake username',
},
@ -602,6 +597,8 @@ describe('setup databases step', () => {
prompts,
testConnection: vi.fn(async () => 0),
scanConnection: vi.fn(async () => 0),
listSchemas: vi.fn(async () => []),
listTables: vi.fn(async () => []),
},
);
@ -775,6 +772,8 @@ describe('setup databases step', () => {
});
const testConnection = vi.fn(async () => 0);
const scanConnection = vi.fn(async () => 0);
const listSchemas = vi.fn(async () => []);
const listTables = vi.fn(async () => []);
const result = await runKtxSetupDatabasesStep(
{
@ -785,7 +784,7 @@ describe('setup databases step', () => {
disableQueryHistory: true,
},
makeIo().io,
-{ prompts, testConnection, scanConnection },
+{ prompts, testConnection, scanConnection, listSchemas, listTables },
);
expect(result).toEqual({
@ -1692,6 +1691,62 @@ describe('setup databases step', () => {
expect(io.stdout()).toContain('✓ orbit_analytics, orbit_raw');
});
it('falls back to comma-separated free-text when listSchemas fails interactively', async () => {
const io = makeIo();
const prompts = makePromptAdapter({
selectValues: ['url'],
textValues: ['', 'env:DATABASE_URL', 'orbit_analytics, orbit_raw'],
});
const testConnection = vi.fn(async () => 0);
const scanConnection = vi.fn(async () => 0);
const listSchemas = vi.fn(async () => {
throw new Error('permission denied to list schemas');
});
const listTables = vi.fn(async (_projectDir: string, _connectionId: string, schemas?: string[]) =>
(schemas ?? []).map((schema) => ({ schema, name: 'events', kind: 'table' as const })),
);
const pickers = makePickerStubs({
scopes: [
{
schemas: ['orbit_analytics', 'orbit_raw'],
tables: ['orbit_analytics.events', 'orbit_raw.events'],
},
],
});
const result = await runKtxSetupDatabasesStep(
{
projectDir: tempDir,
inputMode: 'auto',
databaseDrivers: ['postgres'],
databaseSchemas: [],
skipDatabases: false,
},
io.io,
{
prompts,
testConnection,
scanConnection,
listSchemas,
listTables,
pickDatabaseScope: pickers.pickDatabaseScope,
},
);
expect(result.status).toBe('ready');
expect(io.stderr()).toContain('Could not discover postgresql schemas');
expect(vi.mocked(prompts.text).mock.calls.map(([options]) => options.message)).toContain(
textInputPrompt(
'Enter schemas for postgres-warehouse as a comma-separated list (e.g. SALES, MARKETING).',
),
);
expect(pickers.scopeCalls[0]).toMatchObject({
schemas: ['orbit_analytics', 'orbit_raw'],
initialSchemas: ['orbit_analytics', 'orbit_raw'],
schemaSuggestion: { suggested: new Set(['orbit_analytics', 'orbit_raw']) },
});
});
it('passes schemas and a lazy table callback to the scope picker instead of eager table discovery', async () => {
const listSchemas = vi.fn(async () => ['analytics', 'raw']);
const listTables = vi.fn(async (_projectDir: string, _connectionId: string, schemas?: string[]) =>
@ -2015,6 +2070,7 @@ describe('setup databases step', () => {
it('writes query history config for supported Snowflake databases after validation succeeds', async () => {
const io = makeIo();
const historicSqlProbe = vi.fn(async () => ({ ok: true, lines: [] }));
const result = await runKtxSetupDatabasesStep(
{
projectDir: tempDir,
@ -2032,12 +2088,21 @@ describe('setup databases step', () => {
{
testConnection: vi.fn(async () => 0),
scanConnection: vi.fn(async () => 0),
historicSqlProbe,
prompts: makePromptAdapter({
-textValues: ['env:SNOWFLAKE_ACCOUNT', 'WH', 'ANALYTICS', 'PUBLIC', 'reader', ''],
+selectValues: ['password'],
+textValues: ['env:SNOWFLAKE_ACCOUNT', 'WH', 'ANALYTICS', 'reader', ''],
passwordValues: ['env:SNOWFLAKE_PASSWORD'],
}),
},
);
expect(historicSqlProbe).toHaveBeenCalledWith(
expect.objectContaining({
projectDir: tempDir,
connectionId: 'snowflake',
dialect: 'snowflake',
}),
);
expect(result.status).toBe('ready');
const configText = await readFile(join(tempDir, 'ktx.yaml'), 'utf-8');
@ -2067,6 +2132,51 @@ describe('setup databases step', () => {
expect(config.ingest.adapters).toEqual([]);
});
it('configures Snowflake with RSA key-pair auth via setup wizard', async () => {
const io = makeIo();
const result = await runKtxSetupDatabasesStep(
{
projectDir: tempDir,
inputMode: 'disabled',
databaseDrivers: ['snowflake'],
databaseConnectionId: 'snowflake',
databaseSchemas: [],
skipDatabases: false,
},
io.io,
{
testConnection: vi.fn(async () => 0),
scanConnection: vi.fn(async () => 0),
prompts: makePromptAdapter({
selectValues: ['rsa'],
textValues: [
'env:SNOWFLAKE_ACCOUNT',
'WH',
'ANALYTICS',
'reader',
'~/.ssh/snowflake_rsa_key.p8',
'',
],
passwordValues: ['env:SNOWFLAKE_KEY_PASS'],
}),
},
);
expect(result.status).toBe('ready');
const config = parseKtxProjectConfig(await readFile(join(tempDir, 'ktx.yaml'), 'utf-8'));
expect(config.connections.snowflake).toMatchObject({
driver: 'snowflake',
authMethod: 'rsa',
account: 'env:SNOWFLAKE_ACCOUNT',
warehouse: 'WH',
database: 'ANALYTICS',
username: 'reader',
privateKey: 'file:~/.ssh/snowflake_rsa_key.p8', // pragma: allowlist secret
passphrase: 'env:SNOWFLAKE_KEY_PASS', // pragma: allowlist secret
});
expect(config.connections.snowflake.password).toBeUndefined();
});
it('writes Postgres query history config with minExecutions and ignores window/redaction output', async () => {
const io = makeIo();
const result = await runKtxSetupDatabasesStep(
@ -2427,7 +2537,53 @@ describe('setup databases step', () => {
expect(io.stdout()).toContain('Query history probe...');
expect(io.stdout()).not.toContain('Historic SQL probe...');
expect(io.stdout()).toContain('pg_stat_statements extension is not installed');
-expect(io.stdout()).toContain('Setup written; first ingest run will fail until fixed.');
+expect(io.stdout()).toContain('Setup written; query history will be skipped until fixed.');
});
it('prints a non-blocking Snowflake query history probe failure with the grants remediation', async () => {
const io = makeIo();
const historicSqlProbe = vi.fn(async () => ({
ok: false,
lines: [
' FAIL Snowflake role cannot read SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY',
' Fix: Run (as ACCOUNTADMIN): GRANT IMPORTED PRIVILEGES ON DATABASE SNOWFLAKE TO ROLE <connection role>;',
],
}));
const result = await runKtxSetupDatabasesStep(
{
projectDir: tempDir,
inputMode: 'disabled',
databaseDrivers: ['snowflake'],
databaseConnectionId: 'warehouse',
databaseSchemas: [],
enableQueryHistory: true,
skipDatabases: false,
},
io.io,
{
testConnection: vi.fn(async () => 0),
scanConnection: vi.fn(async () => 0),
historicSqlProbe,
prompts: makePromptAdapter({
textValues: ['env:SNOWFLAKE_ACCOUNT', 'WH', 'ANALYTICS', 'reader', ''],
passwordValues: ['env:SNOWFLAKE_PASSWORD'],
}),
},
);
expect(result.status).toBe('ready');
expect(historicSqlProbe).toHaveBeenCalledWith(
expect.objectContaining({
projectDir: tempDir,
connectionId: 'warehouse',
dialect: 'snowflake',
}),
);
expect(io.stdout()).toContain('Query history probe...');
expect(io.stdout()).toContain('Snowflake role cannot read SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY');
expect(io.stdout()).toContain('GRANT IMPORTED PRIVILEGES ON DATABASE SNOWFLAKE');
expect(io.stdout()).toContain('Setup written; query history will be skipped until fixed.');
});
it('does not run the query history probe when the regular connection test fails', async () => {