ktx/packages/cli/test/context/connections/dialects.test.ts
Pintouch 2afab61417
feat(connectors): add MongoDB connector (#305) (#310)
* refactor(connectors): split KtxDialect into core and KtxSqlDialect

Separate the dialect contract into a driver-agnostic core (display/ref
formatting and type mapping) and a SQL-only extension (query generators).
The catalog and entity-details paths resolve the core dialect for any
snapshot driver, so it must stay free of SQL generation; this is the
prerequisite refactor for adding non-SQL primary sources.

- KtxDialect keeps type, formatDisplayRef, parseDisplayRef,
  columnDisplayTablePartCount, mapDataType, mapToDimensionType
- KtxSqlDialect extends it with quoteIdentifier, formatTableName, and the
  query/sample/statistics generators; the 7 SQL dialects implement it
- add getSqlDialectForDriver for SQL drivers; the 7 connectors and the
  relationship-benchmark harness consume it
- thread the relationship pipeline (profiling/validation/composite/
  discovery) as KtxSqlDialect | null so a non-SQL source skips coverage SQL
  and its candidates stay in review; local-enrichment builds the SQL
  dialect only when the connector advertises readOnlySql

Pure extraction: no behavior change for the existing 7 drivers.
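
A minimal sketch of the split described above, assuming simplified shapes.
CoreDialect, SqlDialect, coreDialectFor and sqlDialectOrNull are hypothetical
stand-ins for KtxDialect, KtxSqlDialect and the two lookup functions, and the
real interfaces carry more members than shown here:

```typescript
// Hypothetical, trimmed-down table reference.
interface TableRef {
  catalog: string | null;
  db: string | null;
  name: string;
}

// Driver-agnostic core: display formatting only, no SQL generation.
interface CoreDialect {
  type: string;
  formatDisplayRef(table: TableRef): string;
}

// SQL-only extension layered on top of the core contract.
interface SqlDialect extends CoreDialect {
  quoteIdentifier(identifier: string): string;
}

// Join the non-null parts of a table reference with dots.
const displayRef = (table: TableRef): string =>
  [table.catalog, table.db, table.name].filter((part): part is string => part !== null).join('.');

const postgresLike: SqlDialect = {
  type: 'postgres',
  formatDisplayRef: displayRef,
  quoteIdentifier: (identifier) => `"${identifier.replace(/"/g, '""')}"`,
};

// A non-SQL source implements the core contract only.
const mongoLike: CoreDialect = { type: 'mongodb', formatDisplayRef: displayRef };

const coreRegistry = new Map<string, CoreDialect>([
  ['postgres', postgresLike],
  ['mongodb', mongoLike],
]);
const sqlRegistry = new Map<string, SqlDialect>([['postgres', postgresLike]]);

// Every known driver resolves to a core dialect.
function coreDialectFor(driver: string): CoreDialect {
  const dialect = coreRegistry.get(driver);
  if (!dialect) throw new Error(`Unsupported driver "${driver}"`);
  return dialect;
}

// Only SQL drivers resolve to a SQL dialect; others yield null so that
// callers can skip SQL-based steps.
function sqlDialectOrNull(driver: string): SqlDialect | null {
  return sqlRegistry.get(driver) ?? null;
}
```

With this shape, a pipeline stage that receives null can skip coverage SQL
instead of falling back to some default SQL dialect.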

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(connectors): add MongoDB connector for issue #305

Add a read-only MongoDB connector that treats a database as a primary
context source: collections map to tables and inferred top-level fields to
columns. MongoDB is the first non-SQL source (readOnlySql: false), so
ktx sql and metric compilation do not apply, but its collections flow
through ingest, descriptions, and relationship discovery.

- schema-inference: infer a flat column schema from the most recent
  sample_size documents (by _id desc, or order_by for non-ObjectId keys).
  Union BSON types per field, mark multi-type fields mixed (string), keep
  sub-documents/arrays as a single opaque json column, derive nullability
  from presence, treat _id as the primary key
- connector: KtxMongoDbScanConnector behind an injectable client seam;
  strictly read-only (find/listCollections/estimatedDocumentCount only),
  no executeReadOnly; resolves env:/file: via resolveKtxConfigReference
- core-only KtxMongoDbDialect and a live-database introspection adapter
- wire the mongodb driver: driver union, dialect registry, driver
  registration (scopeConfigKey databases), mongodbConnectionSchema,
  connection-drivers, normalizeDriver, the live-database route, and the
  ktx setup picker. ktx sql is refused by the read-only SQL capability gate
- tests: schema inference, connector snapshot via a fake client, dialect,
  driver-schema parsing, and the ktx sql rejection
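
The schema-inference rules listed above can be sketched roughly as follows.
This is an illustration only: it uses plain JavaScript values instead of BSON
types, inferColumns and InferredColumn are hypothetical names, and the
'unknown' label for fields that are only ever null is an assumption:

```typescript
type SampleDocument = Record<string, unknown>;

interface InferredColumn {
  name: string;
  type: string;
  nullable: boolean;
  primaryKey: boolean;
}

// Classify one value; sub-documents and arrays collapse to an opaque 'json'.
function valueKind(value: unknown): string {
  if (Array.isArray(value)) return 'json';
  if (value instanceof Date) return 'date';
  if (typeof value === 'object' && value !== null) return 'json';
  return typeof value;
}

function inferColumns(docs: SampleDocument[]): InferredColumn[] {
  const kinds = new Map<string, Set<string>>(); // field -> kinds seen
  const present = new Map<string, number>(); // field -> non-null occurrences

  for (const doc of docs) {
    for (const [field, value] of Object.entries(doc)) {
      let seen = kinds.get(field);
      if (!seen) {
        seen = new Set<string>();
        kinds.set(field, seen);
      }
      if (value === null || value === undefined) continue;
      seen.add(valueKind(value));
      present.set(field, (present.get(field) ?? 0) + 1);
    }
  }

  return Array.from(kinds.entries()).map(([name, seen]) => {
    const distinct = Array.from(seen);
    let type: string;
    if (distinct.length === 0) type = 'unknown';
    else if (distinct.length === 1) type = distinct[0] as string;
    else type = 'mixed'; // more than one kind observed for the same field
    return {
      name,
      type,
      // Nullable when the field is null or absent in any sampled document.
      nullable: (present.get(name) ?? 0) < docs.length,
      primaryKey: name === '_id',
    };
  });
}
```

Because nullability comes from presence in the sample, a field that is absent
only in unsampled documents would still be reported as non-nullable.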

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(integrations): document the MongoDB primary source

Add a MongoDB section to the primary-sources reference: connection config
(url, databases, enabled_tables, sample_size, order_by), mongodb+srv/TLS/
Atlas notes, the schema-inference explainer, a features matrix, and the
non-SQL caveat. Update the frontmatter and connection field reference.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(connectors): address review blockers on the MongoDB connector

- introspect: skip estimatedDocumentCount for views. The count command is
  rejected on a MongoDB view (CommandNotSupportedOnView), so counting a view
  aborted introspect for the whole connection; compute estimatedRows only for
  real collections, as ClickHouse does.
- sl: refuse a semantic-layer query against a non-SQL connection instead of
  defaulting it to the Postgres dialect. compileLocalSlQuery (the shared CLI +
  MCP path) now rejects a driver with no SQL dialect via the new
  isSqlQueryableDriver authority, keeping MongoDB context-only per issue #305.
- tests: cover input.tableScope and the empty-scope skip for the Mongo
  connector (the scan layer does not post-filter), the view no-count path, and
  the ktx sl query refusal for a mongodb connection.
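
A small sketch of the view handling, assuming a synchronous, injected count
function for brevity (the real driver call is asynchronous). CollectionInfo
and estimateRows are hypothetical names:

```typescript
// Hypothetical subset of the metadata returned when listing collections.
interface CollectionInfo {
  name: string;
  type: string; // e.g. 'collection' or 'view'
}

// Count only real collections; a view gets null instead of a count,
// so one view cannot abort introspection for the whole connection.
function estimateRows(
  infos: CollectionInfo[],
  count: (collectionName: string) => number,
): Map<string, number | null> {
  const estimates = new Map<string, number | null>();
  for (const info of infos) {
    estimates.set(info.name, info.type === 'collection' ? count(info.name) : null);
  }
  return estimates;
}
```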

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* polish(mongodb): compute sampled nullCount and document sampling caveats

Address the non-blocking review notes:

- sampleColumn now counts null/absent values over the sampled window instead of
  returning nullCount: null, since the documents are already in hand
- warn that a custom order_by must be indexed (an unindexed sort hits MongoDB's
  in-memory sort limit on large collections) in the connection schema and docs
- note that sampled values for nested fields are stringified, not faithfully
  serialized, so the json opacity is deliberate
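
Counting null or absent values over documents that are already in memory can
be as simple as the sketch below. countNullOrAbsent is a hypothetical helper,
not the connector's actual sampleColumn:

```typescript
// Count sampled documents where the field is null or missing entirely.
function countNullOrAbsent(docs: Array<Record<string, unknown>>, field: string): number {
  return docs.filter((doc) => doc[field] === null || doc[field] === undefined).length;
}
```

The result describes the sampled window only, not the whole collection.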

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(examples): add a MongoDB connector example

A manual, container-backed example mirroring examples/postgres-historic:

- docker-compose.yml + init/seed.js seed a representative dataset (nested
  documents, arrays, a Decimal128, a mixed-type field, a nullable field, an
  ObjectId reference, and a view) on first container start
- scripts/smoke.sh + introspect-smoke.mjs assert the connector's inferred
  schema with no LLM credentials — the same introspection entry point ktx
  ingest's database-schema stage uses, including the view-no-count path
- README.md documents the smoke and a full keyless ktx ingest run
  (claude-code LLM + managed sentence-transformers embeddings)

Works with Docker Compose or podman compose. Verified end to end.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore: ignore examples/** in knip to fix dead-code false positives

The MongoDB connector example files (examples/mongodb/init/seed.js and
examples/mongodb/scripts/introspect-smoke.mjs) are used at runtime but were
flagged as unused by knip. Add examples/** to the ignore array, matching the
existing .context/** entry.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_0114qQV8fJ5a5ME3XbMVRzbL

* fix(mongodb): refuse non-SQL connections before SQL analysis

`ktx sql` and the MCP sql_execution tool resolved a SQL-analysis dialect
(falling back to Postgres for a non-SQL driver) and ran read-only
validation before the connector capability gate refused the connection.
For a MongoDB connection, this spun up the parser/daemon and produced
Postgres parser diagnostics instead of a clean non-SQL refusal.

Route both entry points through a shared assertSqlQueryableConnection
guard before dialect selection, mirroring compileLocalSlQuery. The
federated duckdb path has no driver and is exempted at each call site.
Add CLI and MCP regression tests asserting validation/connector work
never starts for a MongoDB connection.
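
The ordering this fix enforces can be sketched as below. assertSqlQueryable and
runSql are hypothetical stand-ins for the shared guard and an entry point, and
the driver set is taken from the seven SQL drivers covered by the dialect tests:

```typescript
const SQL_DRIVERS = new Set([
  'postgres',
  'mysql',
  'clickhouse',
  'sqlite',
  'snowflake',
  'bigquery',
  'sqlserver',
]);

// Refuse a non-SQL driver with a clear message.
function assertSqlQueryable(driver: string): void {
  if (!SQL_DRIVERS.has(driver)) {
    throw new Error(`Connection driver "${driver}" does not support SQL queries`);
  }
}

// driver === null models the federated path, which has no driver and is exempt.
function runSql(driver: string | null, sql: string, validate: (sql: string) => void): string {
  if (driver !== null) assertSqlQueryable(driver); // guard runs first
  validate(sql); // dialect selection and validation only start after the guard
  return `validated for ${driver ?? 'federated'}`;
}
```

Running the guard before validation means a refused connection never reaches
the parser, so no dialect-specific diagnostics can leak into the error.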

* fix(mongodb): pass CI gates (dialect boundary, secrets, setup test)

Three latent failures in the connector surfaced once CI ran on the branch:

- connector.ts imported the concrete KtxMongoDbDialect, which the connector
  dialect-import boundary forbids. Route it through getDialectForDriver('mongodb')
  and widen inferKtxMongoCollectionColumns to the base KtxDialect (it only uses
  mapDataType/mapToDimensionType).
- detect-secrets flagged a test ObjectId hex and the mongodb+srv example URL;
  annotate both with allowlist pragmas.
- the "shows every supported database" setup test omitted the new MongoDB option.

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-authored-by: Luca Martial <48870843+luca-martial@users.noreply.github.com>
Co-authored-by: Luca Martial <lucamrtl@gmail.com>
Co-authored-by: Andrey Avtomonov <andreybavt@gmail.com>
2026-06-29 15:17:56 +02:00


import { describe, expect, it } from 'vitest';
import { getDialectForDriver, getSqlDialectForDriver } from '../../../src/context/connections/dialects.js';
import type { KtxConnectionDriver, KtxTableRef } from '../../../src/context/scan/types.js';

interface DialectFixture {
  driver: KtxConnectionDriver;
  table: KtxTableRef;
  quoteInput: string;
  quotedIdentifier: string;
  formattedTable: string;
  display: string;
  invalidDisplay: string;
  columnDisplayTablePartCount: 1 | 2 | 3;
  limitClause: string;
  topClause: string;
  randomFilter: string;
  tableSampleClause: string;
  sampleQuery: string;
  columnSampleContains: string;
  nullCountExpression: string;
  distinctCountExpression: string;
  textLengthExpression: string;
  castToText: string;
  sampleValueAggregation: string;
  cardinalityContains: string;
  randomizedCardinalityContains: string;
  distinctValuesContains: string;
  statisticsContains: string | null;
  dimensionInput: string;
  dimensionType: 'time' | 'string' | 'number' | 'boolean';
  nativeTypeInput: string;
  normalizedType: string;
}

const innerSampleSql = 'SELECT status AS value FROM orders';

const fixtures: DialectFixture[] = [
  {
    driver: 'postgres',
    table: { catalog: null, db: 'public', name: 'orders' },
    quoteInput: 'order"items',
    quotedIdentifier: '"order""items"',
    formattedTable: '"public"."orders"',
    display: 'public.orders',
    invalidDisplay: 'orders',
    columnDisplayTablePartCount: 2,
    limitClause: 'LIMIT 25 OFFSET 5',
    topClause: '',
    randomFilter: 'RANDOM() < 0.25',
    tableSampleClause: 'TABLESAMPLE SYSTEM (25)',
    sampleQuery: 'SELECT "id", "status" FROM "public"."orders" LIMIT 5',
    columnSampleContains: 'TRIM(CAST("status" AS TEXT)) != \'\'',
    nullCountExpression: 'COUNT(*) FILTER (WHERE "status" IS NULL)',
    distinctCountExpression: 'COUNT(DISTINCT "status")',
    textLengthExpression: 'LENGTH(CAST("status" AS TEXT))',
    castToText: 'CAST("status" AS TEXT)',
    sampleValueAggregation:
      '(SELECT STRING_AGG(CAST(value AS TEXT), CHR(31)) FROM (SELECT status AS value FROM orders) AS relationship_profile_values)',
    cardinalityContains: 'SELECT COUNT(DISTINCT val) AS cardinality',
    randomizedCardinalityContains: 'ORDER BY RANDOM()',
    distinctValuesContains: 'SELECT DISTINCT "status"::text AS val',
    statisticsContains: 'FROM pg_stats s',
    dimensionInput: 'timestamp with time zone',
    dimensionType: 'time',
    nativeTypeInput: 'numeric(12,2)',
    normalizedType: 'numeric(12,2)',
  },
  {
    driver: 'mysql',
    table: { catalog: null, db: 'analytics', name: 'orders' },
    quoteInput: 'order`items',
    quotedIdentifier: '`order``items`',
    formattedTable: '`analytics`.`orders`',
    display: 'analytics.orders',
    invalidDisplay: 'orders',
    columnDisplayTablePartCount: 2,
    limitClause: 'LIMIT 25 OFFSET 5',
    topClause: '',
    randomFilter: 'RAND() < 0.25',
    tableSampleClause: '',
    sampleQuery: 'SELECT `id`, `status` FROM `analytics`.`orders` LIMIT 5',
    columnSampleContains: 'TRIM(CAST(`status` AS CHAR)) != \'\'',
    nullCountExpression: 'SUM(CASE WHEN `status` IS NULL THEN 1 ELSE 0 END)',
    distinctCountExpression: 'COUNT(DISTINCT `status`)',
    textLengthExpression: 'CHAR_LENGTH(CAST(`status` AS CHAR))',
    castToText: 'CAST(`status` AS CHAR)',
    sampleValueAggregation:
      '(SELECT GROUP_CONCAT(CAST(value AS CHAR) SEPARATOR CHAR(31)) FROM (SELECT status AS value FROM orders) AS relationship_profile_values)',
    cardinalityContains: 'SELECT COUNT(DISTINCT val) AS cardinality',
    randomizedCardinalityContains: 'ORDER BY RAND()',
    distinctValuesContains: 'SELECT DISTINCT CAST(`status` AS CHAR) AS val',
    statisticsContains: 'INFORMATION_SCHEMA.STATISTICS',
    dimensionInput: 'tinyint(1)',
    dimensionType: 'boolean',
    nativeTypeInput: 'varchar(255)',
    normalizedType: 'varchar(255)',
  },
  {
    driver: 'clickhouse',
    table: { catalog: null, db: 'analytics', name: 'events' },
    quoteInput: 'order`items',
    quotedIdentifier: '`order``items`',
    formattedTable: '`analytics`.`events`',
    display: 'analytics.events',
    invalidDisplay: 'events',
    columnDisplayTablePartCount: 2,
    limitClause: 'LIMIT 25 OFFSET 5',
    topClause: '',
    randomFilter: 'rand() / 4294967295.0 < 0.25',
    tableSampleClause: '',
    sampleQuery: 'SELECT `id`, `status` FROM `analytics`.`events` LIMIT 5',
    columnSampleContains: 'trim(toString(`status`)) != \'\'',
    nullCountExpression: 'countIf(`status` IS NULL)',
    distinctCountExpression: 'COUNT(DISTINCT `status`)',
    textLengthExpression: 'length(toString(`status`))',
    castToText: 'toString(`status`)',
    sampleValueAggregation:
      '(SELECT arrayStringConcat(groupArray(toString(value)), \'\\x1F\') FROM (SELECT status AS value FROM orders) AS relationship_profile_values)',
    cardinalityContains: 'SELECT COUNT(DISTINCT val) AS cardinality',
    randomizedCardinalityContains: 'ORDER BY rand()',
    distinctValuesContains: 'SELECT DISTINCT toString(`status`) AS val',
    statisticsContains: null,
    dimensionInput: 'Nullable(DateTime64(3))',
    dimensionType: 'time',
    nativeTypeInput: 'LowCardinality(String)',
    normalizedType: 'LowCardinality(String)',
  },
  {
    driver: 'sqlite',
    table: { catalog: null, db: null, name: 'orders' },
    quoteInput: 'order"items',
    quotedIdentifier: '"order""items"',
    formattedTable: '"orders"',
    display: 'orders',
    invalidDisplay: 'public.orders',
    columnDisplayTablePartCount: 1,
    limitClause: 'LIMIT 25 OFFSET 5',
    topClause: '',
    randomFilter: '(RANDOM() % 100) < 25',
    tableSampleClause: '',
    sampleQuery: 'SELECT "id", "status" FROM "orders" LIMIT 5',
    columnSampleContains: 'TRIM(CAST("status" AS TEXT)) != \'\'',
    nullCountExpression: 'SUM(CASE WHEN "status" IS NULL THEN 1 ELSE 0 END)',
    distinctCountExpression: 'COUNT(DISTINCT "status")',
    textLengthExpression: 'LENGTH(CAST("status" AS TEXT))',
    castToText: 'CAST("status" AS TEXT)',
    sampleValueAggregation:
      '(SELECT GROUP_CONCAT(CAST(value AS TEXT), char(31)) FROM (SELECT status AS value FROM orders) AS relationship_profile_values)',
    cardinalityContains: 'SELECT COUNT(DISTINCT val) AS cardinality',
    randomizedCardinalityContains: 'ORDER BY RANDOM()',
    distinctValuesContains: 'SELECT DISTINCT CAST("status" AS TEXT) AS val',
    statisticsContains: null,
    dimensionInput: 'INTEGER',
    dimensionType: 'number',
    nativeTypeInput: 'VARCHAR(255)',
    normalizedType: 'VARCHAR(255)',
  },
  {
    driver: 'snowflake',
    table: { catalog: 'ANALYTICS', db: 'PUBLIC', name: 'ORDERS' },
    quoteInput: 'order"items',
    quotedIdentifier: '"order""items"',
    formattedTable: '"ANALYTICS"."PUBLIC"."ORDERS"',
    display: 'ANALYTICS.PUBLIC.ORDERS',
    invalidDisplay: 'PUBLIC.ORDERS',
    columnDisplayTablePartCount: 3,
    limitClause: 'LIMIT 25 OFFSET 5',
    topClause: '',
    randomFilter: 'UNIFORM(0::FLOAT, 1::FLOAT, RANDOM()) < 0.25',
    tableSampleClause: 'SAMPLE (25)',
    sampleQuery: 'SELECT "id", "status" FROM "ANALYTICS"."PUBLIC"."ORDERS" SAMPLE ROW (5 ROWS)',
    columnSampleContains: 'TRIM(CAST("status" AS STRING)) != \'\'',
    nullCountExpression: 'COUNT_IF("status" IS NULL)',
    distinctCountExpression: 'APPROX_COUNT_DISTINCT("status")',
    textLengthExpression: 'LENGTH(CAST("status" AS TEXT))',
    castToText: 'CAST("status" AS VARCHAR)',
    sampleValueAggregation:
      '(SELECT LISTAGG(CAST(value AS VARCHAR), \'\\x1f\') FROM (SELECT status AS value FROM orders) AS relationship_profile_values)',
    cardinalityContains: 'SELECT COUNT(DISTINCT val) AS cardinality',
    randomizedCardinalityContains: 'SAMPLE ROW (100 ROWS)',
    distinctValuesContains: 'SELECT DISTINCT "status"::VARCHAR AS val',
    statisticsContains: null,
    dimensionInput: 'TIMESTAMP_NTZ',
    dimensionType: 'time',
    nativeTypeInput: 'NUMBER(38,0)',
    normalizedType: 'NUMBER(38,0)',
  },
  {
    driver: 'bigquery',
    table: { catalog: 'analytics-project', db: 'warehouse', name: 'orders' },
    quoteInput: 'order`items',
    quotedIdentifier: '`order\\`items`',
    formattedTable: '`analytics-project`.`warehouse`.`orders`',
    display: 'analytics-project.warehouse.orders',
    invalidDisplay: 'warehouse.orders',
    columnDisplayTablePartCount: 3,
    limitClause: 'LIMIT 25 OFFSET 5',
    topClause: '',
    randomFilter: 'RAND() < 0.25',
    tableSampleClause: 'TABLESAMPLE SYSTEM (25 PERCENT)',
    sampleQuery: 'SELECT `id`, `status` FROM `analytics-project`.`warehouse`.`orders` ORDER BY RAND() LIMIT 5',
    columnSampleContains: 'TRIM(CAST(`status` AS STRING)) != \'\'',
    nullCountExpression: 'COUNTIF(`status` IS NULL)',
    distinctCountExpression: 'APPROX_COUNT_DISTINCT(`status`)',
    textLengthExpression: 'LENGTH(CAST(`status` AS STRING))',
    castToText: 'CAST(`status` AS STRING)',
    sampleValueAggregation:
      '(SELECT STRING_AGG(CAST(value AS STRING), \'\\u001F\') FROM (SELECT status AS value FROM orders) AS relationship_profile_values)',
    cardinalityContains: 'SELECT APPROX_COUNT_DISTINCT(val) AS cardinality',
    randomizedCardinalityContains: 'ORDER BY RAND()',
    distinctValuesContains: 'SELECT DISTINCT CAST(`status` AS STRING) AS val',
    statisticsContains: null,
    dimensionInput: 'INT64',
    dimensionType: 'number',
    nativeTypeInput: 'INT64',
    normalizedType: 'BIGINT',
  },
  {
    driver: 'sqlserver',
    table: { catalog: 'warehouse', db: 'dbo', name: 'events' },
    quoteInput: 'odd]name',
    quotedIdentifier: '[odd]]name]',
    formattedTable: '[warehouse].[dbo].[events]',
    display: 'warehouse.dbo.events',
    invalidDisplay: 'dbo.events',
    columnDisplayTablePartCount: 3,
    limitClause: '',
    topClause: 'TOP (25)',
    randomFilter: 'ABS(CHECKSUM(NEWID())) % 100 < 25',
    tableSampleClause: 'TABLESAMPLE (25 PERCENT)',
    sampleQuery: 'SELECT TOP 5 [id], [status] FROM [warehouse].[dbo].[events]',
    columnSampleContains: 'LTRIM(RTRIM(CAST([status] AS NVARCHAR(MAX)))) != \'\'',
    nullCountExpression: 'SUM(CASE WHEN [status] IS NULL THEN 1 ELSE 0 END)',
    distinctCountExpression: 'COUNT(DISTINCT [status])',
    textLengthExpression: 'LEN(CAST([status] AS NVARCHAR(MAX)))',
    castToText: 'CAST([status] AS NVARCHAR(MAX))',
    sampleValueAggregation:
      '(SELECT STRING_AGG(CAST(value AS NVARCHAR(MAX)), CHAR(31)) FROM (SELECT status AS value FROM orders) AS relationship_profile_values)',
    cardinalityContains: 'SELECT COUNT(DISTINCT val) AS cardinality',
    randomizedCardinalityContains: 'ORDER BY NEWID()',
    distinctValuesContains: 'SELECT TOP 20 val',
    statisticsContains: null,
    dimensionInput: 'datetime2',
    dimensionType: 'time',
    nativeTypeInput: 'uniqueidentifier',
    normalizedType: 'uniqueidentifier',
  },
];

describe('getDialectForDriver', () => {
  it.each(fixtures)('returns a full KtxSqlDialect for $driver', (fixture) => {
    const dialect = getSqlDialectForDriver(fixture.driver);
    const column = dialect.quoteIdentifier('status');

    expect(dialect.type).toBe(fixture.driver);
    expect(dialect.quoteIdentifier(fixture.quoteInput)).toBe(fixture.quotedIdentifier);
    expect(dialect.formatTableName(fixture.table)).toBe(fixture.formattedTable);
    expect(dialect.formatDisplayRef(fixture.table)).toBe(fixture.display);
    expect(dialect.parseDisplayRef(fixture.display)).toEqual(fixture.table);
    expect(dialect.parseDisplayRef(fixture.invalidDisplay)).toBeNull();
    expect(dialect.columnDisplayTablePartCount()).toBe(fixture.columnDisplayTablePartCount);
    expect(dialect.getLimitOffsetClause(25, 5)).toBe(fixture.limitClause);
    expect(dialect.getTopClause(25)).toBe(fixture.topClause);
    expect(dialect.getRandomSampleFilter(0.25)).toBe(fixture.randomFilter);
    expect(dialect.getTableSampleClause(0.25)).toBe(fixture.tableSampleClause);
    expect(dialect.generateSampleQuery(fixture.formattedTable, 5, ['id', 'status'])).toBe(fixture.sampleQuery);
    expect(dialect.generateColumnSampleQuery(fixture.formattedTable, 'status', 10)).toContain(
      fixture.columnSampleContains,
    );
    expect(dialect.getNullCountExpression(column)).toBe(fixture.nullCountExpression);
    expect(dialect.getDistinctCountExpression(column)).toBe(fixture.distinctCountExpression);
    expect(dialect.textLengthExpression(column)).toBe(fixture.textLengthExpression);
    expect(dialect.castToText(column)).toBe(fixture.castToText);
    expect(dialect.getSampleValueAggregation(innerSampleSql)).toBe(fixture.sampleValueAggregation);
    expect(dialect.generateCardinalitySampleQuery(fixture.formattedTable, column, 100)).toContain(
      fixture.cardinalityContains,
    );
    expect(dialect.generateRandomizedCardinalitySampleQuery(fixture.formattedTable, column, 100)).toContain(
      fixture.randomizedCardinalityContains,
    );
    expect(dialect.generateDistinctValuesQuery(fixture.formattedTable, column, 20)).toContain(
      fixture.distinctValuesContains,
    );

    const statistics = dialect.generateColumnStatisticsQuery(fixture.table.db ?? '', fixture.table.name);
    if (fixture.statisticsContains) {
      expect(statistics).toContain(fixture.statisticsContains);
    } else {
      expect(statistics).toBeNull();
    }

    expect(dialect.mapToDimensionType(fixture.dimensionInput)).toBe(fixture.dimensionType);
    expect(dialect.mapDataType(fixture.nativeTypeInput)).toBe(fixture.normalizedType);
  });

  it('accepts three-part ANSI display refs while keeping one-part names caller-owned', () => {
    for (const driver of ['postgres', 'mysql', 'clickhouse'] as const) {
      const dialect = getDialectForDriver(driver);
      expect(dialect.parseDisplayRef('warehouse.public.orders')).toEqual({
        catalog: 'warehouse',
        db: 'public',
        name: 'orders',
      });
      expect(dialect.parseDisplayRef('orders')).toBeNull();
    }
  });

  it('throws with a supported-driver list for unknown drivers', () => {
    expect(() => getDialectForDriver('oracle')).toThrow(
      'Unsupported driver "oracle". Supported drivers: bigquery, clickhouse, mongodb, mysql, postgres, snowflake, sqlite, sqlserver',
    );
  });

  it('rejects legacy driver aliases', () => {
    expect(() => getDialectForDriver('postgresql')).toThrow('Unsupported driver "postgresql"');
    expect(() => getDialectForDriver('sqlite3')).toThrow('Unsupported driver "sqlite3"');
  });
});