apunkt/ktx - bitfreedom.net: free all bits, everywhere

apunkt/ktx

mirror of https://github.com/Kaelio/ktx.git synced 2026-07-04 10:52:13 +02:00

Author	SHA1	Message	Date
Patel Dhrit	fe7e6bd1fa	feat(connector): add Amazon Athena connector via Glue Data Catalog (#309)
* feat(connector): add Amazon Athena connector via Glue Data Catalog
* fix(athena): address reviewer feedback
* fix(athena): wire scope discovery, fix normalizeDriver, tighten types and tests
* fix(athena): honor databases scope, wire sql-analysis dialect, harden config resolution
- introspect() limits to the configured `databases` scope instead of scanning every Glue database in the account (docs promised this; connector ignored it)
- add athena -> athena to sql-analysis SQLGLOT_DIALECTS so `ktx sql` and MCP read-only validation parse Athena SQL under the Trino grammar, not postgres
- stringConfigValue coerces a resolved-empty `env:` reference to undefined so optional fields fall back to their defaults (workgroup 'primary', catalog 'AwsDataCatalog') instead of ''
- drop trailing whitespace in dialect.test.ts
* fix(athena): integrate with main's SQL/non-SQL dialect split and add dialect notes
Rebase onto main, which introduced the KtxDialect (core) vs KtxSqlDialect (SQL-only) split for MongoDB:
- KtxAthenaDialect implements KtxSqlDialect; the connector resolves it via getSqlDialectForDriver so SQL-generation methods stay in scope
- add authored athena.md SQL notes for the sql_dialect_notes MCP tool, required now that athena resolves to the athena sqlglot dialect (dialect-notes coverage is derived from the warehouse-driver registry)
---------
Co-authored-by: Andrey Avtomonov <andreybavt@gmail.com>	2026-07-02 15:00:26 +02:00
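
The config-resolution fix in fe7e6bd1fa (an empty resolved `env:` reference should behave like an unset field) can be illustrated roughly as follows. This is a minimal sketch, not the connector's code: the signature of `stringConfigValue`, the `Env` type, and `resolveAthenaOptions` are assumptions made for illustration, and the sketch also treats a plain empty string as unset, which the commit message does not state.

```typescript
// Hypothetical sketch: the real stringConfigValue in ktx may take different arguments.
type Env = Record<string, string | undefined>;

function stringConfigValue(raw: unknown, env: Env): string | undefined {
  if (typeof raw !== "string") return undefined;
  // Resolve an `env:NAME` reference; other strings pass through unchanged.
  const resolved = raw.startsWith("env:") ? env[raw.slice("env:".length)] : raw;
  // A missing or empty resolved value counts as "not configured".
  return resolved === undefined || resolved.trim() === "" ? undefined : resolved;
}

interface AthenaOptions {
  workgroup: string;
  catalog: string;
}

// Hypothetical helper: applies the defaults named in the commit message.
function resolveAthenaOptions(config: Record<string, unknown>, env: Env): AthenaOptions {
  return {
    workgroup: stringConfigValue(config["workgroup"], env) ?? "primary",
    catalog: stringConfigValue(config["catalog"], env) ?? "AwsDataCatalog",
  };
}
```

Returning `undefined` rather than `''` lets the `??` operator apply the default, which is the behavior the commit describes for optional fields.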
Kevin Messiaen	3c4fcc27c7	feat: Add duckdb connector (#308)
* refactor(duckdb): extract shared json-safe bigint helper
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(duckdb): add and register the duckdb primary connector
Add KtxDuckDbDialect, KtxDuckDbScanConnector (local file-backed, read-only, never-create, main-schema introspection via information_schema and duckdb_constraints() for foreign keys), and register the duckdb driver across the dialect factory, driver registry, connection-type enum, warehouse descriptor, config schema, scan normalization, connection test drivers, and status display.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(duckdb): route live-database ingest through the DuckDB connector
Add the DuckDB live-database introspection bridge and dispatch duckdb connections to it in local-adapters, matching the SQLite path. Repoint the config-rejection test off duckdb (now a valid driver) onto the no-driver case.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(duckdb): add duckdb to the setup database flow
Offer DuckDB in the interactive checklist and via ktx setup --database duckdb, with a file-path prompt and duckdb-local default connection id, parallel to SQLite.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(duckdb): attach native duckdb files in federation
Native .duckdb members ATTACH with (READ_ONLY) and no TYPE/INSTALL/LOAD, since the duckdb format needs no extension. attachTypeForDriver returns null for the native case; buildAttachStatements builds load statements from non-null types only and emits a conditional ATTACH clause.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(duckdb): document the duckdb primary-source connector
Add a DuckDB section to the primary-sources integration page (config, read-only never-create behavior, main-schema scope, federation) and update the supported-driver assertion in dialects.test.ts to include duckdb.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* fix(duckdb): use single-namespace display shape for main-only refs
DuckDB v1 introspects the main schema and sets db=null on every table, so its display refs are single-namespace like SQLite. The ansi shape emitted a 1-part table display it then refused to parse, breaking column-level display resolution. Switch the dialect to the sqlite display shape and add a round-trip test plus a composite-foreign-key test that were missing.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* refactor(duckdb): resolve connector dialect via getDialectForDriver
Route the connector's dialect through the shared factory like every other connector, now that duckdb is registered. Single construction path.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* fix(duckdb): skip schema picker for single-file duckdb setup
DuckDB is a single-file, single-namespace ('main') database like SQLite, but the setup scope step only skipped the schema picker for sqlite. DuckDB fell into the multi-schema path with an empty schema list, rendering a broken picker ("No matches found" for main). Extend the file-based-driver early-return to cover duckdb so it ingests every table directly.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* refactor(duckdb): reuse shared config helper and derive scope skip
Route duckdb path resolution through the shared resolveStringReference helper instead of a local third copy of env:/file: handling. Derive the setup scope-picker skip from SCOPE_DISCOVERY_SPECS membership rather than a hardcoded sqlite/duckdb driver list.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* test(duckdb): use a genuinely unknown driver in the rejection test
The merged "rejects unknown drivers" test used `driver: duckdb` as its unknown-driver stand-in, which stopped being unknown once this branch added the duckdb connector. Switch to `nonsense` so it again exercises the unsupported-driver config error.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* test(duckdb): cover dialect, connector, and live-introspection branches
Codecov flagged uncovered branches as dead code; all are real connector, dialect, and live-ingest behavior. Add unit tests instead of removing them.
- dialect: precedence ladder, sample/clause builders, profiling expressions
- connector: url/env config forms, error throws, never-create guard, cardinality cap branches, table-scope empty/non-empty paths
- live-introspection: full-schema and table-scope extraction
Functions 100%, lines ~99% across the duckdb connector dir.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs: add DuckDB to supported-driver references
The DuckDB connector PR documented the connector itself but left the scattered supported-driver enumerations stale. Add duckdb to the federation concept page (participation table, activation, table naming, limitations), the ktx setup CLI reference, the ktx.yaml warehouse-driver table, the primary-sources field reference, and the quickstart driver list (which also restores the missing ClickHouse entry).
---------
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
Co-authored-by: Andrey Avtomonov <andreybavt@gmail.com>	2026-07-01 12:06:02 +00:00
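
The federation change in 3c4fcc27c7 (native `.duckdb` members attach with `(READ_ONLY)` and need no `INSTALL`/`LOAD`, while other members carry a `TYPE`) might look roughly like the sketch below. The function names `attachTypeForDriver` and `buildAttachStatements` come from the commit message, but their signatures, the `FederationMember` shape, and the list of non-native drivers are assumptions made for illustration.

```typescript
// Hypothetical sketch: signatures and supported drivers are assumed, not taken from ktx.
type AttachType = "sqlite" | "postgres" | "mysql";

interface FederationMember {
  driver: string;
  alias: string; // assumed to be a safe SQL identifier
  path: string;
}

// Native DuckDB files need no extension, so they have no attach type.
function attachTypeForDriver(driver: string): AttachType | null {
  switch (driver) {
    case "duckdb":
      return null;
    case "sqlite":
    case "postgres":
    case "mysql":
      return driver;
    default:
      throw new Error(`unsupported federation driver: ${driver}`);
  }
}

function quoteLiteral(value: string): string {
  return `'${value.replace(/'/g, "''")}'`;
}

function buildAttachStatements(members: FederationMember[]): string[] {
  const typed = members.map((member) => ({ member, type: attachTypeForDriver(member.driver) }));
  // Load statements come from non-null attach types only, once per extension.
  const extensions = Array.from(
    new Set(typed.map((t) => t.type).filter((t): t is AttachType => t !== null)),
  );
  const loads = extensions.flatMap((ext) => [`INSTALL ${ext};`, `LOAD ${ext};`]);
  // The TYPE option is emitted conditionally; READ_ONLY is always present.
  const attaches = typed.map(({ member, type }) => {
    const options = type === null ? "READ_ONLY" : `TYPE ${type}, READ_ONLY`;
    return `ATTACH ${quoteLiteral(member.path)} AS ${member.alias} (${options});`;
  });
  return [...loads, ...attaches];
}
```

Modeling the native case as `null` keeps the "does this member need an extension?" decision in one place, so the load list and the ATTACH options cannot drift apart.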
Pintouch	2afab61417	feat(connectors): add MongoDB connector (#305) (#310)
* refactor(connectors): split KtxDialect into core and KtxSqlDialect
Separate the dialect contract into a driver-agnostic core (display/ref formatting and type mapping) and a SQL-only extension (query generators). The catalog and entity-details paths resolve the core dialect for any snapshot driver, so it must stay free of SQL generation; this is the prerequisite refactor for adding non-SQL primary sources.
- KtxDialect keeps type, formatDisplayRef, parseDisplayRef, columnDisplayTablePartCount, mapDataType, mapToDimensionType
- KtxSqlDialect extends it with quoteIdentifier, formatTableName, and the query/sample/statistics generators; the 7 SQL dialects implement it
- add getSqlDialectForDriver for SQL drivers; the 7 connectors and the relationship-benchmark harness consume it
- thread the relationship pipeline (profiling/validation/composite/discovery) as KtxSqlDialect | null so a non-SQL source skips coverage SQL and its candidates stay in review; local-enrichment builds the SQL dialect only when the connector advertises readOnlySql
Pure extraction: no behavior change for the existing 7 drivers.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* feat(connectors): add MongoDB connector for issue #305
Add a read-only MongoDB connector that treats a database as a primary context source: collections map to tables and inferred top-level fields to columns. MongoDB is the first non-SQL source (readOnlySql: false), so ktx sql and metric compilation do not apply, but its collections flow through ingest, descriptions, and relationship discovery.
- schema-inference: infer a flat column schema from the most recent sample_size documents (by _id desc, or order_by for non-ObjectId keys). Union BSON types per field, mark multi-type fields mixed (string), keep sub-documents/arrays as a single opaque json column, derive nullability from presence, treat _id as the primary key
- connector: KtxMongoDbScanConnector behind an injectable client seam; strictly read-only (find/listCollections/estimatedDocumentCount only), no executeReadOnly; resolves env:/file: via resolveKtxConfigReference
- core-only KtxMongoDbDialect and a live-database introspection adapter
- wire the mongodb driver: driver union, dialect registry, driver registration (scopeConfigKey databases), mongodbConnectionSchema, connection-drivers, normalizeDriver, the live-database route, and the ktx setup picker. ktx sql is refused by the read-only SQL capability gate
- tests: schema inference, connector snapshot via a fake client, dialect, driver-schema parsing, and the ktx sql rejection
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* docs(integrations): document the MongoDB primary source
Add a MongoDB section to the primary-sources reference: connection config (url, databases, enabled_tables, sample_size, order_by), mongodb+srv/TLS/Atlas notes, the schema-inference explainer, a features matrix, and the non-SQL caveat. Update the frontmatter and connection field reference.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* fix(connectors): address review blockers on the MongoDB connector
- introspect: skip estimatedDocumentCount for views. The count command is rejected on a MongoDB view (CommandNotSupportedOnView), so counting a view aborted introspect for the whole connection; compute estimatedRows only for real collections, as ClickHouse does.
- sl: refuse a semantic-layer query against a non-SQL connection instead of defaulting it to the Postgres dialect. compileLocalSlQuery (the shared CLI + MCP path) now rejects a driver with no SQL dialect via the new isSqlQueryableDriver authority, keeping MongoDB context-only per issue #305.
- tests: cover input.tableScope and the empty-scope skip for the Mongo connector (the scan layer does not post-filter), the view no-count path, and the ktx sl query refusal for a mongodb connection.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* polish(mongodb): compute sampled nullCount and document sampling caveats
Address the non-blocking review notes:
- sampleColumn now counts null/absent values over the sampled window instead of returning nullCount: null, since the documents are already in hand
- warn that a custom order_by must be indexed (an unindexed sort hits MongoDB's in-memory sort limit on large collections) in the connection schema and docs
- note that sampled values for nested fields are stringified, not faithfully serialized, so the json opacity is deliberate
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* docs(examples): add a MongoDB connector example
A manual, container-backed example mirroring examples/postgres-historic:
- docker-compose.yml + init/seed.js seed a representative dataset (nested documents, arrays, a Decimal128, a mixed-type field, a nullable field, an ObjectId reference, and a view) on first container start
- scripts/smoke.sh + introspect-smoke.mjs assert the connector's inferred schema with no LLM credentials — the same introspection entry point ktx ingest's database-schema stage uses, including the view-no-count path
- README.md documents the smoke and a full keyless ktx ingest run (claude-code LLM + managed sentence-transformers embeddings)
Works with Docker Compose or podman compose. Verified end to end.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore: ignore examples/ in knip to fix dead-code false positives
The MongoDB connector example files (examples/mongodb/init/seed.js and examples/mongodb/scripts/introspect-smoke.mjs) are used at runtime but were flagged as unused by knip. Add examples/ to the ignore array, matching the existing .context/** entry.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_0114qQV8fJ5a5ME3XbMVRzbL
* fix(mongodb): refuse non-SQL connections before SQL analysis
`ktx sql` and the MCP sql_execution tool resolved a SQL-analysis dialect (falling back to Postgres for a non-SQL driver) and ran read-only validation before the connector capability gate refused the connection. For a MongoDB connection that spun up the parser/daemon and produced Postgres parser diagnostics instead of a clean non-SQL refusal. Route both entry points through a shared assertSqlQueryableConnection guard before dialect selection, mirroring compileLocalSlQuery. The federated duckdb path has no driver and is exempted at each call site. Add CLI and MCP regression tests asserting validation/connector work never starts for a MongoDB connection.
* fix(mongodb): pass CI gates (dialect boundary, secrets, setup test)
Three latent failures in the connector surfaced once CI ran on the branch:
- connector.ts imported the concrete KtxMongoDbDialect, which the connector dialect-import boundary forbids. Route it through getDialectForDriver('mongodb') and widen inferKtxMongoCollectionColumns to the base KtxDialect (it only uses mapDataType/mapToDimensionType).
- detect-secrets flagged a test ObjectId hex and the mongodb+srv example URL; annotate both with allowlist pragmas.
- the "shows every supported database" setup test omitted the new MongoDB option.
---------
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-authored-by: Luca Martial <48870843+luca-martial@users.noreply.github.com>
Co-authored-by: Luca Martial <lucamrtl@gmail.com>
Co-authored-by: Andrey Avtomonov <andreybavt@gmail.com>	2026-06-29 15:17:56 +02:00
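
The schema-inference rules listed in 2afab61417 (union the types seen per field, collapse multi-type fields to string, keep sub-documents and arrays as one opaque json column, derive nullability from presence, treat `_id` as the primary key) can be sketched over plain JavaScript values. This is an illustrative approximation only: `inferColumns`, `valueType`, and the type names are hypothetical, real BSON types such as ObjectId and Decimal128 are not modeled, and the connector's sampling and ordering are omitted.

```typescript
// Hypothetical sketch over plain JS values; the real connector works on BSON documents.
type Doc = Record<string, unknown>;

interface InferredColumn {
  name: string;
  type: string;
  nullable: boolean;
  primaryKey: boolean;
}

// Returns null for absent/null values so they only affect nullability.
function valueType(value: unknown): string | null {
  if (value === null || value === undefined) return null;
  if (value instanceof Date) return "date";
  // Sub-documents and arrays stay a single opaque json column.
  if (typeof value === "object") return "json";
  return typeof value;
}

function inferColumns(sample: Doc[]): InferredColumn[] {
  const seen = new Map<string, { types: Set<string>; present: number }>();
  for (const doc of sample) {
    for (const [field, value] of Object.entries(doc)) {
      const entry = seen.get(field) ?? { types: new Set<string>(), present: 0 };
      const type = valueType(value);
      if (type !== null) {
        entry.types.add(type);
        entry.present += 1;
      }
      seen.set(field, entry);
    }
  }
  return Array.from(seen, ([name, { types, present }]) => ({
    name,
    // Exactly one observed type is kept; zero or several collapse to string.
    type: types.size === 1 ? (Array.from(types)[0] ?? "string") : "string",
    // A field missing or null in any sampled document is nullable.
    nullable: present < sample.length,
    primaryKey: name === "_id",
  }));
}
```

Because only the sampled documents are inspected, a field that is absent from the sample never becomes a column, which is consistent with the sampling caveats the commit documents.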
Andrey Avtomonov	663eaff940	feat(cli): setup progress spinners, Tab-to-select, and banner polish (#296)
* fix(cli): double the height of the setup banner t crossbar
* fix(cli): unify setup multi-select hints and make Tab the select key
The six interactive multi-select surfaces in `ktx setup` documented three different hint voices, one had no hint at all, and they named two different select keys (Space vs Tab). Tab is the only key that can toggle selection without colliding with type-to-search input, so make it the single documented select key everywhere and compose every hint from one shared fragment vocabulary in prompt-navigation.ts.
- Register `updateSettings({ aliases: { tab: 'space' } })` so Tab toggles flat multiselects; the alias applies only to non-text prompts, leaving typed search input (schema/Notion) untouched.
- Add the missing hint to the agent-targets prompt and drop the stray "Space to select … Esc …" info line plus the now-dead writeSetupInfo helper.
- Replace the schema-scope ad-hoc hint with the searchable-multiselect voice and standardize "filter" -> "search" vocabulary.
- Delete DEFAULT_TREE_PICKER_HELP_TEXT and the unused TreePickerChrome.helpText seam; render the shared tree hint instead.
* refactor(cli): show LLM check progress for every setup backend
Rename runLlmHealthCheckWithProgress to validateModelWithProgress and wrap the Claude subscription and Codex auth probes in the same spinner progress as the Anthropic API and Vertex backends, so each backend shows consistent "Checking <provider> LLM" output during setup.
* feat(cli): add ktx-orange progress spinners to setup steps
Add a shared runWithCliSpinner helper and a TTY-aware createCliSpinner: an animated clack spinner in a terminal, and a static stderr-only spinner before raw-mode pickers (the table tree picker and demo tour), where the animated spinner's stdin grab would otherwise corrupt the next prompt.
Wrap the slow setup waits in progress spinners: managed runtime install, embedding daemon start + first-run model download, embeddings health check, the connection-test gate, and source validation / dbt clone / Metabase discovery. Recolor every spinner frame from clack's magenta to the ktx mascot orange (#FF8A4C) via the static helper and clack's styleFrame option.	2026-06-12 16:43:10 +02:00
Andrey Avtomonov	00cdf2de90	refactor: enforce ktx naming and AGENTS.md compliance sweep (#289)
Align the tree with AGENTS.md/CLAUDE.md conventions:
- Rewrite user-facing strings, docs, and tests to lowercase `ktx` (no bare uppercase `KTX` tokens remain outside literal identifiers).
- Drop the legacy `historicSql` migration path and its now-unused helpers, per the no-backward-compat rule.
- Remove `as unknown as` / `any` casts: narrow `BaseTool` generics to `z.ZodObject`, add a typed `createLookerClient`, and delete the dead `getParametersSchema`/`toAnthropicFormat` pre-AI-SDK helpers.
- Use `InvalidArgumentError` for Commander parse failures.
- Finish the adapter→connector prose conversion in the `ktx.yaml` docs while keeping the literal `adapters` config key.	2026-06-11 13:49:45 +02:00
Andrey Avtomonov	853f39a7c3	fix(setup): require explicit no-input database scope (#286)
* test(setup): supply explicit --no-input scope to disabled-mode database tests
* fix(setup): require explicit database scope in --no-input instead of auto-scanning the warehouse
* docs(setup): document --no-input database scope requirement	2026-06-10 10:36:53 +00:00
Andrey Avtomonov	c2beaf7d55	feat(setup): wizard prompt tweaks and quieter query-history filter output (#259)
Setup wizard flow tweaks:
- Add a reveal-tail password prompt (reveal-password-prompt.ts) that unmasks the last few characters of a typed/pasted secret, and wire it into the setup prompt adapter in place of clack's password(); adds the @clack/core dep.
- Reorder wizard select options: surface "Paste a key" before the environment-variable option across embeddings/models/sources, promote Metabase/Notion in the source list, put Git URL before Local path, reorder the Notion crawl-mode choices, and relabel the sources "Done" action.
Query-history filter picker output:
- Collapse the per-template parse-failure lines into a single count in the setup output and route the full template-id list to --debug stderr.
- Model parse failures as a structured parseFailedTemplateIds field instead of warning strings.
- Add a privacy-safe query_history_filter_completed telemetry event (counts/enums only), mirrored into the Python daemon schema.	2026-06-04 14:11:08 +02:00
Andrey Avtomonov	e70ae1e63b	feat(query-history): scope mining to modeled schemas by default (#258)
* feat(query-history): structure SQL analysis table refs
* feat(query-history): qualify SQL analysis table refs
* feat(query-history): wire modeled scope floor through ingest
* chore(query-history): verify scope floor
* test(query-history): align daemon SQL batch endpoint contract
* feat(query-history): build scope from same-run scan catalog
* feat(query-history): fail open on scope-floor catalog failures
* chore(query-history): verify scope-floor v1 closure
* refactor(query-history): share scope membership
* feat(setup): apply derived query history filters
* docs: document derived query history filters
* fix(query-history): redact filter picker LLM prompt SQL
* fix(setup): run filter picker SQL analysis through managed daemon
* chore(query-history): verify filter picker v1 closure
* fix(query-history): fail open on partial service-account attribution
* fix(query-history): aggregate BigQuery users by execution count
* fix(query-history): aggregate Snowflake users by execution count
* fix(query-history): use BigQuery query info hash	2026-06-03 17:19:42 +02:00
Andrey Avtomonov	ce1516b357	feat(cli): consistent connection setup recovery and build-time gate (#257)
* feat(cli): block context build when a required connection fails its live test
A context build can take several minutes, so a connection that is unreachable or misconfigured should stop the build up front instead of failing partway through. Before the build starts, run a live connection test for every primary- and context-source connection the build depends on. Each test's output is captured in a discarded buffer so raw error text (and database paths) never reach the user; failures are surfaced only by connection id and connector type, with a pointer to `ktx connection test <id>` for the underlying error.
- Interactive setup lets the user fix the connection and retry without restarting, re-resolving targets so an added/removed/reconfigured connection is honored.
- `--no-input` exits non-zero and writes a failed context state with a failureReason, so scripts stop early and setup never reads as ready.
Extract the buffered command IO helper out of setup-databases into src/io/buffered-command-io.ts so both call sites share one implementation.
* feat(cli): use recovery primitive for database setup
* feat(cli): use recovery primitive for source setup
* docs: document setup connection recovery
* fix(cli): close database recovery gaps
* fix(cli): target failing project in gate hint and preserve missing-input
Address two review findings on the connection-recovery work:
- The connection-gate failure hint emitted `ktx connection test <id>` with no --project-dir, so a setup run started with `--project-dir ./analytics` pointed users at cwd/KTX_PROJECT_DIR instead of the project that just failed. Emit the resolved project dir, matching the contextBuildCommands convention.
- The non-interactive database configure path returned `cancelled`, which the recovery primitive collapses to `failed`. Sibling paths still report `missing-input` for absent flags, so incomplete-flag runs were indistinguishable from real connection failures. The database wrapper now tracks the configure missing-input signal and restores the `missing-input` step status; the shared primitive keeps its four outcomes.	2026-06-03 11:08:46 +00:00
Andrey Avtomonov	3f0d11e07d	feat(cli)!: remove fast mode; ktx ingest always builds enriched context (KLO-721) (#237)
Fast mode (the ktx ingest --fast/--deep database-ingest depth toggle) is removed. ktx ingest now always builds the full enriched ("deep") context. There is no structural fallback: a database connection without a configured model and embeddings fails the enrichment-readiness preflight before any work runs, with a 'Run ktx setup to configure a model and embeddings' hint.
- Remove --fast/--deep flags, the per-connection context.depth field, and the ktx setup depth prompt (delete setup-database-context-depth.ts).
- Rename ingest-depth.ts -> connection-drivers.ts; ingest always requests scan mode 'enriched'; readiness gate (enrichmentReadinessGaps) runs for every database target.
- Drop the database-context-depth telemetry step (Node + Python schema mirrors regenerated).
- Update CLI, setup, context-build view, docs, the public ktx skill, and the release-smoke / artifacts scripts (now assert the no-LLM guard failure).
ktx status --fast (a separate network-probe flag) is unchanged.
Follow-ups: KLO-726 (live progress for ktx ingest --all), KLO-727 (restore credentialed successful-ingest release smoke coverage).	2026-05-29 17:41:04 +02:00
Andrey Avtomonov	56985b7e09	test: split cli tests from source tree (#216)
* feat(cli): define full warehouse dialect contract
* test(cli): keep dialect edge tests focused
* fix(cli): stabilize dialect contract foundation
* refactor(connectors): own read-only query preparation
* refactor(connectors): resolve dialects through registry
* refactor(connectors): keep concrete dialect classes internal
* chore(workspace): enforce dialect import boundary
* refactor(cli): resolve relationship dialect at scan boundary
* refactor(cli): use dialect display parsing for entity details
* refactor(cli): use dialect display parsing for warehouse catalog
* refactor(cli): use dialect SQL in relationship workflows
* test(cli): verify solid dialect scan workflow closure
* test: split cli tests from source tree
* refactor(cli): standardize BigQuery scope listing
* feat(sqlite): implement connector scope listing
* test(connectors): cover required table listing
* feat(cli): add warehouse driver registry
* refactor(setup): route scope discovery through driver registry
* refactor(cli): route local query execution through driver registry
* refactor(historic-sql): route dialect support through driver registry
* refactor(cli): test warehouse connections through driver registry
* fix(cli): close driver registry type export gaps
* Improve setup daemon diagnostics
* refactor(setup): centralize rail-prefixed diagnostics + query-history fallback
Extract errorMessage, writePrefixedLines, and flushPrefixedBufferedCommandOutput into clack.ts so the setup wizard, managed daemons, and embedding/agent steps share one rail-formatted writer. setup-databases.ts also adds a "disable query history and retry" option when the schema-context build fails and query history is the likely culprit, surfaced via a new failed-query-history-unavailable status.
* fix(cli): carry catalog through the picker so BigQuery/Snowflake/SQL Server scope filters match
The setup picker's KtxTableListEntry was a 2-level { schema, name }, so qualifiedTableId always wrote db.name into enabled_tables. When BigQuery, Snowflake, or SQL Server later ran fast ingest, their introspect step filtered the scope set with scopedTableNames(scope, { catalog: projectId|database, db }) — catalog was non-null on the introspect side but null in the scope refs, so every entry was rejected, the live-database adapter staged zero table files, and detect() failed with 'Adapter "live-database" did not recognize fetched source output'.
Align the picker boundary with the canonical 3-level KtxTableRef:
- Add catalog: string | null to KtxTableListEntry.
- BigQuery/Snowflake/SQL Server listTables populate catalog from the resolved projectId / database; Postgres/MySQL/ClickHouse/SQLite set null.
- qualifiedTableId emits catalog.schema.name when catalog is non-null (resolveEnabledTables already accepts the 3-part shape) and schemasFromEnabledTables now goes through parseDottedTableEntry so it recovers the schema correctly from both 2-part and 3-part entries.
- Export parseDottedTableEntry from enabled-tables.ts (@internal) for picker reuse.
Update listTables expectations in all seven connector tests and the setup / picker test fixtures. Add a picker regression test that covers the catalog-bearing round-trip (save + refine).
* fix(cli): allow debug telemetry under opt-out env	2026-05-26 08:49:05 +02:00
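
The picker fix in 56985b7e09 (emit `catalog.schema.name` when a catalog is present, and recover the schema from both 2-part and 3-part entries) can be sketched as below. The names `KtxTableListEntry`, `qualifiedTableId`, `parseDottedTableEntry`, and `schemasFromEnabledTables` appear in the commit message; their signatures and bodies here are assumptions, and identifiers that themselves contain dots are not handled.

```typescript
// Hypothetical sketch: shapes follow the commit message, implementations are assumed.
interface KtxTableListEntry {
  catalog: string | null;
  schema: string;
  name: string;
}

// Emits a 3-part id only when the connector reported a catalog.
function qualifiedTableId(entry: KtxTableListEntry): string {
  const parts =
    entry.catalog === null
      ? [entry.schema, entry.name]
      : [entry.catalog, entry.schema, entry.name];
  return parts.join(".");
}

// Parses "schema.name" or "catalog.schema.name" from the right-hand side.
function parseDottedTableEntry(entry: string): KtxTableListEntry {
  const parts = entry.split(".");
  const name = parts.pop();
  const schema = parts.pop();
  const catalog = parts.pop() ?? null;
  if (!name || !schema || parts.length > 0) {
    throw new Error(`expected schema.name or catalog.schema.name, got "${entry}"`);
  }
  return { catalog, schema, name };
}

// Recovers distinct schemas regardless of whether entries carry a catalog.
function schemasFromEnabledTables(enabledTables: string[]): string[] {
  return Array.from(new Set(enabledTables.map((e) => parseDottedTableEntry(e).schema)));
}
```

Parsing from the right means the schema is always the second-to-last part, so a 3-part BigQuery-style entry and a 2-part Postgres-style entry resolve to the same field.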

11 commits