feat: Add duckdb connector (#308)

* refactor(duckdb): extract shared json-safe bigint helper

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* feat(duckdb): add and register the duckdb primary connector

Add KtxDuckDbDialect, KtxDuckDbScanConnector (local file-backed, read-only,
never-create, main-schema introspection via information_schema and
duckdb_constraints() for foreign keys), and register the duckdb driver across
the dialect factory, driver registry, connection-type enum, warehouse descriptor,
config schema, scan normalization, connection test drivers, and status display.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* feat(duckdb): route live-database ingest through the DuckDB connector

Add the DuckDB live-database introspection bridge and dispatch duckdb
connections to it in local-adapters, matching the SQLite path. Repoint the
config-rejection test off duckdb (now a valid driver) onto the no-driver case.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* feat(duckdb): add duckdb to the setup database flow

Offer DuckDB in the interactive checklist and via ktx setup --database duckdb,
with a file-path prompt and duckdb-local default connection id, parallel to SQLite.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* feat(duckdb): attach native duckdb files in federation

Native .duckdb members ATTACH with (READ_ONLY) and no TYPE/INSTALL/LOAD, since
the duckdb format needs no extension. attachTypeForDriver returns null for the
native case; buildAttachStatements builds load statements from non-null types
only and emits a conditional ATTACH clause.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* docs(duckdb): document the duckdb primary-source connector

Add a DuckDB section to the primary-sources integration page (config, read-only
never-create behavior, main-schema scope, federation) and update the
supported-driver assertion in dialects.test.ts to include duckdb.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* fix(duckdb): use single-namespace display shape for main-only refs

DuckDB v1 introspects the main schema and sets db=null on every table, so its
display refs are single-namespace like SQLite. The ansi shape emitted a 1-part
table display it then refused to parse, breaking column-level display resolution.
Switch the dialect to the sqlite display shape and add a round-trip test plus a
composite-foreign-key test that were missing.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* refactor(duckdb): resolve connector dialect via getDialectForDriver

Route the connector's dialect through the shared factory like every other
connector, now that duckdb is registered. Single construction path.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* fix(duckdb): skip schema picker for single-file duckdb setup

DuckDB is a single-file, single-namespace ('main') database like SQLite,
but the setup scope step only skipped the schema picker for sqlite. DuckDB
fell into the multi-schema path with an empty schema list, rendering a
broken picker ("No matches found" for main). Extend the file-based-driver
early-return to cover duckdb so it ingests every table directly.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* refactor(duckdb): reuse shared config helper and derive scope skip

Route duckdb path resolution through the shared resolveStringReference
helper instead of a local third copy of env:/file: handling. Derive the
setup scope-picker skip from SCOPE_DISCOVERY_SPECS membership rather than
a hardcoded sqlite/duckdb driver list.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* test(duckdb): use a genuinely unknown driver in the rejection test

The merged "rejects unknown drivers" test used `driver: duckdb` as its
unknown-driver stand-in, which stopped being unknown once this branch
added the duckdb connector. Switch to `nonsense` so it again exercises
the unsupported-driver config error.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* test(duckdb): cover dialect, connector, and live-introspection branches

Codecov flagged uncovered branches as dead code; all are real connector,
dialect, and live-ingest behavior. Add unit tests instead of removing them.

- dialect: precedence ladder, sample/clause builders, profiling expressions
- connector: url/env config forms, error throws, never-create guard,
  cardinality cap branches, table-scope empty/non-empty paths
- live-introspection: full-schema and table-scope extraction

Functions 100%, lines ~99% across the duckdb connector dir.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* docs: add DuckDB to supported-driver references

The DuckDB connector PR documented the connector itself but left the
scattered supported-driver enumerations stale. Add duckdb to the
federation concept page (participation table, activation, table naming,
limitations), the ktx setup CLI reference, the ktx.yaml warehouse-driver
table, the primary-sources field reference, and the quickstart driver
list (which also restores the missing ClickHouse entry).

---------

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
Co-authored-by: Andrey Avtomonov <andreybavt@gmail.com>
This commit is contained in:
Kevin Messiaen 2026-07-01 19:06:02 +07:00 committed by GitHub
parent f21594c42a
commit 3c4fcc27c7
No known key found for this signature in database
GPG key ID: B5690EEEBB952194
39 changed files with 1366 additions and 59 deletions

View file

@ -120,9 +120,9 @@ runtime features are missing.
| Flag | Description |
|------|-------------|
| `--database <driver>` | Database driver to configure; repeatable. Choices: `sqlite`, `postgres`, `mysql`, `clickhouse`, `sqlserver`, `bigquery`, `snowflake` |
| `--database <driver>` | Database driver to configure; repeatable. Choices: `sqlite`, `duckdb`, `postgres`, `mysql`, `clickhouse`, `sqlserver`, `bigquery`, `snowflake` |
| `--database-connection-id <id>` | Existing selected connection id; repeatable. With `--database` or `--database-url`, connection id for the new connection. |
| `--database-url <url>` | URL, `env:NAME`, or `file:/path` for one new URL-style database connection; also used as the SQLite path |
| `--database-url <url>` | URL, `env:NAME`, or `file:/path` for one new URL-style database connection; also used as the SQLite or DuckDB path |
| `--database-schema <schema>` | Database schema or dataset to include; repeatable |
| `--skip-databases` | Leave database setup incomplete |

View file

@ -1,6 +1,6 @@
---
title: Cross-database federation
description: How ktx federates postgres, mysql, and sqlite connections so a single read-only SQL query can join across them without copying data.
description: How ktx federates postgres, mysql, sqlite, and duckdb connections so a single read-only SQL query can join across them without copying data.
---
Cross-database federation lets a single read-only SQL query join tables that
@ -20,13 +20,14 @@ block to add. With zero or one compatible connection the behavior is unchanged.
## Which connections participate
The v1 federation engine supports three drivers:
The v1 federation engine supports four drivers:
| Driver | Participates in federation |
|--------|---------------------------|
| `postgres` | Yes |
| `mysql` | Yes |
| `sqlite` | Yes |
| `duckdb` | Yes |
| `snowflake` | No — standalone connection |
| `bigquery` | No — standalone connection |
| `clickhouse` | No — standalone connection |
@ -38,7 +39,7 @@ queried independently; they do not appear as federation members.
## How it activates
**ktx** inspects the connections in `ktx.yaml` at startup. When it finds two or
more connections whose driver is `postgres`, `mysql`, or `sqlite`, it
more connections whose driver is `postgres`, `mysql`, `sqlite`, or `duckdb`, it
instantiates the DuckDB federation engine and attaches each one read-only.
There is no `federation:` key, no opt-in flag, and no connection-level setting
to enable. The engine is derived entirely from what is already declared.
@ -60,9 +61,10 @@ Two attach-compatible connections are present, so federation is active.
## Table naming in federated queries
Inside a federated query, postgres and mysql tables use a three-part name:
`connectionId.schema.table`. SQLite tables, which have no schema layer in
DuckDB, use the two-part form `connectionId.table`. In both cases the
connection's `id` field in `ktx.yaml` becomes the catalog name inside DuckDB.
`connectionId.schema.table`. SQLite and DuckDB tables use the two-part form
`connectionId.table`, since ktx addresses both as single-namespace members. In
both cases the connection's `id` field in `ktx.yaml` becomes the catalog name
inside DuckDB.
If a connection `id` is not a bare SQL identifier — for example it contains a
hyphen, like `books-db` — double-quote it in the query the same way DuckDB
@ -131,8 +133,8 @@ ktx sql -c _ktx_federated \
Table names follow the rules from
[Table naming in federated queries](#table-naming-in-federated-queries):
three-part `connectionId.schema.table` for postgres and mysql, two-part
`connectionId.table` for sqlite. The `_ktx_federated` id is virtual — it is
never written to `ktx.yaml` and only exists when two or more attach-compatible
`connectionId.table` for sqlite and duckdb. The `_ktx_federated` id is virtual —
it is never written to `ktx.yaml` and only exists when two or more attach-compatible
connections are declared. It surfaces in `ktx connection` and in the agent's
connection list so the id is discoverable. Querying a single member database
directly with its own connection id (`ktx sql -c pg_books ...`) is unchanged.
@ -149,6 +151,6 @@ database through the federation engine.
them in a source's `joins:` block and automatic discovery of cross-database
relationships are not available yet. Intra-database relationship discovery for
each member connection is unchanged.
- **postgres, mysql, and sqlite only.** Other drivers (snowflake, bigquery,
clickhouse, sqlserver) do not participate in federation in this version. They
remain usable as standalone connections.
- **postgres, mysql, sqlite, and duckdb only.** Other drivers (snowflake,
bigquery, clickhouse, sqlserver) do not participate in federation in this
version. They remain usable as standalone connections.

View file

@ -109,6 +109,7 @@ context-source drivers share the map.
| `postgres` | Warehouse | `driver` | `url`, `enabled_tables`, `historicSql`, `context.queryHistory` |
| `mysql` | Warehouse | `driver` | `url`, `enabled_tables` |
| `sqlite` | Warehouse | `driver` | `url` or `path`, `enabled_tables` |
| `duckdb` | Warehouse | `driver` | `url` or `path`, `enabled_tables` |
| `sqlserver` | Warehouse | `driver` | `url`, `enabled_tables` |
| `bigquery` | Warehouse | `driver` | `credentials_json`, `dataset_ids`, `enabled_tables`, `historicSql` |
| `snowflake` | Warehouse | `driver` | `schema_names`, `enabled_tables`, `historicSql` |

View file

@ -218,7 +218,8 @@ The wizard walks you through everything **ktx** needs in one pass:
3. **Embeddings** - picks an embeddings backend. Choose OpenAI for hosted
embeddings or `sentence-transformers` to run locally without an API key.
4. **Database** - adds at least one primary connection. Supported drivers:
SQLite, PostgreSQL, MySQL, SQL Server, BigQuery, and Snowflake.
PostgreSQL, Snowflake, BigQuery, MySQL, ClickHouse, SQL Server, SQLite, and
DuckDB.
5. **Context sources** - optionally adds dbt, MetricFlow, LookML, Looker,
Metabase, or Notion. You can skip and add them later.
6. **Build** - offers to run the first ingest so semantic sources and wiki

View file

@ -1,6 +1,6 @@
---
title: Primary Sources
description: Connect ktx to PostgreSQL, Snowflake, BigQuery, MySQL, ClickHouse, SQL Server, SQLite, or MongoDB.
description: Connect ktx to PostgreSQL, Snowflake, BigQuery, MySQL, ClickHouse, SQL Server, SQLite, DuckDB, or MongoDB.
---
**ktx** connects to your data warehouse or database to build schema context,
@ -26,14 +26,14 @@ Agents should prefer environment or file references over literal secrets.
| Field | Required | Applies to | Description |
|-------|----------|------------|-------------|
| `driver` | Yes | all connections | Connector driver such as `postgres`, `snowflake`, `bigquery`, `mysql`, `clickhouse`, `sqlserver`, `sqlite`, or `mongodb` |
| `driver` | Yes | all connections | Connector driver such as `postgres`, `snowflake`, `bigquery`, `mysql`, `clickhouse`, `sqlserver`, `sqlite`, `duckdb`, or `mongodb` |
| `url` | One of the connection methods | URL-style connectors | Database URL, `env:NAME`, or `file:/path/to/secret` |
| `host`, `port`, `database`, `username`, `password` | One of the connection methods | PostgreSQL, MySQL, SQL Server | Field-by-field connection values |
| `schema` or `schemas` | No | schema-aware warehouses | Single schema or list of schemas to scan |
| `databases` | No | ClickHouse, MongoDB | List of databases to scan |
| `sample_size`, `order_by` | No | MongoDB | Schema-inference sampling controls (recent documents, sort field) |
| `context.queryHistory` | No | PostgreSQL, Snowflake, BigQuery | Enables query-history ingestion when the warehouse supports it |
| `path` | Yes for path-style SQLite | SQLite | Local SQLite database path or `env:NAME` reference |
| `path` | Yes for path-style SQLite/DuckDB | SQLite, DuckDB | Local SQLite or DuckDB database path or `env:NAME` reference |
| `max_bytes_billed` | No | BigQuery | Maximum bytes billed per query job |
| `query_timeout_ms` | No | all warehouses | Maximum execution time for a single read-only query, in milliseconds (default 30000). A query exceeding it is cancelled server-side (or, for SQLite, by terminating the off-process executor) and returns a `query exceeded Ns` error so the agent can revise. |
| `project_id` | No | BigQuery | Optional local descriptor and mapping metadata; not used for BigQuery authentication |
@ -545,6 +545,52 @@ No authentication required - SQLite is file-based. The file must be readable by
---
## DuckDB
File-based connector using the DuckDB Node.js API. Ideal for local analytics, embedded warehouses, and cross-database federation.
### Connection config
```yaml title="ktx.yaml"
connections:
warehouse:
driver: duckdb
path: data/warehouse.duckdb
```
`path` is resolved relative to the project directory. The `.duckdb` file must already exist — **ktx** never creates a missing database file.
### Authentication
No authentication required — DuckDB is file-based. The `.duckdb` file must be readable by the process running **ktx**.
### Features
| Feature | Supported | Notes |
|---------|-----------|-------|
| Tables & views | Yes | Via `information_schema` on the `main` schema |
| Primary keys | Yes | Via `information_schema.table_constraints` |
| Foreign keys | Yes | Via DuckDB's `duckdb_constraints()` catalog function |
| Row count estimates | Yes | Exact count via `SELECT COUNT(*)` |
| Column statistics | No | - |
| Query history | No | - |
| Table sampling | Yes | - |
| Nested analysis | No | - |
### Dialect notes
- Introspection scans the `main` schema only
- Execution is read-only; **ktx** opens the file without write access
- Parameter binding uses positional `?` placeholders
- Uses `LIMIT X OFFSET Y` for pagination
- Database file must exist before `ktx connection test` or ingest runs
### Cross-database federation
When a project declares two or more attach-compatible connections — any combination of `postgres`, `mysql`, `sqlite`, and `duckdb` — **ktx** derives a cross-database federation connection. That connection can ATTACH a native `.duckdb` file, allowing semantic queries to join across sources without manually copying data.
---
## MongoDB
Connects to MongoDB as a primary context source. **ktx** treats each collection