ktx/docs-site/content/docs/integrations/primary-sources.mdx
Patel Dhrit fe7e6bd1fa
feat(connector): add Amazon Athena connector via Glue Data Catalog (#309)
* feat(connector): add Amazon Athena connector via Glue Data Catalog

* fix(athena): address reviewer feedback

* fix(athena): wire scope discovery, fix normalizeDriver, tighten types and tests

* fix(athena): honor databases scope, wire sql-analysis dialect, harden config resolution

- introspect() limits to the configured `databases` scope instead of scanning
  every Glue database in the account (docs promised this; connector ignored it)
- add athena -> athena to sql-analysis SQLGLOT_DIALECTS so `ktx sql` and MCP
  read-only validation parse Athena SQL under the Trino grammar, not postgres
- stringConfigValue coerces a resolved-empty `env:` reference to undefined so
  optional fields fall back to their defaults (workgroup 'primary', catalog
  'AwsDataCatalog') instead of ''
- drop trailing whitespace in dialect.test.ts

* fix(athena): integrate with main's SQL/non-SQL dialect split and add dialect notes

Rebase onto main, which introduced the KtxDialect (core) vs KtxSqlDialect
(SQL-only) split for MongoDB:
- KtxAthenaDialect implements KtxSqlDialect; the connector resolves it via
  getSqlDialectForDriver so SQL-generation methods stay in scope
- add authored athena.md SQL notes for the sql_dialect_notes MCP tool, required
  now that athena resolves to the athena sqlglot dialect (dialect-notes coverage
  is derived from the warehouse-driver registry)

---------

Co-authored-by: Andrey Avtomonov <andreybavt@gmail.com>
2026-07-02 15:00:26 +02:00

759 lines
27 KiB
Text

---
title: Primary Sources
description: Connect ktx to PostgreSQL, Snowflake, BigQuery, MySQL, ClickHouse, SQL Server, SQLite, DuckDB, MongoDB, or Amazon Athena.
---
**ktx** connects to your data warehouse or database to build schema context,
discover relationships, and execute semantic layer queries. Each connection is
defined in `ktx.yaml` under the `connections` key.
For analytics tools and knowledge systems such as dbt, MetricFlow, LookML,
Metabase, Looker, and Notion, use [Context Sources](/docs/integrations/context-sources).
For Claude Code, Codex, Cursor, OpenCode, and other agent clients, use
[Agent Clients](/docs/integrations/agent-clients).
All connectors share these conventions:
- Sensitive values support `env:VAR_NAME` (read from environment) and
`file:/path/to/secret` (read from file) references
- Connections are read-only; **ktx** never writes to your database
- Database ingest discovers tables, columns, types, and constraints
automatically
## Connection field reference
Agents should prefer environment or file references over literal secrets.
| Field | Required | Applies to | Description |
|-------|----------|------------|-------------|
| `driver` | Yes | all connections | Connector driver such as `postgres`, `snowflake`, `bigquery`, `mysql`, `clickhouse`, `sqlserver`, `sqlite`, `duckdb`, `mongodb`, or `athena` |
| `url` | One of the connection methods | URL-style connectors | Database URL, `env:NAME`, or `file:/path/to/secret` |
| `host`, `port`, `database`, `username`, `password` | One of the connection methods | PostgreSQL, MySQL, SQL Server | Field-by-field connection values |
| `schema` or `schemas` | No | schema-aware warehouses | Single schema or list of schemas to scan |
| `databases` | No | ClickHouse, MongoDB, Athena | List of databases to scan |
| `sample_size`, `order_by` | No | MongoDB | Schema-inference sampling controls (recent documents, sort field) |
| `context.queryHistory` | No | PostgreSQL, Snowflake, BigQuery | Enables query-history ingestion when the warehouse supports it |
| `path` | Yes for path-style SQLite/DuckDB | SQLite, DuckDB | Local SQLite or DuckDB database path or `env:NAME` reference |
| `max_bytes_billed` | No | BigQuery | Maximum bytes billed per query job |
| `query_timeout_ms` | No | all warehouses | Maximum execution time for a single read-only query, in milliseconds (default 30000). A query exceeding it is cancelled server-side (or, for SQLite, by terminating the off-process executor) and returns a `query exceeded Ns` error so the agent can revise. |
| `project_id` | No | BigQuery | Optional local descriptor and mapping metadata; not used for BigQuery authentication |
| `region` | Yes | Athena | AWS region where the Athena workgroup and Glue catalog reside (e.g. `us-east-1`) |
| `s3_staging_dir` | Yes | Athena | S3 URI for Athena query result storage (e.g. `s3://my-bucket/athena-results/`) |
| `workgroup` | No | Athena | Athena workgroup name (default `primary`) |
| `catalog` | No | Athena | Glue Data Catalog name (default `AwsDataCatalog`) |
| `database` | No | Athena | Default Glue database name passed as the query execution context |
| `databases` | No | Athena | Glue databases to include in schema scans; written by `ktx setup` and read by `ktx ingest` |
## PostgreSQL
The most full-featured connector. Supports schema introspection, foreign key detection, column statistics, and query history via `pg_stat_statements`.
### Connection config
```yaml title="ktx.yaml"
connections:
my-postgres:
driver: postgres
url: env:DATABASE_URL
schema: public
```
Or with individual fields:
```yaml title="ktx.yaml"
connections:
my-postgres:
driver: postgres
host: localhost
port: 5432
database: analytics
username: ktx_reader
password: env:PG_PASSWORD
schemas:
- public
- analytics
ssl: true
```
### Authentication
| Method | Config |
|--------|--------|
| Password | `password: env:PG_PASSWORD` or `password: file:/path/to/secret` |
| Connection URL | `url: env:DATABASE_URL` |
| SSL | `ssl: true`, optionally `rejectUnauthorized: false` for self-signed certs |
### Features
| Feature | Supported | Notes |
|---------|-----------|-------|
| Tables & views | Yes | Via `pg_catalog` |
| Primary keys | Yes | Via `information_schema.table_constraints` |
| Foreign keys | Yes | Full constraint detection |
| Row count estimates | Yes | Via `pg_class.reltuples` |
| Column statistics | Yes | Requires `pg_read_all_stats` role |
| Query history | Yes | Via `pg_stat_statements` extension |
| Table sampling | Yes | `TABLESAMPLE SYSTEM` |
### Query history
PostgreSQL query history mines real query patterns from `pg_stat_statements`.
This helps **ktx** understand how your team actually queries the data.
**Requirements:**
- `pg_stat_statements` extension enabled
- `pg_read_all_stats` role granted to the **ktx** user
**Config options:**
```yaml
context:
queryHistory:
enabled: true
minExecutions: 5
filters:
dropTrivialProbes: true
```
### Dialect notes
- SQL compilation uses `LIMIT/OFFSET` pagination
- Named parameters converted to positional (`$1`, `$2`, ...)
- Supports `COUNT(*) FILTER (WHERE ...)` for null analysis
- Full support for PostgreSQL types: `uuid`, `jsonb`, `timestamptz`, `numeric`, `text[]`, etc.
---
## Snowflake
Connects via the Snowflake SDK. Supports multi-schema scanning, RSA key authentication, and query-history configuration for Snowflake query history.
### Connection config
```yaml title="ktx.yaml"
connections:
my-snowflake:
driver: snowflake
account: xy12345
warehouse: ANALYTICS_WH
database: PROD
schema_names:
- PUBLIC
- SALES
- MARKETING
username: KTX_SERVICE
password: env:SNOWFLAKE_PASSWORD
role: ANALYST
```
`ktx setup` discovers schemas after the connection is verified and writes the
selected list to `schema_names`. You can also set this field manually. For a
single schema, `schema_name: PUBLIC` is accepted as an equivalent shorthand.
### Authentication
| Method | Config |
|--------|--------|
| Password | `password: env:SNOWFLAKE_PASSWORD` |
| RSA key pair | `authMethod: rsa`, `privateKey: file:~/.ssh/snowflake_key.pem`, optional `passphrase` |
### Features
| Feature | Supported | Notes |
|---------|-----------|-------|
| Tables & views | Yes | Via `INFORMATION_SCHEMA.TABLES` |
| Primary keys | Yes | Via table constraints |
| Foreign keys | No | Not available in Snowflake |
| Row count estimates | Yes | From `INFORMATION_SCHEMA.TABLES.ROW_COUNT` |
| Column statistics | No | - |
| Query history | Yes | Via `SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY` when enabled |
| Table sampling | Yes | - |
### Query history
Snowflake query history reads aggregated query-history templates from
`SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY` and feeds the same unified staged
artifact shape as Postgres and BigQuery.
```yaml
context:
queryHistory:
enabled: true
windowDays: 90
minExecutions: 5
filters:
dropTrivialProbes: true
serviceAccounts:
patterns: ['^svc_']
mode: exclude
redactionPatterns: []
```
### Dialect notes
- All identifiers are uppercase by default (case-insensitive matching)
- Connection context set per query (`USE ROLE`, `USE WAREHOUSE`, `USE DATABASE`, `USE SCHEMA`)
- Parameter binding uses positional `?` placeholders
- Date values normalized to ISO 8601 strings
---
## BigQuery
Authenticates via GCP service account credentials. Supports multi-dataset scanning and query-history configuration for `INFORMATION_SCHEMA.JOBS_BY_PROJECT`.
### Connection config
```yaml title="ktx.yaml"
connections:
my-bigquery:
driver: bigquery
credentials_json: file:~/.config/gcloud/bq-service-account.json
dataset_id: analytics
location: US
```
For multiple datasets:
```yaml
dataset_ids:
- analytics
- marketing
- finance
```
BigQuery dataset scope is stored in `connections.<id>.dataset_ids`. Interactive
setup discovers datasets from credentials plus location, then writes the chosen
dataset ids as the scan scope.
### Cross-project datasets
To introspect a dataset hosted in a **different project** than the one your
credentials bill to — for example Google's `bigquery-public-data`, a partner's
shared project, or an organization's central data project — qualify the entry
as `project.dataset`:
```yaml title="ktx.yaml"
connections:
public-bq:
driver: bigquery
credentials_json: file:~/.config/gcloud/bq-service-account.json
location: US
dataset_ids:
- bigquery-public-data.austin_311
- bigquery-public-data.census_bureau_usa
- analytics
```
**ktx** introspects each dataset in its host project while every query job still
bills to the `project_id` inside your `credentials_json`. A bare `dataset` entry
(no prefix) is scanned in your own project, exactly as before. A single
connection may mix datasets from several projects, and two projects may host
datasets with the same name without colliding.
Interactive setup does not enumerate datasets in projects your credentials don't
own, so hand-write `project.dataset` entries for foreign datasets. The wizard's
table picker also only lists datasets in your connection's `location` region;
this affects table selection only — ingest and `discover_data` introspect a
cross-project dataset regardless of region.
### Authentication
| Method | Config |
|--------|--------|
| Service account JSON | `credentials_json: file:/path/to/key.json` |
| Environment variable | `credentials_json: env:BIGQUERY_CREDENTIALS_JSON` |
The project ID is extracted automatically from the service account JSON file.
If you set `project_id` in `ktx.yaml`, **ktx** treats it as local descriptor and
mapping metadata. The BigQuery connector still authenticates with the
`project_id` inside `credentials_json`.
### Features
| Feature | Supported | Notes |
|---------|-----------|-------|
| Tables & views | Yes | Including materialized views and external tables |
| Primary keys | Yes | Via `INFORMATION_SCHEMA` table constraints when declared |
| Foreign keys | No | Not available in BigQuery |
| Row count estimates | Yes | From table metadata |
| Column statistics | No | - |
| Query history | Yes | Via region-scoped `INFORMATION_SCHEMA.JOBS_BY_PROJECT` when enabled |
| Table sampling | Yes | - |
### Query history
BigQuery query history reads aggregated query-history templates from
region-scoped `INFORMATION_SCHEMA.JOBS_BY_PROJECT` and feeds the same unified
staged artifact shape as Postgres and Snowflake.
```yaml
context:
queryHistory:
enabled: true
windowDays: 90
minExecutions: 5
filters:
dropTrivialProbes: true
serviceAccounts:
patterns: ['@bot\\.']
mode: exclude
redactionPatterns: []
```
### Dialect notes
- Parameter binding uses named `@param` syntax
- Arrays flattened to comma-separated strings in results
- Location specified at query execution time
- Supports the `max_bytes_billed` limit from `ktx.yaml`; the shared `query_timeout_ms` field maps to the query job's `jobTimeoutMs`
---
## Amazon Athena
Connects to Amazon Athena using the AWS Glue Data Catalog for schema introspection and the Athena query API for read-only SQL execution. Authentication uses the standard AWS credential chain — no credentials are embedded in `ktx.yaml`.
### Connection config
```yaml title="ktx.yaml"
connections:
my-athena:
driver: athena
region: us-east-1
s3_staging_dir: s3://my-bucket/athena-results/
```
With optional fields:
```yaml title="ktx.yaml"
connections:
my-athena:
driver: athena
region: us-east-1
s3_staging_dir: env:ATHENA_S3_STAGING_DIR
workgroup: analytics
catalog: AwsDataCatalog
database: my_default_database
databases:
- analytics
- raw
```
`ktx setup` writes the `databases` array when you select Glue databases during setup. `ktx scan` reads it to limit introspection to those databases.
### Authentication
**ktx** uses the AWS SDK default credential chain — no credentials appear in `ktx.yaml`. The chain resolves credentials in this order:
| Method | How to configure |
|--------|-----------------|
| Environment variables | Set `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, and optionally `AWS_SESSION_TOKEN` |
| Shared credentials file | Configure `~/.aws/credentials` with a `[default]` or named profile; set `AWS_PROFILE` to select a non-default profile |
| IAM instance profile | Attach an IAM role to the EC2 instance or ECS task — no local configuration needed |
| IAM roles for service accounts (EKS) | Annotate the pod's service account with the IAM role ARN |
The IAM principal must have `athena:StartQueryExecution`, `athena:GetQueryExecution`, `athena:GetQueryResults`, `glue:GetDatabases`, and `glue:GetTables` permissions, plus read access to the S3 results bucket.
### Features
| Feature | Supported | Notes |
|---------|-----------|-------|
| Tables & views | Yes | Via AWS Glue Data Catalog |
| Primary keys | No | Glue does not expose constraint metadata |
| Foreign keys | No | Not available in Glue/Athena |
| Row count estimates | No | Glue table statistics are often stale |
| Column statistics | No | - |
| Query history | No | - |
| Table sampling | Yes | `SELECT ... LIMIT n` |
### Dialect notes
- SQL dialect is Presto/Trino; identifiers are quoted with double-quotes
- Table names use three-part format: `"catalog"."database"."table"` (e.g. `"AwsDataCatalog"."analytics"."orders"`)
- Partition columns (`PartitionKeys` in Glue) are included after regular columns in the schema and are fully queryable
- Athena does not support `TABLESAMPLE`; random sampling uses `ORDER BY rand()`
- Query execution is asynchronous: **ktx** starts the query, polls until completion, then fetches results from S3
- Results are stored in `s3_staging_dir`; the IAM principal needs write access to that bucket
- Use `workgroup` to apply per-workgroup cost controls and result configuration
- The connector always uses your account's default Glue Data Catalog; cross-account catalog access (`CatalogId` pointing to another account) is not supported
---
## MySQL
Standard MySQL/MariaDB connector with full foreign key support and schema introspection.
### Connection config
```yaml title="ktx.yaml"
connections:
my-mysql:
driver: mysql
url: env:MYSQL_DATABASE_URL
```
MySQL supports selecting one or more databases during `ktx setup`. The selected
database scope is stored in `connections.<id>.schemas`, and `ktx scan` reads
exactly those databases.
Or with individual fields:
```yaml title="ktx.yaml"
connections:
my-mysql:
driver: mysql
host: mysql.internal
port: 3306
database: analytics
username: ktx_reader
password: env:MYSQL_PASSWORD
ssl: true
```
### Authentication
| Method | Config |
|--------|--------|
| Password | `password: env:MYSQL_PASSWORD` or `password: file:/path/to/secret` |
| SSL | `ssl: true` or `ssl: { rejectUnauthorized: false }` |
| URL parameters | `?ssl=true` or `?sslmode=required` in connection URL |
### Features
| Feature | Supported | Notes |
|---------|-----------|-------|
| Tables & views | Yes | Via `INFORMATION_SCHEMA.TABLES` |
| Primary keys | Yes | Via `KEY_COLUMN_USAGE` |
| Foreign keys | Yes | Via `REFERENTIAL_CONSTRAINTS` |
| Row count estimates | Yes | From `TABLE_ROWS` (InnoDB estimate) |
| Column statistics | No | - |
| Query history | No | - |
| Table sampling | Yes | Uses `RAND()` filter |
### Dialect notes
- Parameter binding uses positional `?` placeholders
- Uses `LIMIT X OFFSET Y` for pagination
- Multi-database scanning uses `schemas` as the selected database list
- Supports 20+ MySQL types including `enum`, `json`, `datetime`, `decimal`
- Table comments extracted with InnoDB metadata prefix stripping
---
## ClickHouse
Connects to ClickHouse over HTTP. Supports table and column introspection across
one or more selected databases.
### Connection config
```yaml title="ktx.yaml"
connections:
my-clickhouse:
driver: clickhouse
url: env:CLICKHOUSE_DATABASE_URL
database: analytics
```
For multiple databases:
```yaml
databases:
- analytics
- mart
```
ClickHouse supports selecting one or more databases during `ktx setup`. The
selected scan scope is stored in `connections.<id>.databases`. The single
`database` field remains the connection default for raw SQL and `ktx sql`.
### Authentication
| Method | Config |
|--------|--------|
| URL | `url: env:CLICKHOUSE_DATABASE_URL` |
| Password | `password: env:CLICKHOUSE_PASSWORD` or `password: file:/path/to/secret` |
### Features
| Feature | Supported | Notes |
|---------|-----------|-------|
| Tables & views | Yes | Via `system.tables` |
| Primary keys | No | Not exposed as relational constraints |
| Foreign keys | No | Not available in ClickHouse |
| Row count estimates | Yes | From ClickHouse metadata where available |
| Column statistics | No | - |
| Query history | No | - |
| Table sampling | Yes | Uses ClickHouse sampling syntax when supported |
### Dialect notes
- Parameter binding uses named placeholders
- The `database` field sets the default database for SQL execution
- The `databases` array controls the scan scope
---
## SQL Server
Connects to Microsoft SQL Server and Azure SQL. Supports multi-schema scanning with `dbo` as the default schema.
### Connection config
```yaml title="ktx.yaml"
connections:
my-sqlserver:
driver: sqlserver
url: env:SQLSERVER_DATABASE_URL
```
Or with individual fields:
```yaml title="ktx.yaml"
connections:
my-sqlserver:
driver: sqlserver
host: sql.internal
port: 1433
database: Analytics
username: ktx_reader
password: env:MSSQL_PASSWORD
schema: dbo
trustServerCertificate: true
```
For multiple schemas:
```yaml
schemas:
- dbo
- analytics
- staging
```
### Authentication
| Method | Config |
|--------|--------|
| SQL Server auth | `username` + `password` |
| Encrypted connection | Always enabled, `trustServerCertificate: true` for self-signed |
### Features
| Feature | Supported | Notes |
|---------|-----------|-------|
| Tables & views | Yes | Via `INFORMATION_SCHEMA.TABLES` |
| Primary keys | Yes | Via `TABLE_CONSTRAINTS` and `KEY_COLUMN_USAGE` |
| Foreign keys | Yes | Via `REFERENTIAL_CONSTRAINTS` |
| Row count estimates | Yes | Via `sys.dm_db_partition_stats` |
| Column statistics | No | - |
| Query history | No | - |
| Table sampling | Yes | - |
| Nested analysis | No | - |
### Dialect notes
- Parameter binding uses `@paramName` syntax
- Row limiting uses `SELECT TOP N * FROM (query) AS ktx_query_result`
- Encryption is always required; certificate validation is optional
- Multi-schema support with per-schema isolation
---
## SQLite
File-based connector using `better-sqlite3`. Ideal for local development, embedded analytics, or testing.
### Connection config
```yaml title="ktx.yaml"
connections:
my-sqlite:
driver: sqlite
path: ./data/warehouse.sqlite
```
Path supports multiple formats:
```yaml
# Relative path (resolved against project directory)
path: ./warehouse.sqlite
# Absolute path
path: /var/data/analytics.db
# Home directory expansion
path: ~/data/warehouse.sqlite
# Environment variable
path: env:SQLITE_DB_PATH
# URL format
url: sqlite:///path/to/db.sqlite
```
### Authentication
No authentication required - SQLite is file-based. The file must be readable by the process running **ktx**.
### Features
| Feature | Supported | Notes |
|---------|-----------|-------|
| Tables & views | Yes | Via `sqlite_master` |
| Primary keys | Yes | Via `PRAGMA table_info()` |
| Foreign keys | Yes | Via `PRAGMA foreign_key_list()` (requires `PRAGMA foreign_keys = ON`) |
| Row count estimates | Yes | Exact count via `SELECT COUNT(*)` |
| Column statistics | No | - |
| Query history | No | - |
| Table sampling | Yes | - |
| Nested analysis | No | - |
### Dialect notes
- Synchronous query execution (no connection pooling)
- Parameter binding uses `:paramName` syntax
- Uses `LIMIT X OFFSET Y` for pagination
- SQLite type affinity system: `TEXT`, `NUMERIC`, `INTEGER`, `REAL`, `BLOB`
- Foreign key enforcement requires explicit `PRAGMA foreign_keys = ON`
- Database file must exist before `ktx connection test` or ingest runs
---
## DuckDB
File-based connector using the DuckDB Node.js API. Ideal for local analytics, embedded warehouses, and cross-database federation.
### Connection config
```yaml title="ktx.yaml"
connections:
warehouse:
driver: duckdb
path: data/warehouse.duckdb
```
`path` is resolved relative to the project directory. The `.duckdb` file must already exist — **ktx** never creates a missing database file.
### Authentication
No authentication required — DuckDB is file-based. The `.duckdb` file must be readable by the process running **ktx**.
### Features
| Feature | Supported | Notes |
|---------|-----------|-------|
| Tables & views | Yes | Via `information_schema` on the `main` schema |
| Primary keys | Yes | Via `information_schema.table_constraints` |
| Foreign keys | Yes | Via DuckDB's `duckdb_constraints()` catalog function |
| Row count estimates | Yes | Exact count via `SELECT COUNT(*)` |
| Column statistics | No | - |
| Query history | No | - |
| Table sampling | Yes | - |
| Nested analysis | No | - |
### Dialect notes
- Introspection scans the `main` schema only
- Execution is read-only; **ktx** opens the file without write access
- Parameter binding uses positional `?` placeholders
- Uses `LIMIT X OFFSET Y` for pagination
- Database file must exist before `ktx connection test` or ingest runs
### Cross-database federation
When a project declares two or more attach-compatible connections — any combination of `postgres`, `mysql`, `sqlite`, and `duckdb` — **ktx** derives a cross-database federation connection. That connection can ATTACH a native `.duckdb` file, allowing semantic queries to join across sources without manually copying data.
---
## MongoDB
Connects to MongoDB as a primary context source. **ktx** treats each collection
as a table and each inferred top-level field as a column. MongoDB is a non-SQL
source: `ktx sql` and semantic-layer metric compilation do not apply to a MongoDB
connection, but its collections still flow through `ktx ingest`, descriptions, and
relationship discovery.
### Connection config
```yaml title="ktx.yaml"
connections:
mongo-prod:
driver: mongodb
url: env:MONGO_URL
databases: [app]
enabled_tables: [app.users, app.orders] # optional collection allowlist
sample_size: 1000
# order_by: createdAt # only when _id is not an ObjectId
```
Standard `mongodb://` and `mongodb+srv://` connection strings are supported,
including TLS and MongoDB Atlas — pass the full connection string (with its
query parameters) as `url`. The `databases` list selects which databases to
introspect; if omitted, **ktx** uses the database in the URL path. `ktx setup`
also offers MongoDB and stores the selected databases under
`connections.<id>.databases`.
### Authentication
| Method | Config |
|--------|--------|
| Connection string | `url: env:MONGO_URL` or `url: file:/path/to/secret` |
| Atlas / TLS | Use a `mongodb+srv://` URL with the credentials and TLS options Atlas provides |
### Schema inference
MongoDB has no fixed schema, so **ktx** infers one by sampling the most recent
`sample_size` documents per collection (default 1000), sorted by `_id`
descending. Because an ObjectId embeds its creation time, this captures the
collection's current shape with zero configuration. When `_id` is not an
ObjectId (custom string or UUID keys), set `order_by` to a timestamp field such
as `createdAt` so "most recent" is well-defined. A custom `order_by` field
should be indexed — an unindexed sort hits MongoDB's in-memory sort limit and
fails on large collections (`_id`, the default, is always indexed).
For each top-level field, **ktx** unions the BSON types seen and derives
nullability from how often the field is present:
- Scalar BSON types map to `string`, `number`, `time`, or `boolean`
- A field seen with more than one type is recorded as `mixed` and treated as a string
- Sub-documents and arrays become a single opaque `json` column (no dotted-path
columns); their sampled values are stringified, not faithfully serialized
- `_id` is the primary key
### Features
| Feature | Supported | Notes |
|---------|-----------|-------|
| Collections (as tables) | Yes | Via `listCollections`; `system.*` collections are excluded |
| Primary keys | Yes | `_id` |
| Foreign keys | No | MongoDB has no formal foreign keys |
| Row count estimates | Yes | Via `estimatedDocumentCount` |
| Column statistics | No | - |
| Query history | No | - |
| Table sampling | Yes | Reads the most recent documents |
| Nested analysis | Yes | Sub-documents and arrays modeled as opaque `json` |
| Read-only SQL (`ktx sql`) | No | MongoDB is not a SQL source |
### Dialect notes
- Strictly read-only: the connector only issues `find`, `listCollections`,
`estimatedDocumentCount`, and read aggregations
- Sampling rides the `_id` index and uses a server-side time limit so large
collections do not stall a run; a custom `order_by` must be indexed for the
same guarantee
- `sample_size` trades inference coverage for speed; raise it for collections
with highly variable documents
## Common errors
| Error or symptom | Likely cause | Recovery |
|------------------|--------------|----------|
| Connection URL appears in git diff | A literal credential URL was written to `ktx.yaml` | Replace it with `env:NAME` or `file:/path/to/secret` and rotate exposed credentials |
| Database ingest returns no tables | Schema, database, or project filter is wrong, or the user lacks metadata permissions | Verify the schema list and grant metadata read permissions. For Athena, confirm the IAM principal has `glue:GetDatabases` and `glue:GetTables` permissions |
| Query history is empty | Query history extension or warehouse history view is unavailable | Enable the warehouse-specific history feature, then rerun `ktx ingest <connectionId> --query-history` or `ktx setup` |
| Column statistics are missing | Connector cannot access stats tables or the warehouse does not expose them | Grant stats permissions where supported; otherwise rely on schema-level context without column statistics |
| Semantic query execution fails | Connection is missing, unreachable, or query execution is disabled | Run `ktx connection test <id>` and check the `ktx sl query` flags |
| Athena query fails with `ACCESS_DENIED` | IAM principal lacks `athena:StartQueryExecution` or S3 write access to `s3_staging_dir` | Attach a policy granting Athena query permissions and `s3:PutObject` on the staging bucket |
| Athena ingest finds databases but no tables | IAM principal has `glue:GetDatabases` but not `glue:GetTables` | Grant `glue:GetTables` on the relevant Glue catalog resources |