---
title: Primary Sources
description: Connect ktx to PostgreSQL, Snowflake, BigQuery, MySQL, ClickHouse, SQL Server, SQLite, DuckDB, MongoDB, or Amazon Athena.
---

**ktx** connects to your data warehouse or database to build schema context,
discover relationships, and execute semantic layer queries. Each connection is
defined in `ktx.yaml` under the `connections` key.

For analytics tools and knowledge systems such as dbt, MetricFlow, LookML,
Metabase, Looker, and Notion, use [Context Sources](/docs/integrations/context-sources).
For Claude Code, Codex, Cursor, OpenCode, and other agent clients, use
[Agent Clients](/docs/integrations/agent-clients).

All connectors share these conventions:

- Sensitive values support `env:VAR_NAME` (read from environment) and
  `file:/path/to/secret` (read from file) references
- Connections are read-only; **ktx** never writes to your database
- Database ingest discovers tables, columns, types, and constraints
  automatically
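
As a sketch of the reference conventions, the two connections below keep every secret out of `ktx.yaml`. The connection ids, the host, and the secret path `/run/secrets/pg_password` are illustrative placeholders, not names that **ktx** defines.

```yaml title="ktx.yaml"
connections:
  reporting-url:                    # hypothetical connection id
    driver: postgres
    url: env:DATABASE_URL           # whole URL read from the environment
  reporting-fields:                 # hypothetical connection id
    driver: postgres
    host: db.internal               # placeholder host
    port: 5432
    database: analytics
    username: ktx_reader
    password: file:/run/secrets/pg_password   # secret read from a file
```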

## Connection field reference

Agents should prefer environment or file references over literal secrets.

| Field | Required | Applies to | Description |
|-------|----------|------------|-------------|
| `driver` | Yes | all connections | Connector driver such as `postgres`, `snowflake`, `bigquery`, `mysql`, `clickhouse`, `sqlserver`, `sqlite`, `duckdb`, `mongodb`, or `athena` |
| `url` | One of the connection methods | URL-style connectors | Database URL, `env:NAME`, or `file:/path/to/secret` |
| `host`, `port`, `database`, `username`, `password` | One of the connection methods | PostgreSQL, MySQL, SQL Server | Field-by-field connection values |
| `schema` or `schemas` | No | schema-aware warehouses | Single schema or list of schemas to scan |
| `databases` | No | ClickHouse, MongoDB, Athena | List of databases to scan |
| `sample_size`, `order_by` | No | MongoDB | Schema-inference sampling controls (recent documents, sort field) |
| `context.queryHistory` | No | PostgreSQL, Snowflake, BigQuery | Enables query-history ingestion when the warehouse supports it |
| `path` | Yes for path-style SQLite/DuckDB | SQLite, DuckDB | Local SQLite or DuckDB database path or `env:NAME` reference |
| `max_bytes_billed` | No | BigQuery | Maximum bytes billed per query job |
| `query_timeout_ms` | No | all warehouses | Maximum execution time for a single read-only query, in milliseconds (default 30000). A query exceeding it is cancelled server-side (or, for SQLite, by terminating the off-process executor) and returns a `query exceeded Ns` error so the agent can revise. |
| `project_id` | No | BigQuery | Optional local descriptor and mapping metadata; not used for BigQuery authentication |
| `region` | Yes | Athena | AWS region where the Athena workgroup and Glue catalog reside (e.g. `us-east-1`) |
| `s3_staging_dir` | Yes | Athena | S3 URI for Athena query result storage (e.g. `s3://my-bucket/athena-results/`) |
| `workgroup` | No | Athena | Athena workgroup name (default `primary`) |
| `catalog` | No | Athena | Glue Data Catalog name (default `AwsDataCatalog`) |
| `database` | No | Athena | Default Glue database name passed as the query execution context |
| `databases` | No | Athena | Glue databases to include in schema scans; written by `ktx setup` and read by `ktx ingest` |
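
For instance, a connection that serves slower analytical queries could raise `query_timeout_ms` above its 30000 ms default. This is a minimal sketch: the 120000 value is illustrative, and it assumes the field sits at the connection level like the other fields in this table.

```yaml title="ktx.yaml"
connections:
  my-postgres:
    driver: postgres
    url: env:DATABASE_URL
    schema: public
    query_timeout_ms: 120000   # 2 minutes; omit to keep the 30000 ms default
```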

## PostgreSQL

The most full-featured connector. Supports schema introspection, foreign key detection, column statistics, and query history via `pg_stat_statements`.

### Connection config

```yaml title="ktx.yaml"
connections:
  my-postgres:
    driver: postgres
    url: env:DATABASE_URL
    schema: public
```

Or with individual fields:

```yaml title="ktx.yaml"
connections:
  my-postgres:
    driver: postgres
    host: localhost
    port: 5432
    database: analytics
    username: ktx_reader
    password: env:PG_PASSWORD
    schemas:
      - public
      - analytics
    ssl: true
```

### Authentication

| Method | Config |
|--------|--------|
| Password | `password: env:PG_PASSWORD` or `password: file:/path/to/secret` |
| Connection URL | `url: env:DATABASE_URL` |
| SSL | `ssl: true`, optionally `rejectUnauthorized: false` for self-signed certs |

### Features

| Feature | Supported | Notes |
|---------|-----------|-------|
| Tables & views | Yes | Via `pg_catalog` |
| Primary keys | Yes | Via `information_schema.table_constraints` |
| Foreign keys | Yes | Full constraint detection |
| Row count estimates | Yes | Via `pg_class.reltuples` |
| Column statistics | Yes | Requires `pg_read_all_stats` role |
| Query history | Yes | Via `pg_stat_statements` extension |
| Table sampling | Yes | `TABLESAMPLE SYSTEM` |

### Query history

PostgreSQL query history mines real query patterns from `pg_stat_statements`.
This helps **ktx** understand how your team actually queries the data.

**Requirements:**
- `pg_stat_statements` extension enabled
- `pg_read_all_stats` role granted to the **ktx** user

**Config options:**

```yaml
    context:
      queryHistory:
        enabled: true
        minExecutions: 5
        filters:
          dropTrivialProbes: true
```
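
The fragment above nests under a connection entry. A complete connection might look like the following sketch, which reuses the `my-postgres` example and assumes `ktx_reader` is the database user. The SQL in the comments is standard PostgreSQL that runs on the database side, not through **ktx**.

```yaml title="ktx.yaml"
# Database-side prerequisites (run once by an administrator):
#   pg_stat_statements must be listed in shared_preload_libraries
#   CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
#   GRANT pg_read_all_stats TO ktx_reader;
connections:
  my-postgres:
    driver: postgres
    url: env:DATABASE_URL
    schema: public
    context:
      queryHistory:
        enabled: true
        minExecutions: 5          # value taken from the fragment above
        filters:
          dropTrivialProbes: true
```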

### Dialect notes

- SQL compilation uses `LIMIT/OFFSET` pagination
- Named parameters converted to positional (`$1`, `$2`, ...)
- Supports `COUNT(*) FILTER (WHERE ...)` for null analysis
- Full support for PostgreSQL types: `uuid`, `jsonb`, `timestamptz`, `numeric`, `text[]`, etc.

---

## Snowflake

Connects via the Snowflake SDK. Supports multi-schema scanning, RSA key-pair authentication, and query-history ingestion from `SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY`.

### Connection config

```yaml title="ktx.yaml"
connections:
  my-snowflake:
    driver: snowflake
    account: xy12345
    warehouse: ANALYTICS_WH
    database: PROD
    schema_names:
      - PUBLIC
      - SALES
      - MARKETING
    username: KTX_SERVICE
    password: env:SNOWFLAKE_PASSWORD
    role: ANALYST
```

`ktx setup` discovers schemas after the connection is verified and writes the
selected list to `schema_names`. You can also set this field manually. For a
single schema, `schema_name: PUBLIC` is accepted as an equivalent shorthand.

### Authentication

| Method | Config |
|--------|--------|
| Password | `password: env:SNOWFLAKE_PASSWORD` |
| RSA key pair | `authMethod: rsa`, `privateKey: file:~/.ssh/snowflake_key.pem`, optional `passphrase` |
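
A key-pair connection could be written as below. This is a sketch that assumes the `authMethod`, `privateKey`, and `passphrase` fields from the table sit at the connection level next to `account` and `username`, and that `passphrase` accepts an `env:` reference like other sensitive values. The variable name `SNOWFLAKE_KEY_PASSPHRASE` is a placeholder.

```yaml title="ktx.yaml"
connections:
  my-snowflake:
    driver: snowflake
    account: xy12345
    warehouse: ANALYTICS_WH
    database: PROD
    schema_names:
      - PUBLIC
    username: KTX_SERVICE
    role: ANALYST
    authMethod: rsa
    privateKey: file:~/.ssh/snowflake_key.pem
    passphrase: env:SNOWFLAKE_KEY_PASSPHRASE   # only needed for an encrypted key
```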

### Features

| Feature | Supported | Notes |
|---------|-----------|-------|
| Tables & views | Yes | Via `INFORMATION_SCHEMA.TABLES` |
| Primary keys | Yes | Via table constraints |
| Foreign keys | No | Not available in Snowflake |
| Row count estimates | Yes | From `INFORMATION_SCHEMA.TABLES.ROW_COUNT` |
| Column statistics | No | - |
| Query history | Yes | Via `SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY` when enabled |
| Table sampling | Yes | - |

### Query history

Snowflake query history reads aggregated query templates from
`SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY` and stages them in the same unified
artifact format that the PostgreSQL and BigQuery connectors produce.

```yaml
    context:
      queryHistory:
        enabled: true
        windowDays: 90
        minExecutions: 5
        filters:
          dropTrivialProbes: true
          serviceAccounts:
            patterns: ['^svc_']
            mode: exclude
        redactionPatterns: []
```

### Dialect notes

- Unquoted identifiers resolve to uppercase by default, so matching is case-insensitive
- Connection context set per query (`USE ROLE`, `USE WAREHOUSE`, `USE DATABASE`, `USE SCHEMA`)
- Parameter binding uses positional `?` placeholders
- Date values normalized to ISO 8601 strings

---

## BigQuery

Authenticates via GCP service account credentials. Supports multi-dataset scanning and query-history ingestion from region-scoped `INFORMATION_SCHEMA.JOBS_BY_PROJECT`.

### Connection config

```yaml title="ktx.yaml"
connections:
  my-bigquery:
    driver: bigquery
    credentials_json: file:~/.config/gcloud/bq-service-account.json
    dataset_id: analytics
    location: US
```

For multiple datasets:

```yaml
    dataset_ids:
      - analytics
      - marketing
      - finance
```

BigQuery dataset scope is stored in `connections.<id>.dataset_ids`. Interactive
setup discovers datasets from credentials plus location, then writes the chosen
dataset ids as the scan scope.

### Cross-project datasets

To introspect a dataset hosted in a **different project** than the one your
credentials bill to — for example Google's `bigquery-public-data`, a partner's
shared project, or an organization's central data project — qualify the entry
as `project.dataset`:

```yaml title="ktx.yaml"
connections:
  public-bq:
    driver: bigquery
    credentials_json: file:~/.config/gcloud/bq-service-account.json
    location: US
    dataset_ids:
      - bigquery-public-data.austin_311
      - bigquery-public-data.census_bureau_usa
      - analytics
```

**ktx** introspects each dataset in its host project, while every query job still
bills to the `project_id` inside your `credentials_json`. A bare `dataset` entry
(no prefix) is scanned in your own project. A single connection may mix datasets
from several projects, and two projects may host datasets with the same name
without colliding.

Interactive setup does not enumerate datasets in projects your credentials don't
own, so hand-write `project.dataset` entries for foreign datasets. The wizard's
table picker also only lists datasets in your connection's `location` region;
this affects table selection only — ingest and `discover_data` introspect a
cross-project dataset regardless of region.

### Authentication

| Method | Config |
|--------|--------|
| Service account JSON | `credentials_json: file:/path/to/key.json` |
| Environment variable | `credentials_json: env:BIGQUERY_CREDENTIALS_JSON` |

The project ID is extracted automatically from the service account JSON file.
If you set `project_id` in `ktx.yaml`, **ktx** treats it as local descriptor and
mapping metadata. The BigQuery connector still authenticates with the
`project_id` inside `credentials_json`.

### Features

| Feature | Supported | Notes |
|---------|-----------|-------|
| Tables & views | Yes | Including materialized views and external tables |
| Primary keys | Yes | Via `INFORMATION_SCHEMA` table constraints when declared |
| Foreign keys | No | Not available in BigQuery |
| Row count estimates | Yes | From table metadata |
| Column statistics | No | - |
| Query history | Yes | Via region-scoped `INFORMATION_SCHEMA.JOBS_BY_PROJECT` when enabled |
| Table sampling | Yes | - |

### Query history

BigQuery query history reads aggregated query templates from
region-scoped `INFORMATION_SCHEMA.JOBS_BY_PROJECT` and stages them in the same
unified artifact format that the PostgreSQL and Snowflake connectors produce.

```yaml
    context:
      queryHistory:
        enabled: true
        windowDays: 90
        minExecutions: 5
        filters:
          dropTrivialProbes: true
          serviceAccounts:
            patterns: ['@bot\.']
            mode: exclude
        redactionPatterns: []
```

### Dialect notes

- Parameter binding uses named `@param` syntax
- Arrays flattened to comma-separated strings in results
- Location specified at query execution time
- Supports the `max_bytes_billed` limit from `ktx.yaml`; the shared `query_timeout_ms` field maps to the query job's `jobTimeoutMs`
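
Both limits can be set on the same connection. The sketch below assumes `max_bytes_billed` takes a plain integer number of bytes and that both fields sit at the connection level; the specific values are illustrative.

```yaml title="ktx.yaml"
connections:
  my-bigquery:
    driver: bigquery
    credentials_json: file:~/.config/gcloud/bq-service-account.json
    dataset_ids:
      - analytics
    location: US
    max_bytes_billed: 10737418240   # 10 GiB = 10 * 1024^3 bytes
    query_timeout_ms: 60000         # 60 s; maps to the job's jobTimeoutMs
```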

---

## Amazon Athena

Connects to Amazon Athena using the AWS Glue Data Catalog for schema introspection and the Athena query API for read-only SQL execution. Authentication uses the standard AWS credential chain — no credentials are embedded in `ktx.yaml`.

### Connection config

```yaml title="ktx.yaml"
connections:
  my-athena:
    driver: athena
    region: us-east-1
    s3_staging_dir: s3://my-bucket/athena-results/
```

With optional fields:

```yaml title="ktx.yaml"
connections:
  my-athena:
    driver: athena
    region: us-east-1
    s3_staging_dir: env:ATHENA_S3_STAGING_DIR
    workgroup: analytics
    catalog: AwsDataCatalog
    database: my_default_database
    databases:
      - analytics
      - raw
```

`ktx setup` writes the `databases` array when you select Glue databases during setup. `ktx ingest` reads it to limit introspection to those databases.

### Authentication

**ktx** uses the AWS SDK default credential chain — no credentials appear in `ktx.yaml`. The chain resolves credentials in this order:

| Method | How to configure |
|--------|-----------------|
| Environment variables | Set `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, and optionally `AWS_SESSION_TOKEN` |
| Shared credentials file | Configure `~/.aws/credentials` with a `[default]` or named profile; set `AWS_PROFILE` to select a non-default profile |
| IAM instance profile | Attach an IAM role to the EC2 instance or ECS task — no local configuration needed |
| IAM roles for service accounts (EKS) | Annotate the pod's service account with the IAM role ARN |

The IAM principal must have `athena:StartQueryExecution`, `athena:GetQueryExecution`, `athena:GetQueryResults`, `glue:GetDatabases`, and `glue:GetTables` permissions, plus read and write access to the S3 results location in `s3_staging_dir` (Athena writes query results there before **ktx** fetches them).
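
These permissions can be expressed as an IAM policy. The sketch below uses CloudFormation YAML; the resource name `KtxAthenaAccess` and the bucket `my-bucket` are placeholders, and the S3 actions listed are an assumption about what the results location needs rather than a list published by **ktx**.

```yaml
Resources:
  KtxAthenaAccess:                      # hypothetical resource name
    Type: AWS::IAM::ManagedPolicy
    Properties:
      PolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Sid: AthenaQueries
            Effect: Allow
            Action:
              - athena:StartQueryExecution
              - athena:GetQueryExecution
              - athena:GetQueryResults
            Resource: "*"
          - Sid: GlueMetadata
            Effect: Allow
            Action:
              - glue:GetDatabases
              - glue:GetTables
            Resource: "*"
          - Sid: AthenaResultsBucket
            Effect: Allow
            Action:
              - s3:GetBucketLocation
              - s3:ListBucket
            Resource: arn:aws:s3:::my-bucket
          - Sid: AthenaResultsObjects     # matches s3_staging_dir
            Effect: Allow
            Action:
              - s3:GetObject
              - s3:PutObject
            Resource: arn:aws:s3:::my-bucket/athena-results/*
```

Where possible, narrow the `"*"` resources to your own workgroup and catalog ARNs. The principal also needs read access to the buckets that hold the underlying table data, which this sketch does not cover.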

### Features

| Feature | Supported | Notes |
|---------|-----------|-------|
| Tables & views | Yes | Via AWS Glue Data Catalog |
| Primary keys | No | Glue does not expose constraint metadata |
| Foreign keys | No | Not available in Glue/Athena |
| Row count estimates | No | Glue table statistics are often stale |
| Column statistics | No | - |
| Query history | No | - |
| Table sampling | Yes | `SELECT ... LIMIT n` |

### Dialect notes

- SQL dialect is Presto/Trino; identifiers are quoted with double-quotes
- Table names use three-part format: `"catalog"."database"."table"` (e.g. `"AwsDataCatalog"."analytics"."orders"`)
- Partition columns (`PartitionKeys` in Glue) are included after regular columns in the schema and are fully queryable
- The connector does not use `TABLESAMPLE`; random sampling uses `ORDER BY rand()`
- Query execution is asynchronous: **ktx** starts the query, polls until completion, then fetches results from S3
- Results are stored in `s3_staging_dir`; the IAM principal needs write access to that bucket
- Use `workgroup` to apply per-workgroup cost controls and result configuration
- The connector always uses your account's default Glue Data Catalog; cross-account catalog access (`CatalogId` pointing to another account) is not supported

---

## MySQL

Standard MySQL/MariaDB connector with full foreign key support and schema introspection.

### Connection config

```yaml title="ktx.yaml"
connections:
  my-mysql:
    driver: mysql
    url: env:MYSQL_DATABASE_URL
```

MySQL supports selecting one or more databases during `ktx setup`. The selected
database scope is stored in `connections.<id>.schemas`, and `ktx ingest` reads
exactly those databases.

To configure the connection with individual fields instead of a URL:

```yaml title="ktx.yaml"
connections:
  my-mysql:
    driver: mysql
    host: mysql.internal
    port: 3306
    database: analytics
    username: ktx_reader
    password: env:MYSQL_PASSWORD
    ssl: true
```
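
To scan more than one database, the selected list lives in `schemas`, as described above. The sketch below writes that list by hand; the database names are placeholders, and it assumes a hand-written list is treated the same as one written by `ktx setup`.

```yaml title="ktx.yaml"
connections:
  my-mysql:
    driver: mysql
    url: env:MYSQL_DATABASE_URL
    schemas:            # for MySQL, each entry is a database name
      - analytics
      - billing
```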

### Authentication

| Method | Config |
|--------|--------|
| Password | `password: env:MYSQL_PASSWORD` or `password: file:/path/to/secret` |
| SSL | `ssl: true` or `ssl: { rejectUnauthorized: false }` |
| URL parameters | `?ssl=true` or `?sslmode=required` in connection URL |

### Features

| Feature | Supported | Notes |
|---------|-----------|-------|
| Tables & views | Yes | Via `INFORMATION_SCHEMA.TABLES` |
| Primary keys | Yes | Via `KEY_COLUMN_USAGE` |
| Foreign keys | Yes | Via `REFERENTIAL_CONSTRAINTS` |
| Row count estimates | Yes | From `TABLE_ROWS` (InnoDB estimate) |
| Column statistics | No | - |
| Query history | No | - |
| Table sampling | Yes | Uses `RAND()` filter |

### Dialect notes

- Parameter binding uses positional `?` placeholders
- Uses `LIMIT X OFFSET Y` for pagination
- Multi-database scanning uses `schemas` as the selected database list
- Supports 20+ MySQL types including `enum`, `json`, `datetime`, `decimal`
- Table comments are extracted with the InnoDB metadata prefix stripped

---

## ClickHouse

Connects to ClickHouse over HTTP. Supports table and column introspection across
one or more selected databases.

### Connection config

```yaml title="ktx.yaml"
connections:
  my-clickhouse:
    driver: clickhouse
    url: env:CLICKHOUSE_DATABASE_URL
    database: analytics
```

For multiple databases:

```yaml
    databases:
      - analytics
      - mart
```

ClickHouse supports selecting one or more databases during `ktx setup`. The
selected scan scope is stored in `connections.<id>.databases`. The single
`database` field remains the connection default for raw SQL and `ktx sql`.

### Authentication

| Method | Config |
|--------|--------|
| URL | `url: env:CLICKHOUSE_DATABASE_URL` |
| Password | `password: env:CLICKHOUSE_PASSWORD` or `password: file:/path/to/secret` |

### Features

| Feature | Supported | Notes |
|---------|-----------|-------|
| Tables & views | Yes | Via `system.tables` |
| Primary keys | No | Not exposed as relational constraints |
| Foreign keys | No | Not available in ClickHouse |
| Row count estimates | Yes | From ClickHouse metadata where available |
| Column statistics | No | - |
| Query history | No | - |
| Table sampling | Yes | Uses ClickHouse sampling syntax when supported |

### Dialect notes

- Parameter binding uses named placeholders
- The `database` field sets the default database for SQL execution
- The `databases` array controls the scan scope

---

## SQL Server

Connects to Microsoft SQL Server and Azure SQL. Supports multi-schema scanning with `dbo` as the default schema.

### Connection config

```yaml title="ktx.yaml"
connections:
  my-sqlserver:
    driver: sqlserver
    url: env:SQLSERVER_DATABASE_URL
```

Or with individual fields:

```yaml title="ktx.yaml"
connections:
  my-sqlserver:
    driver: sqlserver
    host: sql.internal
    port: 1433
    database: Analytics
    username: ktx_reader
    password: env:MSSQL_PASSWORD
    schema: dbo
    trustServerCertificate: true
```

For multiple schemas:

```yaml
    schemas:
      - dbo
      - analytics
      - staging
```

### Authentication

| Method | Config |
|--------|--------|
| SQL Server auth | `username` + `password` |
| Encrypted connection | Always enabled; set `trustServerCertificate: true` for self-signed certificates |

### Features

| Feature | Supported | Notes |
|---------|-----------|-------|
| Tables & views | Yes | Via `INFORMATION_SCHEMA.TABLES` |
| Primary keys | Yes | Via `TABLE_CONSTRAINTS` and `KEY_COLUMN_USAGE` |
| Foreign keys | Yes | Via `REFERENTIAL_CONSTRAINTS` |
| Row count estimates | Yes | Via `sys.dm_db_partition_stats` |
| Column statistics | No | - |
| Query history | No | - |
| Table sampling | Yes | - |
| Nested analysis | No | - |

### Dialect notes

- Parameter binding uses `@paramName` syntax
- Row limiting uses `SELECT TOP N * FROM (query) AS ktx_query_result`
- Encryption is always required; certificate validation is optional
- Multi-schema support with per-schema isolation

---

## SQLite

File-based connector using `better-sqlite3`. Ideal for local development, embedded analytics, or testing.

### Connection config

```yaml title="ktx.yaml"
connections:
  my-sqlite:
    driver: sqlite
    path: ./data/warehouse.sqlite
```

Path supports multiple formats:

```yaml
# Relative path (resolved against project directory)
path: ./warehouse.sqlite

# Absolute path
path: /var/data/analytics.db

# Home directory expansion
path: ~/data/warehouse.sqlite

# Environment variable
path: env:SQLITE_DB_PATH

# URL format
url: sqlite:///path/to/db.sqlite
```

### Authentication

No authentication is required because SQLite is file-based. The file must be readable by the process running **ktx**.

### Features

| Feature | Supported | Notes |
|---------|-----------|-------|
| Tables & views | Yes | Via `sqlite_master` |
| Primary keys | Yes | Via `PRAGMA table_info()` |
| Foreign keys | Yes | Via `PRAGMA foreign_key_list()`, which lists declared keys whether or not `PRAGMA foreign_keys` enforcement is on |
| Row count estimates | Yes | Exact count via `SELECT COUNT(*)` |
| Column statistics | No | - |
| Query history | No | - |
| Table sampling | Yes | - |
| Nested analysis | No | - |

### Dialect notes

- Synchronous query execution (no connection pooling)
- Parameter binding uses `:paramName` syntax
- Uses `LIMIT X OFFSET Y` for pagination
- SQLite type affinity system: `TEXT`, `NUMERIC`, `INTEGER`, `REAL`, `BLOB`
- SQLite enforces foreign keys only when `PRAGMA foreign_keys = ON` is set; introspection reads declared keys either way
- Database file must exist before `ktx connection test` or ingest runs

---

## DuckDB

File-based connector using the DuckDB Node.js API. Ideal for local analytics, embedded warehouses, and cross-database federation.

### Connection config

```yaml title="ktx.yaml"
connections:
  warehouse:
    driver: duckdb
    path: data/warehouse.duckdb
```

`path` is resolved relative to the project directory. The `.duckdb` file must already exist — **ktx** never creates a missing database file.

### Authentication

No authentication required — DuckDB is file-based. The `.duckdb` file must be readable by the process running **ktx**.

### Features

| Feature | Supported | Notes |
|---------|-----------|-------|
| Tables & views | Yes | Via `information_schema` on the `main` schema |
| Primary keys | Yes | Via `information_schema.table_constraints` |
| Foreign keys | Yes | Via DuckDB's `duckdb_constraints()` catalog function |
| Row count estimates | Yes | Exact count via `SELECT COUNT(*)` |
| Column statistics | No | - |
| Query history | No | - |
| Table sampling | Yes | - |
| Nested analysis | No | - |

### Dialect notes

- Introspection scans the `main` schema only
- Execution is read-only; **ktx** opens the file without write access
- Parameter binding uses positional `?` placeholders
- Uses `LIMIT X OFFSET Y` for pagination
- Database file must exist before `ktx connection test` or ingest runs

### Cross-database federation

When a project declares two or more attach-compatible connections — any combination of `postgres`, `mysql`, `sqlite`, and `duckdb` — **ktx** derives a cross-database federation connection. That connection can ATTACH a native `.duckdb` file, allowing semantic queries to join across sources without manually copying data.
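
As an illustration, a project with the two connections below would qualify, since both drivers are attach-compatible. The connection ids and paths are placeholders. This sketch does not show how the derived federation connection is named or addressed, which the text above leaves unspecified.

```yaml title="ktx.yaml"
connections:
  app-db:                         # hypothetical connection id
    driver: postgres
    url: env:DATABASE_URL
    schema: public
  warehouse:
    driver: duckdb
    path: data/warehouse.duckdb   # must already exist
```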

---

## MongoDB

Connects to MongoDB as a primary source. **ktx** treats each collection
as a table and each inferred top-level field as a column. MongoDB is a non-SQL
source: `ktx sql` and semantic-layer metric compilation do not apply to a MongoDB
connection, but its collections still flow through `ktx ingest`, descriptions, and
relationship discovery.

### Connection config

```yaml title="ktx.yaml"
connections:
  mongo-prod:
    driver: mongodb
    url: env:MONGO_URL
    databases: [app]
    enabled_tables: [app.users, app.orders] # optional collection allowlist
    sample_size: 1000
    # order_by: createdAt                    # only when _id is not an ObjectId
```

Standard `mongodb://` and `mongodb+srv://` connection strings are supported,
including TLS and MongoDB Atlas — pass the full connection string (with its
query parameters) as `url`. The `databases` list selects which databases to
introspect; if omitted, **ktx** uses the database in the URL path. `ktx setup`
also offers MongoDB and stores the selected databases under
`connections.<id>.databases`.

### Authentication

| Method | Config |
|--------|--------|
| Connection string | `url: env:MONGO_URL` or `url: file:/path/to/secret` |
| Atlas / TLS | Use a `mongodb+srv://` URL with the credentials and TLS options Atlas provides |

### Schema inference

MongoDB has no fixed schema, so **ktx** infers one by sampling the most recent
`sample_size` documents per collection (default 1000), sorted by `_id`
descending. Because an ObjectId embeds its creation time, this captures the
collection's current shape with zero configuration. When `_id` is not an
ObjectId (custom string or UUID keys), set `order_by` to a timestamp field such
as `createdAt` so "most recent" is well-defined. A custom `order_by` field
should be indexed — an unindexed sort hits MongoDB's in-memory sort limit and
fails on large collections (`_id`, the default, is always indexed).
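
For a collection keyed by custom strings or UUIDs, the sampling order could be configured as in this sketch. The connection id, database name, and the `createdAt` field are placeholders for your own names, and the sketch assumes an index on `createdAt` already exists.

```yaml title="ktx.yaml"
connections:
  mongo-events:                 # hypothetical connection id
    driver: mongodb
    url: env:MONGO_URL
    databases: [events]
    sample_size: 1000           # the default, shown for clarity
    order_by: createdAt         # timestamp field; should be indexed
```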

For each top-level field, **ktx** unions the BSON types seen and derives
nullability from how often the field is present:

- Scalar BSON types map to `string`, `number`, `time`, or `boolean`
- A field seen with more than one type is recorded as `mixed` and treated as a string
- Sub-documents and arrays become a single opaque `json` column (no dotted-path
  columns); their sampled values are stringified, not faithfully serialized
- `_id` is the primary key

### Features

| Feature | Supported | Notes |
|---------|-----------|-------|
| Collections (as tables) | Yes | Via `listCollections`; `system.*` collections are excluded |
| Primary keys | Yes | `_id` |
| Foreign keys | No | MongoDB has no formal foreign keys |
| Row count estimates | Yes | Via `estimatedDocumentCount` |
| Column statistics | No | - |
| Query history | No | - |
| Table sampling | Yes | Reads the most recent documents |
| Nested analysis | Yes | Sub-documents and arrays modeled as opaque `json` |
| Read-only SQL (`ktx sql`) | No | MongoDB is not a SQL source |

### Dialect notes

- Strictly read-only: the connector only issues `find`, `listCollections`,
  `estimatedDocumentCount`, and read aggregations
- Sampling rides the `_id` index and uses a server-side time limit so large
  collections do not stall a run; a custom `order_by` must be indexed for the
  same guarantee
- `sample_size` trades inference coverage for speed; raise it for collections
  with highly variable documents

## Common errors

| Error or symptom | Likely cause | Recovery |
|------------------|--------------|----------|
| Connection URL appears in git diff | A literal credential URL was written to `ktx.yaml` | Replace it with `env:NAME` or `file:/path/to/secret` and rotate exposed credentials |
| Database ingest returns no tables | Schema, database, or project filter is wrong, or the user lacks metadata permissions | Verify the schema list and grant metadata read permissions. For Athena, confirm the IAM principal has `glue:GetDatabases` and `glue:GetTables` permissions |
| Query history is empty | Query history extension or warehouse history view is unavailable | Enable the warehouse-specific history feature, then rerun `ktx ingest <connectionId> --query-history` or `ktx setup` |
| Column statistics are missing | Connector cannot access stats tables or the warehouse does not expose them | Grant stats permissions where supported; otherwise rely on schema-level context without column statistics |
| Semantic query execution fails | Connection is missing, unreachable, or query execution is disabled | Run `ktx connection test <id>` and check the `ktx sl query` flags |
| Athena query fails with `ACCESS_DENIED` | IAM principal lacks `athena:StartQueryExecution` or S3 write access to `s3_staging_dir` | Attach a policy granting Athena query permissions and `s3:PutObject` on the staging bucket |
| Athena ingest finds databases but no tables | IAM principal has `glue:GetDatabases` but not `glue:GetTables` | Grant `glue:GetTables` on the relevant Glue catalog resources |