feat(connector): add Amazon Athena connector via Glue Data Catalog (#309)

* feat(connector): add Amazon Athena connector via Glue Data Catalog

* fix(athena): address reviewer feedback

* fix(athena): wire scope discovery, fix normalizeDriver, tighten types and tests

* fix(athena): honor databases scope, wire sql-analysis dialect, harden config resolution

- introspect() limits to the configured `databases` scope instead of scanning
  every Glue database in the account (docs promised this; connector ignored it)
- add athena -> athena to sql-analysis SQLGLOT_DIALECTS so `ktx sql` and MCP
  read-only validation parse Athena SQL under the Trino grammar, not postgres
- stringConfigValue coerces a resolved-empty `env:` reference to undefined so
  optional fields fall back to their defaults (workgroup 'primary', catalog
  'AwsDataCatalog') instead of ''
- drop trailing whitespace in dialect.test.ts

* fix(athena): integrate with main's SQL/non-SQL dialect split and add dialect notes

Rebase onto main, which introduced the KtxDialect (core) vs KtxSqlDialect
(SQL-only) split for MongoDB:

- KtxAthenaDialect implements KtxSqlDialect; the connector resolves it via
  getSqlDialectForDriver so SQL-generation methods stay in scope
- add authored athena.md SQL notes for the sql_dialect_notes MCP tool, required
  now that athena resolves to the athena sqlglot dialect (dialect-notes coverage
  is derived from the warehouse-driver registry)

---------

Co-authored-by: Andrey Avtomonov <andreybavt@gmail.com>
2026-07-04 10:52:13 +02:00 · 2026-07-02 06:00:26 -07:00 · 2026-07-02 06:00:26 -07:00 · fe7e6bd1fa
commit fe7e6bd1fa
parent 6d01030745
24 changed files with 2047 additions and 6 deletions
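
The config-resolution fix described in the commit message (a resolved-empty `env:` reference falling back to the field default) can be pictured with a minimal TypeScript sketch. This is not the connector's actual code: `resolveOptionalString`, the `Env` type, and the whitespace trimming are assumptions made for illustration. Only the defaults (`primary`, `AwsDataCatalog`) come from the commit.

```typescript
// Hypothetical sketch; these names are illustrative, not taken from the ktx codebase.
type Env = Record<string, string | undefined>;

// Resolve a raw config value that may be a literal or an `env:NAME` reference.
// An unset or empty result is coerced to undefined so callers can apply defaults.
function resolveOptionalString(raw: unknown, env: Env): string | undefined {
  if (typeof raw !== "string") return undefined;
  const resolved = raw.startsWith("env:") ? env[raw.slice(4)] : raw;
  if (resolved === undefined) return undefined;
  const trimmed = resolved.trim(); // assumption: whitespace-only counts as empty
  return trimmed === "" ? undefined : trimmed;
}

// The variable is set but empty, so the default should win.
const env: Env = { ATHENA_WORKGROUP: "" };
const workgroup = resolveOptionalString("env:ATHENA_WORKGROUP", env) ?? "primary";
const catalog = resolveOptionalString(undefined, env) ?? "AwsDataCatalog";
```

The coercion matters because `??` only falls back on `null` or `undefined`; an empty string is not nullish, so without it the workgroup would stay `''`.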
--- a/docs-site/content/docs/integrations/primary-sources.mdx
+++ b/docs-site/content/docs/integrations/primary-sources.mdx
@@ -1,6 +1,6 @@
 ---
 title: Primary Sources
-description: Connect ktx to PostgreSQL, Snowflake, BigQuery, MySQL, ClickHouse, SQL Server, SQLite, DuckDB, or MongoDB.
+description: Connect ktx to PostgreSQL, Snowflake, BigQuery, MySQL, ClickHouse, SQL Server, SQLite, DuckDB, MongoDB, or Amazon Athena.
 ---

 **ktx** connects to your data warehouse or database to build schema context,
@@ -26,17 +26,23 @@ Agents should prefer environment or file references over literal secrets.

 | Field | Required | Applies to | Description |
 |-------|----------|------------|-------------|
-| `driver` | Yes | all connections | Connector driver such as `postgres`, `snowflake`, `bigquery`, `mysql`, `clickhouse`, `sqlserver`, `sqlite`, `duckdb`, or `mongodb` |
+| `driver` | Yes | all connections | Connector driver such as `postgres`, `snowflake`, `bigquery`, `mysql`, `clickhouse`, `sqlserver`, `sqlite`, `duckdb`, `mongodb`, or `athena` |
 | `url` | One of the connection methods | URL-style connectors | Database URL, `env:NAME`, or `file:/path/to/secret` |
 | `host`, `port`, `database`, `username`, `password` | One of the connection methods | PostgreSQL, MySQL, SQL Server | Field-by-field connection values |
 | `schema` or `schemas` | No | schema-aware warehouses | Single schema or list of schemas to scan |
-| `databases` | No | ClickHouse, MongoDB | List of databases to scan |
+| `databases` | No | ClickHouse, MongoDB, Athena | List of databases to scan |
 | `sample_size`, `order_by` | No | MongoDB | Schema-inference sampling controls (recent documents, sort field) |
 | `context.queryHistory` | No | PostgreSQL, Snowflake, BigQuery | Enables query-history ingestion when the warehouse supports it |
 | `path` | Yes for path-style SQLite/DuckDB | SQLite, DuckDB | Local SQLite or DuckDB database path or `env:NAME` reference |
 | `max_bytes_billed` | No | BigQuery | Maximum bytes billed per query job |
 | `query_timeout_ms` | No | all warehouses | Maximum execution time for a single read-only query, in milliseconds (default 30000). A query exceeding it is cancelled server-side (or, for SQLite, by terminating the off-process executor) and returns a `query exceeded Ns` error so the agent can revise. |
 | `project_id` | No | BigQuery | Optional local descriptor and mapping metadata; not used for BigQuery authentication |
+| `region` | Yes | Athena | AWS region where the Athena workgroup and Glue catalog reside (e.g. `us-east-1`) |
+| `s3_staging_dir` | Yes | Athena | S3 URI for Athena query result storage (e.g. `s3://my-bucket/athena-results/`) |
+| `workgroup` | No | Athena | Athena workgroup name (default `primary`) |
+| `catalog` | No | Athena | Glue Data Catalog name (default `AwsDataCatalog`) |
+| `database` | No | Athena | Default Glue database name passed as the query execution context |
+| `databases` | No | Athena | Glue databases to include in schema scans; written by `ktx setup` and read by `ktx ingest` |

 ## PostgreSQL

@@ -304,6 +310,76 @@ staged artifact shape as Postgres and Snowflake.

 ---

+## Amazon Athena
+
+Connects to Amazon Athena using the AWS Glue Data Catalog for schema introspection and the Athena query API for read-only SQL execution. Authentication uses the standard AWS credential chain — no credentials are embedded in `ktx.yaml`.
+
+### Connection config
+
+```yaml title="ktx.yaml"
+connections:
+  my-athena:
+    driver: athena
+    region: us-east-1
+    s3_staging_dir: s3://my-bucket/athena-results/
+```
+
+With optional fields:
+
+```yaml title="ktx.yaml"
+connections:
+  my-athena:
+    driver: athena
+    region: us-east-1
+    s3_staging_dir: env:ATHENA_S3_STAGING_DIR
+    workgroup: analytics
+    catalog: AwsDataCatalog
+    database: my_default_database
+    databases:
+      - analytics
+      - raw
+```
+
+`ktx setup` writes the `databases` array when you select Glue databases during setup. `ktx ingest` reads it to limit introspection to those databases.
+
+### Authentication
+
+**ktx** uses the AWS SDK default credential chain, so no credentials appear in `ktx.yaml`. Commonly used credential sources include:
+
+| Method | How to configure |
+|--------|-----------------|
+| Environment variables | Set `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, and optionally `AWS_SESSION_TOKEN` |
+| Shared credentials file | Configure `~/.aws/credentials` with a `[default]` or named profile; set `AWS_PROFILE` to select a non-default profile |
+| IAM role (EC2 instance profile or ECS task role) | Attach an IAM role to the EC2 instance or ECS task; no local configuration is needed |
+| IAM roles for service accounts (EKS) | Annotate the pod's service account with the IAM role ARN |
+
+The IAM principal must have `athena:StartQueryExecution`, `athena:GetQueryExecution`, `athena:GetQueryResults`, `glue:GetDatabases`, and `glue:GetTables` permissions, plus read and write access to the S3 location set in `s3_staging_dir`.
+
+### Features
+
+| Feature | Supported | Notes |
+|---------|-----------|-------|
+| Tables & views | Yes | Via AWS Glue Data Catalog |
+| Primary keys | No | Glue does not expose constraint metadata |
+| Foreign keys | No | Not available in Glue/Athena |
+| Row count estimates | No | Glue table statistics are often stale |
+| Column statistics | No | - |
+| Query history | No | - |
+| Table sampling | Yes | `SELECT ... LIMIT n` |
+
+### Dialect notes
+
+- SQL dialect is Presto/Trino; identifiers are quoted with double-quotes
+- Table names use three-part format: `"catalog"."database"."table"` (e.g. `"AwsDataCatalog"."analytics"."orders"`)
+- Partition columns (`PartitionKeys` in Glue) are included after regular columns in the schema and are fully queryable
+- The connector does not use `TABLESAMPLE`; random sampling uses `ORDER BY rand()`
+- Query execution is asynchronous: **ktx** starts the query, polls until completion, then fetches results from S3
+- Results are stored in `s3_staging_dir`; the IAM principal needs write access to that bucket
+- Use `workgroup` to apply per-workgroup cost controls and result configuration
+- The connector always uses your account's default Glue Data Catalog; cross-account catalog access (`CatalogId` pointing to another account) is not supported
+
+---
+
 ## MySQL

 Standard MySQL/MariaDB connector with full foreign key support and schema introspection.
@@ -675,7 +751,9 @@ nullability from how often the field is present:
 | Error or symptom | Likely cause | Recovery |
 |------------------|--------------|----------|
 | Connection URL appears in git diff | A literal credential URL was written to `ktx.yaml` | Replace it with `env:NAME` or `file:/path/to/secret` and rotate exposed credentials |
-| Database ingest returns no tables | Schema, database, or project filter is wrong, or the user lacks metadata permissions | Verify the schema list and grant metadata read permissions |
+| Database ingest returns no tables | Schema, database, or project filter is wrong, or the user lacks metadata permissions | Verify the schema list and grant metadata read permissions. For Athena, confirm the IAM principal has `glue:GetDatabases` and `glue:GetTables` permissions |
 | Query history is empty | Query history extension or warehouse history view is unavailable | Enable the warehouse-specific history feature, then rerun `ktx ingest <connectionId> --query-history` or `ktx setup` |
 | Column statistics are missing | Connector cannot access stats tables or the warehouse does not expose them | Grant stats permissions where supported; otherwise rely on schema-level context without column statistics |
 | Semantic query execution fails | Connection is missing, unreachable, or query execution is disabled | Run `ktx connection test <id>` and check the `ktx sl query` flags |
+| Athena query fails with `ACCESS_DENIED` | IAM principal lacks `athena:StartQueryExecution` or S3 write access to `s3_staging_dir` | Attach a policy granting Athena query permissions and `s3:PutObject` on the staging bucket |
+| Athena ingest finds databases but no tables | IAM principal has `glue:GetDatabases` but not `glue:GetTables` | Grant `glue:GetTables` on the relevant Glue catalog resources |
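
The dialect notes in this diff describe double-quoted, three-part table names and `ORDER BY rand()` sampling. A minimal TypeScript sketch of building such a query follows. The function names are hypothetical, and doubling embedded double quotes is an assumption based on standard SQL identifier quoting, not on the connector's source.

```typescript
// Hypothetical helpers; not the connector's actual implementation.

// Quote one identifier with double quotes, doubling any embedded double quote.
function quoteIdentifier(name: string): string {
  return `"${name.replace(/"/g, '""')}"`;
}

// Build the three-part name: "catalog"."database"."table".
function qualifiedTableName(catalog: string, database: string, table: string): string {
  return [catalog, database, table].map(quoteIdentifier).join(".");
}

// Build a random-sample query using ORDER BY rand() and LIMIT.
function randomSampleSql(tableName: string, limit: number): string {
  if (!Number.isInteger(limit) || limit <= 0) {
    throw new RangeError("limit must be a positive integer");
  }
  return `SELECT * FROM ${tableName} ORDER BY rand() LIMIT ${limit}`;
}

const ordersTable = qualifiedTableName("AwsDataCatalog", "analytics", "orders");
const sampleSql = randomSampleSql(ordersTable, 100);
```

Sorting by `rand()` generally means reading the whole table before the limit applies, so a plain `SELECT ... LIMIT n` is likely cheaper when a non-random sample is acceptable.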