--- title: Primary Sources description: Connect ktx to PostgreSQL, Snowflake, BigQuery, MySQL, ClickHouse, SQL Server, SQLite, or MongoDB. --- **ktx** connects to your data warehouse or database to build schema context, discover relationships, and execute semantic layer queries. Each connection is defined in `ktx.yaml` under the `connections` key. For analytics tools and knowledge systems such as dbt, MetricFlow, LookML, Metabase, Looker, and Notion, use [Context Sources](/docs/integrations/context-sources). For Claude Code, Codex, Cursor, OpenCode, and other agent clients, use [Agent Clients](/docs/integrations/agent-clients). All connectors share these conventions: - Sensitive values support `env:VAR_NAME` (read from environment) and `file:/path/to/secret` (read from file) references - Connections are read-only; **ktx** never writes to your database - Database ingest discovers tables, columns, types, and constraints automatically ## Connection field reference Agents should prefer environment or file references over literal secrets. | Field | Required | Applies to | Description | |-------|----------|------------|-------------| | `driver` | Yes | all connections | Connector driver such as `postgres`, `snowflake`, `bigquery`, `mysql`, `clickhouse`, `sqlserver`, `sqlite`, or `mongodb` | | `url` | One of the connection methods | URL-style connectors | Database URL, `env:NAME`, or `file:/path/to/secret` | | `host`, `port`, `database`, `username`, `password` | One of the connection methods | PostgreSQL, MySQL, SQL Server | Field-by-field connection values | | `schema` or `schemas` | No | schema-aware warehouses | Single schema or list of schemas to scan | | `databases` | No | ClickHouse, MongoDB | List of databases to scan | | `sample_size`, `order_by` | No | MongoDB | Schema-inference sampling controls (recent documents, sort field) | | `context.queryHistory` | No | PostgreSQL, Snowflake, BigQuery | Enables query-history ingestion when the warehouse supports it | | `path` | Yes for path-style SQLite | SQLite | Local SQLite database path or `env:NAME` reference | | `max_bytes_billed` | No | BigQuery | Maximum bytes billed per query job | | `query_timeout_ms` | No | all warehouses | Maximum execution time for a single read-only query, in milliseconds (default 30000). A query exceeding it is cancelled server-side (or, for SQLite, by terminating the off-process executor) and returns a `query exceeded Ns` error so the agent can revise. | | `project_id` | No | BigQuery | Optional local descriptor and mapping metadata; not used for BigQuery authentication | ## PostgreSQL The most full-featured connector. Supports schema introspection, foreign key detection, column statistics, and query history via `pg_stat_statements`. ### Connection config ```yaml title="ktx.yaml" connections: my-postgres: driver: postgres url: env:DATABASE_URL schema: public ``` Or with individual fields: ```yaml title="ktx.yaml" connections: my-postgres: driver: postgres host: localhost port: 5432 database: analytics username: ktx_reader password: env:PG_PASSWORD schemas: - public - analytics ssl: true ``` ### Authentication | Method | Config | |--------|--------| | Password | `password: env:PG_PASSWORD` or `password: file:/path/to/secret` | | Connection URL | `url: env:DATABASE_URL` | | SSL | `ssl: true`, optionally `rejectUnauthorized: false` for self-signed certs | ### Features | Feature | Supported | Notes | |---------|-----------|-------| | Tables & views | Yes | Via `pg_catalog` | | Primary keys | Yes | Via `information_schema.table_constraints` | | Foreign keys | Yes | Full constraint detection | | Row count estimates | Yes | Via `pg_class.reltuples` | | Column statistics | Yes | Requires `pg_read_all_stats` role | | Query history | Yes | Via `pg_stat_statements` extension | | Table sampling | Yes | `TABLESAMPLE SYSTEM` | ### Query history PostgreSQL query history mines real query patterns from `pg_stat_statements`. This helps **ktx** understand how your team actually queries the data. **Requirements:** - `pg_stat_statements` extension enabled - `pg_read_all_stats` role granted to the **ktx** user **Config options:** ```yaml context: queryHistory: enabled: true minExecutions: 5 filters: dropTrivialProbes: true ``` ### Dialect notes - SQL compilation uses `LIMIT/OFFSET` pagination - Named parameters converted to positional (`$1`, `$2`, ...) - Supports `COUNT(*) FILTER (WHERE ...)` for null analysis - Full support for PostgreSQL types: `uuid`, `jsonb`, `timestamptz`, `numeric`, `text[]`, etc. --- ## Snowflake Connects via the Snowflake SDK. Supports multi-schema scanning, RSA key authentication, and query-history configuration for Snowflake query history. ### Connection config ```yaml title="ktx.yaml" connections: my-snowflake: driver: snowflake account: xy12345 warehouse: ANALYTICS_WH database: PROD schema_names: - PUBLIC - SALES - MARKETING username: KTX_SERVICE password: env:SNOWFLAKE_PASSWORD role: ANALYST ``` `ktx setup` discovers schemas after the connection is verified and writes the selected list to `schema_names`. You can also set this field manually. For a single schema, `schema_name: PUBLIC` is accepted as an equivalent shorthand. ### Authentication | Method | Config | |--------|--------| | Password | `password: env:SNOWFLAKE_PASSWORD` | | RSA key pair | `authMethod: rsa`, `privateKey: file:~/.ssh/snowflake_key.pem`, optional `passphrase` | ### Features | Feature | Supported | Notes | |---------|-----------|-------| | Tables & views | Yes | Via `INFORMATION_SCHEMA.TABLES` | | Primary keys | Yes | Via table constraints | | Foreign keys | No | Not available in Snowflake | | Row count estimates | Yes | From `INFORMATION_SCHEMA.TABLES.ROW_COUNT` | | Column statistics | No | - | | Query history | Yes | Via `SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY` when enabled | | Table sampling | Yes | - | ### Query history Snowflake query history reads aggregated query-history templates from `SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY` and feeds the same unified staged artifact shape as Postgres and BigQuery. ```yaml context: queryHistory: enabled: true windowDays: 90 minExecutions: 5 filters: dropTrivialProbes: true serviceAccounts: patterns: ['^svc_'] mode: exclude redactionPatterns: [] ``` ### Dialect notes - All identifiers are uppercase by default (case-insensitive matching) - Connection context set per query (`USE ROLE`, `USE WAREHOUSE`, `USE DATABASE`, `USE SCHEMA`) - Parameter binding uses positional `?` placeholders - Date values normalized to ISO 8601 strings --- ## BigQuery Authenticates via GCP service account credentials. Supports multi-dataset scanning and query-history configuration for `INFORMATION_SCHEMA.JOBS_BY_PROJECT`. ### Connection config ```yaml title="ktx.yaml" connections: my-bigquery: driver: bigquery credentials_json: file:~/.config/gcloud/bq-service-account.json dataset_id: analytics location: US ``` For multiple datasets: ```yaml dataset_ids: - analytics - marketing - finance ``` BigQuery dataset scope is stored in `connections..dataset_ids`. Interactive setup discovers datasets from credentials plus location, then writes the chosen dataset ids as the scan scope. ### Cross-project datasets To introspect a dataset hosted in a **different project** than the one your credentials bill to — for example Google's `bigquery-public-data`, a partner's shared project, or an organization's central data project — qualify the entry as `project.dataset`: ```yaml title="ktx.yaml" connections: public-bq: driver: bigquery credentials_json: file:~/.config/gcloud/bq-service-account.json location: US dataset_ids: - bigquery-public-data.austin_311 - bigquery-public-data.census_bureau_usa - analytics ``` **ktx** introspects each dataset in its host project while every query job still bills to the `project_id` inside your `credentials_json`. A bare `dataset` entry (no prefix) is scanned in your own project, exactly as before. A single connection may mix datasets from several projects, and two projects may host datasets with the same name without colliding. Interactive setup does not enumerate datasets in projects your credentials don't own, so hand-write `project.dataset` entries for foreign datasets. The wizard's table picker also only lists datasets in your connection's `location` region; this affects table selection only — ingest and `discover_data` introspect a cross-project dataset regardless of region. ### Authentication | Method | Config | |--------|--------| | Service account JSON | `credentials_json: file:/path/to/key.json` | | Environment variable | `credentials_json: env:BIGQUERY_CREDENTIALS_JSON` | The project ID is extracted automatically from the service account JSON file. If you set `project_id` in `ktx.yaml`, **ktx** treats it as local descriptor and mapping metadata. The BigQuery connector still authenticates with the `project_id` inside `credentials_json`. ### Features | Feature | Supported | Notes | |---------|-----------|-------| | Tables & views | Yes | Including materialized views and external tables | | Primary keys | Yes | Via `INFORMATION_SCHEMA` table constraints when declared | | Foreign keys | No | Not available in BigQuery | | Row count estimates | Yes | From table metadata | | Column statistics | No | - | | Query history | Yes | Via region-scoped `INFORMATION_SCHEMA.JOBS_BY_PROJECT` when enabled | | Table sampling | Yes | - | ### Query history BigQuery query history reads aggregated query-history templates from region-scoped `INFORMATION_SCHEMA.JOBS_BY_PROJECT` and feeds the same unified staged artifact shape as Postgres and Snowflake. ```yaml context: queryHistory: enabled: true windowDays: 90 minExecutions: 5 filters: dropTrivialProbes: true serviceAccounts: patterns: ['@bot\\.'] mode: exclude redactionPatterns: [] ``` ### Dialect notes - Parameter binding uses named `@param` syntax - Arrays flattened to comma-separated strings in results - Location specified at query execution time - Supports the `max_bytes_billed` limit from `ktx.yaml`; the shared `query_timeout_ms` field maps to the query job's `jobTimeoutMs` --- ## MySQL Standard MySQL/MariaDB connector with full foreign key support and schema introspection. ### Connection config ```yaml title="ktx.yaml" connections: my-mysql: driver: mysql url: env:MYSQL_DATABASE_URL ``` MySQL supports selecting one or more databases during `ktx setup`. The selected database scope is stored in `connections..schemas`, and `ktx scan` reads exactly those databases. Or with individual fields: ```yaml title="ktx.yaml" connections: my-mysql: driver: mysql host: mysql.internal port: 3306 database: analytics username: ktx_reader password: env:MYSQL_PASSWORD ssl: true ``` ### Authentication | Method | Config | |--------|--------| | Password | `password: env:MYSQL_PASSWORD` or `password: file:/path/to/secret` | | SSL | `ssl: true` or `ssl: { rejectUnauthorized: false }` | | URL parameters | `?ssl=true` or `?sslmode=required` in connection URL | ### Features | Feature | Supported | Notes | |---------|-----------|-------| | Tables & views | Yes | Via `INFORMATION_SCHEMA.TABLES` | | Primary keys | Yes | Via `KEY_COLUMN_USAGE` | | Foreign keys | Yes | Via `REFERENTIAL_CONSTRAINTS` | | Row count estimates | Yes | From `TABLE_ROWS` (InnoDB estimate) | | Column statistics | No | - | | Query history | No | - | | Table sampling | Yes | Uses `RAND()` filter | ### Dialect notes - Parameter binding uses positional `?` placeholders - Uses `LIMIT X OFFSET Y` for pagination - Multi-database scanning uses `schemas` as the selected database list - Supports 20+ MySQL types including `enum`, `json`, `datetime`, `decimal` - Table comments extracted with InnoDB metadata prefix stripping --- ## ClickHouse Connects to ClickHouse over HTTP. Supports table and column introspection across one or more selected databases. ### Connection config ```yaml title="ktx.yaml" connections: my-clickhouse: driver: clickhouse url: env:CLICKHOUSE_DATABASE_URL database: analytics ``` For multiple databases: ```yaml databases: - analytics - mart ``` ClickHouse supports selecting one or more databases during `ktx setup`. The selected scan scope is stored in `connections..databases`. The single `database` field remains the connection default for raw SQL and `ktx sql`. ### Authentication | Method | Config | |--------|--------| | URL | `url: env:CLICKHOUSE_DATABASE_URL` | | Password | `password: env:CLICKHOUSE_PASSWORD` or `password: file:/path/to/secret` | ### Features | Feature | Supported | Notes | |---------|-----------|-------| | Tables & views | Yes | Via `system.tables` | | Primary keys | No | Not exposed as relational constraints | | Foreign keys | No | Not available in ClickHouse | | Row count estimates | Yes | From ClickHouse metadata where available | | Column statistics | No | - | | Query history | No | - | | Table sampling | Yes | Uses ClickHouse sampling syntax when supported | ### Dialect notes - Parameter binding uses named placeholders - The `database` field sets the default database for SQL execution - The `databases` array controls the scan scope --- ## SQL Server Connects to Microsoft SQL Server and Azure SQL. Supports multi-schema scanning with `dbo` as the default schema. ### Connection config ```yaml title="ktx.yaml" connections: my-sqlserver: driver: sqlserver url: env:SQLSERVER_DATABASE_URL ``` Or with individual fields: ```yaml title="ktx.yaml" connections: my-sqlserver: driver: sqlserver host: sql.internal port: 1433 database: Analytics username: ktx_reader password: env:MSSQL_PASSWORD schema: dbo trustServerCertificate: true ``` For multiple schemas: ```yaml schemas: - dbo - analytics - staging ``` ### Authentication | Method | Config | |--------|--------| | SQL Server auth | `username` + `password` | | Encrypted connection | Always enabled, `trustServerCertificate: true` for self-signed | ### Features | Feature | Supported | Notes | |---------|-----------|-------| | Tables & views | Yes | Via `INFORMATION_SCHEMA.TABLES` | | Primary keys | Yes | Via `TABLE_CONSTRAINTS` and `KEY_COLUMN_USAGE` | | Foreign keys | Yes | Via `REFERENTIAL_CONSTRAINTS` | | Row count estimates | Yes | Via `sys.dm_db_partition_stats` | | Column statistics | No | - | | Query history | No | - | | Table sampling | Yes | - | | Nested analysis | No | - | ### Dialect notes - Parameter binding uses `@paramName` syntax - Row limiting uses `SELECT TOP N * FROM (query) AS ktx_query_result` - Encryption is always required; certificate validation is optional - Multi-schema support with per-schema isolation --- ## SQLite File-based connector using `better-sqlite3`. Ideal for local development, embedded analytics, or testing. ### Connection config ```yaml title="ktx.yaml" connections: my-sqlite: driver: sqlite path: ./data/warehouse.sqlite ``` Path supports multiple formats: ```yaml # Relative path (resolved against project directory) path: ./warehouse.sqlite # Absolute path path: /var/data/analytics.db # Home directory expansion path: ~/data/warehouse.sqlite # Environment variable path: env:SQLITE_DB_PATH # URL format url: sqlite:///path/to/db.sqlite ``` ### Authentication No authentication required - SQLite is file-based. The file must be readable by the process running **ktx**. ### Features | Feature | Supported | Notes | |---------|-----------|-------| | Tables & views | Yes | Via `sqlite_master` | | Primary keys | Yes | Via `PRAGMA table_info()` | | Foreign keys | Yes | Via `PRAGMA foreign_key_list()` (requires `PRAGMA foreign_keys = ON`) | | Row count estimates | Yes | Exact count via `SELECT COUNT(*)` | | Column statistics | No | - | | Query history | No | - | | Table sampling | Yes | - | | Nested analysis | No | - | ### Dialect notes - Synchronous query execution (no connection pooling) - Parameter binding uses `:paramName` syntax - Uses `LIMIT X OFFSET Y` for pagination - SQLite type affinity system: `TEXT`, `NUMERIC`, `INTEGER`, `REAL`, `BLOB` - Foreign key enforcement requires explicit `PRAGMA foreign_keys = ON` - Database file must exist before `ktx connection test` or ingest runs --- ## MongoDB Connects to MongoDB as a primary context source. **ktx** treats each collection as a table and each inferred top-level field as a column. MongoDB is a non-SQL source: `ktx sql` and semantic-layer metric compilation do not apply to a MongoDB connection, but its collections still flow through `ktx ingest`, descriptions, and relationship discovery. ### Connection config ```yaml title="ktx.yaml" connections: mongo-prod: driver: mongodb url: env:MONGO_URL databases: [app] enabled_tables: [app.users, app.orders] # optional collection allowlist sample_size: 1000 # order_by: createdAt # only when _id is not an ObjectId ``` Standard `mongodb://` and `mongodb+srv://` connection strings are supported, including TLS and MongoDB Atlas — pass the full connection string (with its query parameters) as `url`. The `databases` list selects which databases to introspect; if omitted, **ktx** uses the database in the URL path. `ktx setup` also offers MongoDB and stores the selected databases under `connections..databases`. ### Authentication | Method | Config | |--------|--------| | Connection string | `url: env:MONGO_URL` or `url: file:/path/to/secret` | | Atlas / TLS | Use a `mongodb+srv://` URL with the credentials and TLS options Atlas provides | ### Schema inference MongoDB has no fixed schema, so **ktx** infers one by sampling the most recent `sample_size` documents per collection (default 1000), sorted by `_id` descending. Because an ObjectId embeds its creation time, this captures the collection's current shape with zero configuration. When `_id` is not an ObjectId (custom string or UUID keys), set `order_by` to a timestamp field such as `createdAt` so "most recent" is well-defined. A custom `order_by` field should be indexed — an unindexed sort hits MongoDB's in-memory sort limit and fails on large collections (`_id`, the default, is always indexed). For each top-level field, **ktx** unions the BSON types seen and derives nullability from how often the field is present: - Scalar BSON types map to `string`, `number`, `time`, or `boolean` - A field seen with more than one type is recorded as `mixed` and treated as a string - Sub-documents and arrays become a single opaque `json` column (no dotted-path columns); their sampled values are stringified, not faithfully serialized - `_id` is the primary key ### Features | Feature | Supported | Notes | |---------|-----------|-------| | Collections (as tables) | Yes | Via `listCollections`; `system.*` collections are excluded | | Primary keys | Yes | `_id` | | Foreign keys | No | MongoDB has no formal foreign keys | | Row count estimates | Yes | Via `estimatedDocumentCount` | | Column statistics | No | - | | Query history | No | - | | Table sampling | Yes | Reads the most recent documents | | Nested analysis | Yes | Sub-documents and arrays modeled as opaque `json` | | Read-only SQL (`ktx sql`) | No | MongoDB is not a SQL source | ### Dialect notes - Strictly read-only: the connector only issues `find`, `listCollections`, `estimatedDocumentCount`, and read aggregations - Sampling rides the `_id` index and uses a server-side time limit so large collections do not stall a run; a custom `order_by` must be indexed for the same guarantee - `sample_size` trades inference coverage for speed; raise it for collections with highly variable documents ## Common errors | Error or symptom | Likely cause | Recovery | |------------------|--------------|----------| | Connection URL appears in git diff | A literal credential URL was written to `ktx.yaml` | Replace it with `env:NAME` or `file:/path/to/secret` and rotate exposed credentials | | Database ingest returns no tables | Schema, database, or project filter is wrong, or the user lacks metadata permissions | Verify the schema list and grant metadata read permissions | | Query history is empty | Query history extension or warehouse history view is unavailable | Enable the warehouse-specific history feature, then rerun `ktx ingest --query-history` or `ktx setup` | | Column statistics are missing | Connector cannot access stats tables or the warehouse does not expose them | Grant stats permissions where supported; otherwise rely on schema-level context without column statistics | | Semantic query execution fails | Connection is missing, unreachable, or query execution is disabled | Run `ktx connection test ` and check the `ktx sl query` flags |