mirror of
https://github.com/Kaelio/ktx.git
synced 2026-06-07 07:55:13 +02:00
643 lines
25 KiB
Text
---
title: ktx.yaml reference
description: Every top-level block of the ktx.yaml project file, what it controls, accepted values, and defaults.
---

`ktx.yaml` is the single source of truth for a **ktx** project. The file lives
at the project root and tells **ktx** which databases to read, which context
sources to ingest, which LLM and embedding providers to use, how to store
state, and how the scan and agent layers behave. Every block below is optional
and falls back to a documented default, so a minimal `ktx.yaml` is just one
connection.

This page is the canonical reference for the file. For the guided flow that
writes it, see [`ktx setup`](/docs/cli-reference/ktx-setup).

## Where blocks fit

`ktx.yaml` has eight top-level keys. They group into three layers: what to
read, how to think, and where to put the results.

<figure
  className="not-prose my-8 overflow-hidden rounded-lg border border-fd-border bg-fd-card shadow-sm"
  aria-label="ktx.yaml block layout"
>
  <div className="border-b border-fd-border bg-fd-muted/35 px-4 py-3">
    <p className="text-[11px] font-semibold uppercase tracking-wide text-fd-muted-foreground">
      ktx.yaml at a glance
    </p>
    <p className="mt-1 text-sm leading-6 text-fd-muted-foreground">
      Inputs flow left to right. Storage and memory persist the result.
    </p>
  </div>
  <div className="grid gap-3 p-4 md:grid-cols-[1.1fr_1.1fr_1fr]">
    <div className="rounded-md border border-fd-border bg-fd-background p-4">
      <p className="text-[11px] font-semibold uppercase tracking-wide text-fd-muted-foreground">
        Inputs
      </p>
      <ul className="mt-3 space-y-2 text-sm leading-6 text-fd-foreground">
        <li><code className="text-[13px] font-semibold">connections</code> - warehouses, BI tools, dbt, Notion</li>
        <li><code className="text-[13px] font-semibold">setup</code> - which connections are primary databases</li>
      </ul>
    </div>
    <div className="rounded-md border-2 border-fd-primary bg-fd-background p-4">
      <p className="text-[11px] font-semibold uppercase tracking-wide text-fd-primary">
        Compute
      </p>
      <ul className="mt-3 space-y-2 text-sm leading-6 text-fd-foreground">
        <li><code className="text-[13px] font-semibold">llm</code> - provider, models, prompt cache</li>
        <li><code className="text-[13px] font-semibold">ingest</code> - adapters, embeddings, work units</li>
        <li><code className="text-[13px] font-semibold">scan</code> - enrichment, relationships</li>
        <li><code className="text-[13px] font-semibold">agent</code> - research-agent feature flags</li>
      </ul>
    </div>
    <div className="rounded-md border border-fd-border bg-fd-background p-4">
      <p className="text-[11px] font-semibold uppercase tracking-wide text-fd-muted-foreground">
        Persistence
      </p>
      <ul className="mt-3 space-y-2 text-sm leading-6 text-fd-foreground">
        <li><code className="text-[13px] font-semibold">storage</code> - state and search backends, git policy</li>
        <li><code className="text-[13px] font-semibold">memory</code> - agent memory commit policy</li>
      </ul>
    </div>
  </div>
</figure>

## Minimal config

A working `ktx.yaml` needs one entry in `connections`. Everything else accepts
defaults. The example below is enough for `ktx ingest warehouse` to run a fast
schema scan against a local Postgres.

```yaml
connections:
  warehouse:
    driver: postgres
    url: env:DATABASE_URL
```

## Secret references

Several fields accept either a literal value or a reference. References keep
secrets out of `ktx.yaml` so the file can stay in git.

| Form | Resolved to | Used for |
|------|-------------|----------|
| `env:VAR_NAME` | The value of the environment variable `VAR_NAME` at runtime | API keys, connection URLs, OAuth secrets |
| `file:/abs/path` or `file:~/path` | The first line of the referenced file, with `~` expanded to your home directory | Long-lived credentials kept under `.ktx/secrets/` |
| Literal string | Used as-is | Non-secret values such as `base_url` |

References work in: warehouse `url`, Metabase `api_key` / `api_key_ref`, Looker
`client_secret` / `client_secret_ref`, Notion / dbt / LookML / MetricFlow
`auth_token` / `auth_token_ref`, and any `api_key` under the `llm` and
`ingest.embeddings` blocks.
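
As a minimal sketch of the three forms side by side, the fragment below uses only fields documented on this page. The environment variable names and the file path under the home directory are hypothetical placeholders, not values **ktx** ships with.

```yaml
connections:
  warehouse:
    driver: postgres
    # env: reference, read from the DATABASE_URL environment variable at runtime
    url: env:DATABASE_URL
  metabase:
    driver: metabase
    # literal string, used as-is because the instance URL is not a secret
    api_url: https://metabase.example.com
    # file: reference, the first line of the file is used (hypothetical path)
    api_key_ref: file:~/ktx-secrets/metabase-api-key
```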

## `connections`

The `connections` block is a map from a connection ID you choose to the
configuration for that connector. The connection ID is what every other part
of **ktx** uses to address a connector - `ktx ingest warehouse`,
`ktx sql --connection warehouse`, the semantic-layer path
`semantic-layer/warehouse/`, and so on.

Each entry is discriminated by the `driver` field. Warehouse drivers and
context-source drivers share the map.

| Driver | Kind | Required fields | Common optional fields |
|--------|------|-----------------|------------------------|
| `postgres` / `postgresql` | Warehouse | `driver` | `url`, `enabled_tables`, `historicSql`, `context.queryHistory` |
| `mysql` | Warehouse | `driver` | `url`, `enabled_tables` |
| `sqlite` | Warehouse | `driver` | `url` or `path`, `enabled_tables` |
| `sqlserver` | Warehouse | `driver` | `url`, `enabled_tables` |
| `bigquery` | Warehouse | `driver` | `credentials_json`, `dataset_ids`, `enabled_tables`, `historicSql` |
| `snowflake` | Warehouse | `driver` | `schema_names`, `enabled_tables`, `historicSql` |
| `clickhouse` | Warehouse | `driver` | `url`, `database`, `databases`, `enabled_tables` |
| `metabase` | Context source | `driver`, `api_url` | `api_key_ref`, `mappings` |
| `looker` | Context source | `driver`, `base_url`, `client_id` | `client_secret_ref`, `mappings` |
| `lookml` | Context source | `driver`, `repoUrl` | `branch`, `path`, `auth_token_ref`, `mappings` |
| `dbt` | Context source | `driver`, one of `source_dir` or `repo_url` | `branch`, `path`, `profiles_path`, `target`, `project_name` |
| `metricflow` | Context source | `driver`, `metricflow.repoUrl` | `metricflow.branch`, `metricflow.path`, `metricflow.auth_token_ref` |
| `notion` | Context source | `driver`, `auth_token_ref` | `crawl_mode`, `root_*_ids`, `max_*_per_run` |

### Warehouse drivers

Warehouse connections are open objects: the listed fields are validated, and
any other field is preserved and passed through to the connector. Use
`enabled_tables` to scope deep ingest to a specific list of
`schema.table` names - useful for smoke tests.

```yaml
connections:
  warehouse:
    driver: postgres
    url: env:DATABASE_URL
    enabled_tables:
      - public.orders
      - public.customers
```

Connector-specific scope fields let setup and scan use the same warehouse
boundary:

```yaml
connections:
  mysql-warehouse:
    driver: mysql
    url: env:MYSQL_URL
    schemas: [analytics, mart]
  clickhouse-warehouse:
    driver: clickhouse
    url: env:CLICKHOUSE_URL
    database: analytics
    databases: [analytics, mart]
  bigquery-warehouse:
    driver: bigquery
    credentials_json: file:./service-account.json
    location: US
    dataset_ids: [analytics, mart]
```

For Snowflake connections, set `maxSessions` when deep ingest needs more or
fewer concurrent warehouse sessions. The default is `4`. This caps all
concurrent Snowflake SQL work for that connector instance, including schema
introspection, table sampling, relationship profiling, relationship
validation, and read-only SQL execution.

For Postgres, BigQuery, and Snowflake, `historicSql` and `context.queryHistory`
toggle query-history ingest. The shape is connector-specific; the setup wizard
writes these fields when you pass `--enable-query-history`.

```yaml
connections:
  warehouse:
    driver: postgres
    url: env:DATABASE_URL
    context:
      queryHistory:
        enabled: true
        minExecutions: 5
```
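
For the Snowflake session cap described in this section, a connection might look like the following sketch. It shows only fields named on this page: account and authentication fields are omitted because this page does not list them, the schema names are placeholders, and writing `schema_names` as a YAML list is an assumption.

```yaml
connections:
  snowflake-warehouse:
    driver: snowflake
    # account and authentication fields omitted (not covered on this page)
    schema_names: [ANALYTICS, MART]   # placeholder schemas; list form is assumed
    maxSessions: 8                    # raise the cap from the default of 4
```

Because `maxSessions` caps all concurrent SQL for the connector instance, raising scan-side concurrency above it would not be expected to add database parallelism.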

### Metabase

```yaml
connections:
  metabase:
    driver: metabase
    api_url: https://metabase.example.com
    api_key_ref: env:METABASE_API_KEY
    mappings:
      databaseMappings:
        "1": warehouse # Metabase DB id "1" -> ktx connection "warehouse"
      syncMode: ALL # ALL | ONLY | EXCEPT
```

| Field | Purpose |
|-------|---------|
| `api_url` | Metabase instance URL. Required. |
| `api_key` | Literal token. Prefer `api_key_ref`. |
| `api_key_ref` | Reference to the token (`env:` or `file:`). |
| `mappings.databaseMappings` | Map of Metabase database ID (positive-integer string) to a `ktx` warehouse connection ID. `null` explicitly unmaps. |
| `mappings.syncEnabled` | Per-database boolean toggle, keyed by Metabase DB ID. |
| `mappings.syncMode` | `ALL` (all mapped DBs), `ONLY` (those with `syncEnabled: true`), or `EXCEPT` (skip those with `syncEnabled: true`). Default `ALL`. |
| `mappings.selections.collections` / `items` | Optional Metabase collection or item IDs to scope ingest. |
| `mappings.defaultTagNames` | Default tag names attached to ingested artifacts. |
| `network_proxy` / `networkProxy` | Optional proxy configuration. |
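
To illustrate how `syncMode` and `syncEnabled` interact, here is a sketch with two mapped databases where only one is ingested. The Metabase database IDs and the second connection ID (`reporting`) are hypothetical.

```yaml
connections:
  metabase:
    driver: metabase
    api_url: https://metabase.example.com
    api_key_ref: env:METABASE_API_KEY
    mappings:
      databaseMappings:
        "1": warehouse
        "2": reporting   # hypothetical second ktx warehouse connection
      syncEnabled:
        "1": true
        "2": false
      syncMode: ONLY     # ingest only DBs with syncEnabled: true, here just "1"
```

Going by the table above, switching `syncMode` to `EXCEPT` with the same toggles would skip database `"1"` instead.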

### Looker

```yaml
connections:
  looker:
    driver: looker
    base_url: https://looker.example.com
    client_id: ktx-integration
    client_secret_ref: env:LOOKER_CLIENT_SECRET
    mappings:
      connectionMappings:
        prod_warehouse: warehouse
```

| Field | Purpose |
|-------|---------|
| `base_url` | Looker instance URL. Required. |
| `client_id` | Looker OAuth client ID. Required. |
| `client_secret` / `client_secret_ref` | Literal secret or reference. Prefer the `_ref`. |
| `mappings.connectionMappings` | Map of Looker connection name to `ktx` warehouse connection ID. |

### LookML

```yaml
connections:
  lookml:
    driver: lookml
    repoUrl: git@github.com:org/lookml.git
    branch: main
    path: lookml/
    auth_token_ref: env:GITHUB_TOKEN
    mappings:
      expectedLookerConnectionName: prod_warehouse
```

| Field | Purpose |
|-------|---------|
| `repoUrl` | Git URL of the LookML project (`https`, `ssh`, or `file:`). Required. Camel-case by convention. |
| `branch` | Branch to fetch. Defaults to `main`. |
| `path` | Subdirectory inside the repo when LookML lives in a monorepo. |
| `auth_token_ref` | Reference to a Git auth token for private repos. |
| `mappings.expectedLookerConnectionName` | Looker connection name LookML models must declare. Mismatches block semantic-layer writes during ingest. |

### dbt

```yaml
connections:
  dbt_main:
    driver: dbt
    source_dir: ../dbt-project
    target: prod
```

| Field | Purpose |
|-------|---------|
| `source_dir` | Absolute or project-relative path to a local dbt project. |
| `repo_url` | Git URL of the dbt project. Use this instead of `source_dir` when fetching remotely. |
| `branch` | Branch to fetch when using `repo_url`. |
| `path` | Subdirectory inside the repo. |
| `auth_token_ref` | Git auth reference for private repos. |
| `profiles_path` | Override path to `profiles.yml`. |
| `target` | dbt target name (for example `dev`, `prod`). |
| `project_name` | Override the auto-detected dbt project name. |
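
When the dbt project lives in a remote repository rather than on disk, the same connection can be expressed with `repo_url` in place of `source_dir`. This is a sketch built from the fields in the table above; the connection ID, repository URL, subdirectory, and token variable are hypothetical.

```yaml
connections:
  dbt_remote:
    driver: dbt
    repo_url: https://github.com/org/dbt-project.git   # hypothetical repository
    branch: main
    path: analytics/                                    # hypothetical subdirectory
    auth_token_ref: env:GITHUB_TOKEN
    target: prod
```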

### MetricFlow

```yaml
connections:
  metricflow:
    driver: metricflow
    metricflow:
      repoUrl: git@github.com:org/sl-config.git
      branch: main
      path: semantic_models/
      auth_token_ref: env:GITHUB_TOKEN
```

The MetricFlow connector wraps its fields in a nested `metricflow` block.
`repoUrl` is required; the rest mirrors the LookML / dbt git fields.

### Notion

```yaml
connections:
  notion:
    driver: notion
    auth_token_ref: env:NOTION_TOKEN
    crawl_mode: selected_roots
    root_database_ids:
      - 9f30c2c4d4f24a8d9a8d8e2c1b2a3d4e
    max_pages_per_run: 500
    max_knowledge_creates_per_run: 5
    max_knowledge_updates_per_run: 25
```

| Field | Purpose |
|-------|---------|
| `auth_token` / `auth_token_ref` | Notion integration token. Prefer the `_ref`. |
| `crawl_mode` | `selected_roots` (requires at least one `root_*_ids`) or `all_accessible`. |
| `root_page_ids`, `root_database_ids`, `root_data_source_ids` | Notion IDs to crawl when `crawl_mode` is `selected_roots`. |
| `max_pages_per_run` | Max pages fetched per ingest run (1-10000). |
| `max_knowledge_creates_per_run` | Max new wiki pages created per run (0-25). |
| `max_knowledge_updates_per_run` | Max existing wiki pages updated per run (0-100). |
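
For comparison, a sketch of the other crawl mode: `all_accessible` with tighter per-run limits. The limit values are illustrative picks from the documented ranges, and the assumption that no `root_*_ids` are needed in this mode follows from the table stating that only `selected_roots` requires them.

```yaml
connections:
  notion:
    driver: notion
    auth_token_ref: env:NOTION_TOKEN
    crawl_mode: all_accessible           # assumed to need no root_*_ids
    max_pages_per_run: 200               # within the documented 1-10000 range
    max_knowledge_creates_per_run: 0     # lower bound of the documented 0-25 range
    max_knowledge_updates_per_run: 10    # within the documented 0-100 range
```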

## `setup`

Captured by the setup wizard. The only field **ktx** still reads is
`database_connection_ids`, which tells the ingest layer which entries in
`connections` are primary warehouses. When omitted, every warehouse-typed
connection is treated as primary.

```yaml
setup:
  database_connection_ids:
    - warehouse
```

| Field | Type | Default | Purpose |
|-------|------|---------|---------|
| `database_connection_ids` | `string[]` | `[]` | IDs in `connections` treated as primary warehouses by ingest and scan. |

## `storage`

`storage` controls where **ktx** keeps its own state and search index, and how
state changes are committed. Defaults work for a single-user local project.

```yaml
storage:
  state: sqlite # sqlite | postgres
  search: sqlite-fts5 # sqlite-fts5 | postgres-hybrid
  git:
    auto_commit: true
    author: "ktx <ktx@example.com>"
```

| Field | Type | Default | Purpose |
|-------|------|---------|---------|
| `state` | `sqlite` \| `postgres` | `sqlite` | Backend for ktx state. `sqlite` uses `.ktx/db.sqlite`; `postgres` expects a configured Postgres connection. |
| `search` | `sqlite-fts5` \| `postgres-hybrid` | `sqlite-fts5` | Backend for search indexes. `postgres-hybrid` combines lexical and vector search in Postgres. |
| `git.auto_commit` | `boolean` | `true` | When `true`, ktx auto-commits changes to the git-backed state store. |
| `git.author` | `string` | `ktx <ktx@example.com>` | Git author identity for auto-commits. Standard `Name <email>` form. |

## `llm`

The `llm` block selects the LLM provider, lets you override the model used for
specific roles, and tunes prompt caching.

```yaml
llm:
  provider:
    backend: anthropic
    anthropic:
      api_key: env:ANTHROPIC_API_KEY
  models:
    default: claude-sonnet-4-6
    triage: claude-haiku-4-5
  promptCaching:
    enabled: true
    systemTtl: 1h
    toolsTtl: 1h
    historyTtl: 5m
    vertexFallbackTo5m: true
```

### Provider

| Field | Type | Default | Purpose |
|-------|------|---------|---------|
| `provider.backend` | `none` \| `anthropic` \| `vertex` \| `gateway` \| `claude-code` | `none` | Selected backend. `none` disables LLM features. `claude-code` uses the local Claude Code session and needs no API key. |
| `provider.anthropic.api_key` | `string` | - | Anthropic API key. Required when `backend: anthropic`. Accepts `env:` or `file:` references. |
| `provider.anthropic.base_url` | `string` | - | Override the Anthropic API base URL (proxy, self-hosted gateway). |
| `provider.gateway.api_key` / `base_url` | `string` | - | Credentials for an AI Gateway provider. Required when `backend: gateway`. |
| `provider.vertex.project` | `string` | - | Google Cloud project ID hosting the Vertex AI endpoint. |
| `provider.vertex.location` | `string` | - | Vertex AI region (for example `us-east5`). Required when the `vertex` block is present. |

### Model roles

`models` overrides the per-role model. Keys are fixed; values are
provider-specific model identifiers.

| Role | Used for |
|------|----------|
| `default` | Catch-all when no role-specific override exists. |
| `triage` | Cheap routing decisions during ingest and scan. |
| `candidateExtraction` | Extracting relationship and entity candidates from data. |
| `curator` | Reconciling proposed context against accepted files. |
| `reconcile` | Resolving conflicts between incoming and existing context. |
| `repair` | Fixing invalid generated YAML before write. |

### Prompt caching

| Field | Type | Default | Purpose |
|-------|------|---------|---------|
| `promptCaching.enabled` | `boolean` | backend default | Master switch for Anthropic-style prompt caching. |
| `promptCaching.systemTtl` | `5m` \| `1h` | backend default | Cache TTL for the system prompt segment. |
| `promptCaching.toolsTtl` | `5m` \| `1h` | backend default | Cache TTL for the tools/schema segment. |
| `promptCaching.historyTtl` | `5m` \| `1h` | backend default | Cache TTL for conversation-history breakpoints. |
| `promptCaching.vertexFallbackTo5m` | `boolean` | `false` | When `true`, downgrade `1h` TTLs to `5m` on Vertex, which does not support `1h` caching. |
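
A Vertex-backed setup ties the provider fields and the caching fallback together. The fragment below is a sketch using only fields from the tables above; the project ID is a hypothetical placeholder.

```yaml
llm:
  provider:
    backend: vertex
    vertex:
      project: my-gcp-project     # hypothetical Google Cloud project ID
      location: us-east5          # required whenever the vertex block is present
  promptCaching:
    enabled: true
    systemTtl: 1h
    vertexFallbackTo5m: true      # downgrade the 1h TTL to 5m on Vertex
```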

## `ingest`

`ingest` controls how **ktx** builds context from your stack. It lists the
adapters to run, the embedding provider used when adapters embed documents,
and the concurrency and failure policy for work units.

```yaml
ingest:
  adapters:
    - live-database
    - dbt
    - metabase
  embeddings:
    backend: openai
    model: text-embedding-3-small
    dimensions: 1536
    openai:
      api_key: env:OPENAI_API_KEY
  workUnits:
    stepBudget: 40
    maxConcurrency: 2
    failureMode: continue
```

### Adapters

`adapters` is a list of adapter IDs that should run. Each ID matches a
connector that **ktx** ships locally:

| Adapter ID | What it ingests |
|------------|-----------------|
| `live-database` | Live warehouse introspection (schemas, tables, columns, samples). |
| `historic-sql` | Query history from Postgres `pg_stat_statements`, BigQuery `INFORMATION_SCHEMA.JOBS`, or Snowflake query history. |
| `dbt` | dbt manifest models, sources, tests, and exposures. |
| `metricflow` | MetricFlow / Semantic Layer models and metrics. |
| `lookml` | LookML projects (models, explores, views, joins). |
| `looker` | Looker dashboards and looks via the API. |
| `metabase` | Metabase cards, dashboards, and database mappings. |
| `notion` | Notion pages and databases for wiki context. |
| `fake` | Test/demo adapter. Useful in fixtures. |

### Embeddings

The `embeddings` block can also appear inside `scan.enrichment`; that override
wins when present.

| Field | Type | Default | Purpose |
|-------|------|---------|---------|
| `backend` | `none` \| `openai` \| `sentence-transformers` | `none` | Embedding provider. `none` disables embeddings. |
| `model` | `string` | - | Provider model ID, for example `text-embedding-3-small` or `all-MiniLM-L6-v2`. |
| `dimensions` | `int > 0` | `8` | Vector size. Default `8` is a placeholder that's only valid with `backend: none`. Set explicitly to match your model (1536 for `text-embedding-3-small`, 384 for `all-MiniLM-L6-v2`). |
| `openai.api_key` / `base_url` | `string` | - | OpenAI credentials. Required when `backend: openai`. |
| `sentenceTransformers.base_url` | `string` | `""` | URL of the sentence-transformers server. Empty when ktx manages the local daemon for you. |
| `sentenceTransformers.pathPrefix` | `string` | - | Optional URL path prefix prepended to embedding requests. |
| `batchSize` | `int > 0` | provider default | Texts per embedding API call. |
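
For a local embedding setup, the same block might be written as in this sketch. It uses the model and dimension pairing given in the table above; leaving `base_url` empty reflects the documented default.

```yaml
ingest:
  embeddings:
    backend: sentence-transformers
    model: all-MiniLM-L6-v2
    dimensions: 384              # must match the model, per the table above
    sentenceTransformers:
      base_url: ""               # empty: ktx manages the local daemon
```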

### Work units

A work unit is one unit of agent-driven ingest work (for example one table or
one Metabase question). These knobs bound how long it runs and how the run
handles failures.

| Field | Type | Default | Purpose |
|-------|------|---------|---------|
| `workUnits.stepBudget` | `int > 0` | `40` | Maximum agent steps allowed per work unit before it's force-terminated. |
| `workUnits.maxConcurrency` | `int > 0` | `1` | How many work units run in parallel. |
| `workUnits.failureMode` | `abort` \| `continue` | `continue` | `abort` stops the whole ingest run on the first failure; `continue` records it and keeps going. |

## `scan`

`scan` configures how schema-level inputs become structured context:
column-level enrichment and inferred relationships between tables.

```yaml
scan:
  enrichment:
    mode: llm # none | deterministic | llm
  relationships:
    enabled: true
    llmProposals: true
    validationRequiredForManifest: true
    acceptThreshold: 0.85
    reviewThreshold: 0.55
    maxLlmTablesPerBatch: 40
    maxCandidatesPerColumn: 25
    profileSampleRows: 10000
    profileConcurrency: 4
    validationConcurrency: 4
    validationBudget: all
```

### Enrichment

| Field | Type | Default | Purpose |
|-------|------|---------|---------|
| `enrichment.mode` | `none` \| `deterministic` \| `llm` | `none` | How columns and tables get described. `deterministic` uses local heuristics; `llm` calls the configured provider. |
| `enrichment.embeddings` | embedding block | - | Optional override for enrichment-time vectorization. Falls back to `ingest.embeddings`. |

### Relationships

The relationship discovery step proposes joins between tables, scores them,
and optionally validates each one against the database before writing it to
the manifest.

| Field | Type | Default | Purpose |
|-------|------|---------|---------|
| `relationships.enabled` | `boolean` | `true` | Master switch for relationship discovery. |
| `relationships.llmProposals` | `boolean` | `true` | When `true`, propose relationships using the LLM in addition to deterministic candidates. |
| `relationships.validationRequiredForManifest` | `boolean` | `true` | When `true`, only proposals that pass database-side validation reach the manifest. |
| `relationships.acceptThreshold` | `number 0-1` | `0.85` | Confidence at or above which a proposal is auto-accepted. |
| `relationships.reviewThreshold` | `number 0-1` | `0.55` | Confidence at or above which a proposal is surfaced for human review (but not auto-accepted). |
| `relationships.maxLlmTablesPerBatch` | `int > 0` | `40` | Max tables included in a single LLM relationship-proposal batch. |
| `relationships.maxCandidatesPerColumn` | `int > 0` | `25` | Max join partners considered per column. |
| `relationships.profileSampleRows` | `int > 0` | `10000` | Rows sampled per table when profiling values for relationship inference. |
| `relationships.profileConcurrency` | `int > 0` | `4` | Parallel relationship-profile queries against the database. For Snowflake, effective database concurrency is also bounded by the connection's `maxSessions`. |
| `relationships.validationConcurrency` | `int > 0` | `4` | Parallel relationship validation queries against the database. |
| `relationships.validationBudget` | `all` \| `int ≥ 0` | runtime default | Cap on validation queries per scan. `all` means unlimited. |
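
As a worked reading of the two thresholds: with the defaults, a proposal scored 0.90 is auto-accepted, and one scored 0.70 is surfaced for review. A stricter, LLM-free variant might look like the sketch below; the specific numbers are illustrative, not recommendations.

```yaml
scan:
  enrichment:
    mode: deterministic
  relationships:
    llmProposals: false       # deterministic candidates only
    acceptThreshold: 0.9      # 0.90 and above is auto-accepted
    reviewThreshold: 0.6      # 0.60 up to 0.90 is surfaced for review
    validationBudget: 200     # cap validation queries per scan
```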

## `agent`

`agent` carries feature flags for **ktx**-side agent behavior. Today the only
block is `run_research`, which gates the research agent invoked by
`ktx mcp` and CLI research tools.

```yaml
agent:
  run_research:
    enabled: true
    max_iterations: 20
    default_toolset:
      - sl_query
      - wiki_search
      - sl_read_source
```

| Field | Type | Default | Purpose |
|-------|------|---------|---------|
| `run_research.enabled` | `boolean` | `false` | Master switch for the research agent. |
| `run_research.max_iterations` | `int ≥ 0` | `20` | Maximum tool-call iterations per research run. |
| `run_research.default_toolset` | `string[]` | `[sl_query, wiki_search, sl_read_source]` | Tool identifiers exposed to the research agent. |
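
To narrow what the research agent can do, the toolset and iteration cap can be reduced together. This sketch uses a subset of the default tool identifiers listed above; the iteration value is illustrative.

```yaml
agent:
  run_research:
    enabled: true
    max_iterations: 10        # half the default of 20
    default_toolset:          # subset of the documented default toolset
      - sl_query
      - wiki_search
```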

## `memory`

`memory` controls the agent memory subsystem.

```yaml
memory:
  auto_commit: true
```

| Field | Type | Default | Purpose |
|-------|------|---------|---------|
| `auto_commit` | `boolean` | `true` | When `true`, ktx auto-commits memory updates to the git-backed store. |

## A full example

Combining the blocks above:

```yaml
connections:
  warehouse:
    driver: postgres
    url: env:DATABASE_URL
  metabase:
    driver: metabase
    api_url: https://metabase.example.com
    api_key_ref: env:METABASE_API_KEY
    mappings:
      databaseMappings:
        "1": warehouse
      syncMode: ALL
setup:
  database_connection_ids:
    - warehouse
storage:
  state: sqlite
  search: sqlite-fts5
  git:
    auto_commit: true
    author: "ktx <ktx@example.com>"
llm:
  provider:
    backend: claude-code
  models:
    default: sonnet
ingest:
  adapters:
    - live-database
    - metabase
  embeddings:
    backend: openai
    model: text-embedding-3-small
    dimensions: 1536
    openai:
      api_key: env:OPENAI_API_KEY
  workUnits:
    maxConcurrency: 2
scan:
  enrichment:
    mode: llm
  relationships:
    acceptThreshold: 0.85
    reviewThreshold: 0.55
agent:
  run_research:
    enabled: true
memory:
  auto_commit: true
```

## Validating your config

**ktx** validates `ktx.yaml` strictly: unknown keys at the top level or inside
strict blocks cause setup and CLI commands to fail with a precise path
(`scan.relationships.acceptThreshhold: Unrecognized key`). Warehouse
connections accept extra driver-specific fields, so passthrough values like
`historicSql` and `context.queryHistory` are allowed.

To re-validate without running anything else:

```bash
ktx status
```

`ktx status` parses `ktx.yaml`, surfaces validation issues, and reports which
inputs are ready.
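
For instance, the misspelled key in the error message quoted in this section would come from a fragment like the one below. It is shown only to illustrate strict validation; the threshold value is arbitrary.

```yaml
scan:
  relationships:
    # misspelled on purpose (extra "h"); rejected as an unrecognized key
    acceptThreshhold: 0.9
```

Correcting the key to `acceptThreshold` should clear that particular validation issue.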

## Related references

- [`ktx setup`](/docs/cli-reference/ktx-setup) - the guided flow that writes
  most of these fields for you.
- [`ktx status`](/docs/cli-reference/ktx-status) - readiness check for the
  current `ktx.yaml`.
- [LLM configuration](/docs/guides/llm-configuration) - provider-specific
  setup notes.
- [Primary sources](/docs/integrations/primary-sources) and
  [Context sources](/docs/integrations/context-sources) - connector-specific
  details and credentials.