---
title: ktx.yaml reference
description: Every top-level block of the ktx.yaml project file, what it controls, accepted values, and defaults.
---
`ktx.yaml` is the single source of truth for a **ktx** project. The file lives
at the project root and tells **ktx** which databases to read, which context
sources to ingest, which LLM and embedding providers to use, how to store
state, and how the scan and agent layers behave. Every block below is optional
and falls back to a documented default, so a minimal `ktx.yaml` is just one
connection.

This page is the canonical reference for the file. For the guided flow that
writes it, see [`ktx setup`](/docs/cli-reference/ktx-setup).
## Where blocks fit
`ktx.yaml` has eight top-level keys. They group into three layers: what to
read, how to think, and where to put the results.
<figure
className="not-prose my-8 overflow-hidden rounded-lg border border-fd-border bg-fd-card shadow-sm"
aria-label="ktx.yaml block layout"
>
<div className="border-b border-fd-border bg-fd-muted/35 px-4 py-3">
<p className="text-[11px] font-semibold uppercase tracking-wide text-fd-muted-foreground">
ktx.yaml at a glance
</p>
<p className="mt-1 text-sm leading-6 text-fd-muted-foreground">
Inputs flow left to right. Storage and memory persist the result.
</p>
</div>
<div className="grid gap-3 p-4 md:grid-cols-[1.1fr_1.1fr_1fr]">
<div className="rounded-md border border-fd-border bg-fd-background p-4">
<p className="text-[11px] font-semibold uppercase tracking-wide text-fd-muted-foreground">
Inputs
</p>
<ul className="mt-3 space-y-2 text-sm leading-6 text-fd-foreground">
<li><code className="text-[13px] font-semibold">connections</code> - warehouses, BI tools, dbt, Notion</li>
<li><code className="text-[13px] font-semibold">setup</code> - which connections are primary databases</li>
</ul>
</div>
<div className="rounded-md border-2 border-fd-primary bg-fd-background p-4">
<p className="text-[11px] font-semibold uppercase tracking-wide text-fd-primary">
Compute
</p>
<ul className="mt-3 space-y-2 text-sm leading-6 text-fd-foreground">
<li><code className="text-[13px] font-semibold">llm</code> - provider, models, prompt cache</li>
<li><code className="text-[13px] font-semibold">ingest</code> - connectors, embeddings, work units</li>
<li><code className="text-[13px] font-semibold">scan</code> - enrichment, relationships</li>
<li><code className="text-[13px] font-semibold">agent</code> - research-agent feature flags</li>
</ul>
</div>
<div className="rounded-md border border-fd-border bg-fd-background p-4">
<p className="text-[11px] font-semibold uppercase tracking-wide text-fd-muted-foreground">
Persistence
</p>
<ul className="mt-3 space-y-2 text-sm leading-6 text-fd-foreground">
<li><code className="text-[13px] font-semibold">storage</code> - state and search backends, git policy</li>
<li><code className="text-[13px] font-semibold">memory</code> - agent memory commit policy</li>
</ul>
</div>
</div>
</figure>
## Minimal config
A working `ktx.yaml` needs one entry in `connections`. Everything else accepts
defaults. The example below registers a local Postgres connection; building
context with `ktx ingest warehouse` also needs a model and embeddings, which
`ktx setup` configures.
```yaml
connections:
  warehouse:
    driver: postgres
    url: env:DATABASE_URL
```
## Secret references
Several fields accept either a literal value or a reference. References keep
secrets out of `ktx.yaml` so the file can stay in git.
| Form | Resolved to | Used for |
|------|-------------|----------|
| `env:VAR_NAME` | The value of the environment variable `VAR_NAME` at runtime | API keys, connection URLs, OAuth secrets |
| `file:/abs/path` or `file:~/path` | The first line of the referenced file, with `~` expanded to your home directory | Long-lived credentials kept under `.ktx/secrets/` |
| Literal string | Used as-is | Non-secret values such as `base_url` |

References work in: warehouse `url`, Metabase `api_key` / `api_key_ref`, Looker
`client_secret` / `client_secret_ref`, Notion / dbt / LookML / MetricFlow
`auth_token` / `auth_token_ref`, and any `api_key` under the `llm` and
`ingest.embeddings` blocks.
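
As a rough illustration, the three forms might appear side by side as shown here. The field names are the ones listed on this page; the secret file path and the proxy URL are hypothetical placeholders, not values **ktx** ships with.

```yaml
connections:
  warehouse:
    driver: postgres
    # env: reference, resolved from the DATABASE_URL environment variable at runtime
    url: env:DATABASE_URL
  metabase:
    driver: metabase
    api_url: https://metabase.example.com
    # file: reference; only the first line of the file is used (hypothetical path)
    api_key_ref: file:~/analytics-project/.ktx/secrets/metabase_api_key

llm:
  provider:
    backend: anthropic
    anthropic:
      api_key: env:ANTHROPIC_API_KEY
      # Literal string, used as-is (hypothetical proxy URL)
      base_url: https://llm-proxy.example.com
```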
## `connections`
The `connections` block is a map from a connection ID you choose to the
configuration for that connector. The connection ID is what every other part
of **ktx** uses to address a connector - `ktx ingest warehouse`,
`ktx sql --connection warehouse`, the semantic-layer path
`semantic-layer/warehouse/`, and so on.
Each entry is discriminated by the `driver` field. Warehouse drivers and
context-source drivers share the map.
| Driver | Kind | Required fields | Common optional fields |
|--------|------|-----------------|------------------------|
| `postgres` | Warehouse | `driver` | `url`, `enabled_tables`, `historicSql`, `context.queryHistory` |
| `mysql` | Warehouse | `driver` | `url`, `enabled_tables` |
| `sqlite` | Warehouse | `driver` | `url` or `path`, `enabled_tables` |
| `sqlserver` | Warehouse | `driver` | `url`, `enabled_tables` |
| `bigquery` | Warehouse | `driver` | `credentials_json`, `dataset_ids`, `enabled_tables`, `historicSql` |
| `snowflake` | Warehouse | `driver` | `schema_names`, `enabled_tables`, `historicSql` |
| `clickhouse` | Warehouse | `driver` | `url`, `database`, `databases`, `enabled_tables` |
| `metabase` | Context source | `driver`, `api_url` | `api_key_ref`, `mappings` |
| `looker` | Context source | `driver`, `base_url`, `client_id` | `client_secret_ref`, `mappings` |
| `lookml` | Context source | `driver`, `repoUrl` | `branch`, `path`, `auth_token_ref`, `mappings` |
| `dbt` | Context source | `driver`, one of `source_dir` or `repo_url` | `branch`, `path`, `profiles_path`, `target`, `project_name` |
| `metricflow` | Context source | `driver`, `metricflow.repoUrl` | `metricflow.branch`, `metricflow.path`, `metricflow.auth_token_ref` |
| `notion` | Context source | `driver`, `auth_token_ref` | `crawl_mode`, `root_*_ids`, `max_*_per_run` |
### Warehouse drivers
Warehouse connections are open objects: the listed fields are validated, and
any other field is preserved and passed through to the connector. Use
`enabled_tables` to scope ingest to a specific list of
`schema.table` names - useful for smoke tests.
```yaml
connections:
  warehouse:
    driver: postgres
    url: env:DATABASE_URL
    enabled_tables:
      - public.orders
      - public.customers
```
Connector-specific scope fields let setup and scan use the same warehouse
boundary:
```yaml
connections:
  mysql-warehouse:
    driver: mysql
    url: env:MYSQL_URL
    schemas: [analytics, mart]
  clickhouse-warehouse:
    driver: clickhouse
    url: env:CLICKHOUSE_URL
    database: analytics
    databases: [analytics, mart]
  bigquery-warehouse:
    driver: bigquery
    credentials_json: file:./service-account.json
    location: US
    dataset_ids: [analytics, mart]
```
For Postgres, MySQL, SQL Server, and Snowflake connections, set
`maxConnections` when scan or ingest work needs to stay below the target's
connection cap. Postgres, MySQL, and SQL Server default to `10`; Snowflake
defaults to `4`. This caps all concurrent SQL work for that connector instance,
including schema introspection, table sampling, relationship profiling,
relationship validation, and read-only SQL execution. BigQuery and ClickHouse
do not expose `maxConnections` because their connectors don't use client-side
connection pools.
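
A minimal sketch of lowering the cap, assuming `maxConnections` sits at the top level of the connection entry next to `driver` and `url`. The placement and the value `4` are illustrative assumptions.

```yaml
connections:
  warehouse:
    driver: postgres
    url: env:DATABASE_URL
    # Cap all concurrent SQL work for this connector below the Postgres default of 10
    maxConnections: 4
```
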
For Postgres, BigQuery, and Snowflake, `historicSql` and `context.queryHistory`
toggle query-history ingest. The shape is connector-specific; the setup wizard
writes these fields when you pass `--enable-query-history`.
```yaml
connections:
  warehouse:
    driver: postgres
    url: env:DATABASE_URL
    context:
      queryHistory:
        enabled: true
        enabledSchemas:
          - orbit_raw
          - orbit_analytics
        minExecutions: 5
```
- `enabledSchemas`: Optional list of schema or dataset names that query-history
ingest may mine. Omit it to let **ktx** derive the modeled schema floor from
the connection and semantic-layer sources. Use `["*"]` to disable the floor
for discovery runs.
- `filters.serviceAccounts`: Optional service-account filter block. During
setup, when query history is enabled and no service-account block already
exists, **ktx** can propose exact role patterns such as `^svc_loader$` from
observed in-scope query history. The block uses `mode: exclude` and remains
hand-editable.
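
For a discovery run, the schema floor could be switched off roughly as follows. This is a sketch that reuses the Postgres example from this section and changes only `enabledSchemas`.

```yaml
connections:
  warehouse:
    driver: postgres
    url: env:DATABASE_URL
    context:
      queryHistory:
        enabled: true
        # "*" disables the modeled schema floor, so every schema may be mined
        enabledSchemas: ["*"]
```

The quotes around `*` matter: an unquoted `*` starts a YAML alias and does not parse as a string.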
### Metabase
```yaml
connections:
  metabase:
    driver: metabase
    api_url: https://metabase.example.com
    api_key_ref: env:METABASE_API_KEY
    mappings:
      databaseMappings:
        "1": warehouse # Metabase DB id "1" -> ktx connection "warehouse"
      syncMode: ALL # ALL | ONLY | EXCEPT
```
| Field | Purpose |
|-------|---------|
| `api_url` | Metabase instance URL. Required. |
| `api_key` | Literal token. Prefer `api_key_ref`. |
| `api_key_ref` | Reference to the token (`env:` or `file:`). |
| `mappings.databaseMappings` | Map of Metabase database ID (positive-integer string) to a `ktx` warehouse connection ID. `null` explicitly unmaps. |
| `mappings.syncEnabled` | Per-database boolean toggle, keyed by Metabase DB ID. |
| `mappings.syncMode` | `ALL` (all mapped DBs), `ONLY` (those with `syncEnabled: true`), or `EXCEPT` (skip those with `syncEnabled: true`). Default `ALL`. |
| `mappings.selections.collections` / `items` | Optional Metabase collection or item IDs to scope ingest. |
| `mappings.defaultTagNames` | Default tag names attached to ingested artifacts. |
| `network_proxy` / `networkProxy` | Optional proxy configuration. |
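
To show how `syncMode` and `syncEnabled` interact, here is a sketch that syncs only one of two mapped databases. The Metabase database IDs and the `analytics` connection ID are hypothetical, and the shape of `syncEnabled` (a map keyed by Metabase DB ID) is inferred from the table above.

```yaml
connections:
  metabase:
    driver: metabase
    api_url: https://metabase.example.com
    api_key_ref: env:METABASE_API_KEY
    mappings:
      databaseMappings:
        "1": warehouse
        "2": analytics # hypothetical second ktx warehouse connection
      syncEnabled:
        "1": true
        "2": false
      # ONLY: sync just the databases with syncEnabled: true, here database "1"
      syncMode: ONLY
```

With `syncMode: EXCEPT`, the same `syncEnabled` map would instead skip database `"1"`.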
### Looker
```yaml
connections:
  looker:
    driver: looker
    base_url: https://looker.example.com
    client_id: ktx-integration
    client_secret_ref: env:LOOKER_CLIENT_SECRET
    mappings:
      connectionMappings:
        prod_warehouse: warehouse
```
| Field | Purpose |
|-------|---------|
| `base_url` | Looker instance URL. Required. |
| `client_id` | Looker OAuth client ID. Required. |
| `client_secret` / `client_secret_ref` | Literal secret or reference. Prefer the `_ref`. |
| `mappings.connectionMappings` | Map of Looker connection name to `ktx` warehouse connection ID. |
### LookML
```yaml
connections:
  lookml:
    driver: lookml
    repoUrl: git@github.com:org/lookml.git
    branch: main
    path: lookml/
    auth_token_ref: env:GITHUB_TOKEN
    mappings:
      expectedLookerConnectionName: prod_warehouse
```
| Field | Purpose |
|-------|---------|
| `repoUrl` | Git URL of the LookML project (`https`, `ssh`, or `file:`). Required. Camel-case by convention. |
| `branch` | Branch to fetch. Defaults to `main`. |
| `path` | Subdirectory inside the repo when LookML lives in a monorepo. |
| `auth_token_ref` | Reference to a Git auth token for private repos. |
| `mappings.expectedLookerConnectionName` | Looker connection name LookML models must declare. Mismatches block semantic-layer writes during ingest. |
### dbt
```yaml
connections:
  dbt_main:
    driver: dbt
    source_dir: ../dbt-project
    target: prod
```
| Field | Purpose |
|-------|---------|
| `source_dir` | Absolute or project-relative path to a local dbt project. |
| `repo_url` | Git URL of the dbt project. Use this instead of `source_dir` when fetching remotely. |
| `branch` | Branch to fetch when using `repo_url`. |
| `path` | Subdirectory inside the repo. |
| `auth_token_ref` | Git auth reference for private repos. |
| `profiles_path` | Override path to `profiles.yml`. |
| `target` | dbt target name (for example `dev`, `prod`). |
| `project_name` | Override the auto-detected dbt project name. |
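
The remote variant might look like the sketch below, which swaps `source_dir` for `repo_url` and adds the git fields from the table. The repository URL, subdirectory, and connection ID are hypothetical.

```yaml
connections:
  dbt_remote:
    driver: dbt
    # Use repo_url instead of source_dir to fetch the project from git
    repo_url: git@github.com:org/dbt-project.git
    branch: main
    path: transform/
    auth_token_ref: env:GITHUB_TOKEN
    target: prod
```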
### MetricFlow
```yaml
connections:
  metricflow:
    driver: metricflow
    metricflow:
      repoUrl: git@github.com:org/sl-config.git
      branch: main
      path: semantic_models/
      auth_token_ref: env:GITHUB_TOKEN
```
The MetricFlow connector wraps its fields in a nested `metricflow` block.
`repoUrl` is required; the rest mirrors the LookML / dbt git fields.
### Notion
```yaml
connections:
  notion:
    driver: notion
    auth_token_ref: env:NOTION_TOKEN
    crawl_mode: selected_roots
    root_database_ids:
      - 9f30c2c4d4f24a8d9a8d8e2c1b2a3d4e
    max_pages_per_run: 500
    max_knowledge_creates_per_run: 5
    max_knowledge_updates_per_run: 25
```
| Field | Purpose |
|-------|---------|
| `auth_token` / `auth_token_ref` | Notion integration token. Prefer the `_ref`. |
| `crawl_mode` | `selected_roots` (requires at least one `root_*_ids`) or `all_accessible`. |
| `root_page_ids`, `root_database_ids`, `root_data_source_ids` | Notion IDs to crawl when `crawl_mode` is `selected_roots`. |
| `max_pages_per_run` | Max pages fetched per ingest run (1-10000). |
| `max_knowledge_creates_per_run` | Max new wiki pages created per run (0-25). |
| `max_knowledge_updates_per_run` | Max existing wiki pages updated per run (0-100). |
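
For comparison, a sketch of the `all_accessible` mode, which needs no `root_*_ids`. The per-run limits are illustrative values chosen inside the documented ranges, not recommended settings.

```yaml
connections:
  notion:
    driver: notion
    auth_token_ref: env:NOTION_TOKEN
    # Crawl everything the integration token can reach; no root IDs required
    crawl_mode: all_accessible
    max_pages_per_run: 200
    max_knowledge_creates_per_run: 5
    max_knowledge_updates_per_run: 25
```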
## `setup`
The setup wizard writes this block. The only field **ktx** still reads is
`database_connection_ids`, which tells the ingest layer which entries in
`connections` are primary warehouses. When omitted, every warehouse-typed
connection is treated as primary.
```yaml
setup:
  database_connection_ids:
    - warehouse
```
| Field | Type | Default | Purpose |
|-------|------|---------|---------|
| `database_connection_ids` | `string[]` | `[]` | IDs in `connections` treated as primary warehouses by ingest and scan. |
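
The field matters most when a project has more than one warehouse-typed connection. In the sketch below, the `scratch` connection and its file path are hypothetical; only `warehouse` is treated as primary.

```yaml
connections:
  warehouse:
    driver: postgres
    url: env:DATABASE_URL
  scratch:
    driver: sqlite
    path: ./scratch.db # hypothetical local database file

setup:
  database_connection_ids:
    # scratch stays addressable by ID but is not a primary warehouse
    - warehouse
```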
## `storage`
`storage` controls where **ktx** keeps its own state and search index. Defaults
work for a single-user local project.
```yaml
storage:
  state: sqlite # sqlite | postgres
  search: sqlite-fts5 # sqlite-fts5 | postgres-hybrid
  git:
    author: "ktx <ktx@example.com>"
```
| Field | Type | Default | Purpose |
|-------|------|---------|---------|
| `state` | `sqlite` \| `postgres` | `sqlite` | Backend for ktx state. `sqlite` uses `.ktx/db.sqlite`; `postgres` expects a configured Postgres connection. |
| `search` | `sqlite-fts5` \| `postgres-hybrid` | `sqlite-fts5` | Backend for search indexes. `postgres-hybrid` combines lexical and vector search in Postgres. |
| `git.author` | `string` | `ktx <ktx@example.com>` | Git author identity for commits. Standard `Name <email>` form. |
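
A sketch of the non-default backends, using only the values listed in the table. How **ktx** locates the Postgres instance for `state: postgres` is not covered in this section, so this fragment assumes that connection is configured elsewhere.

```yaml
storage:
  state: postgres # expects a configured Postgres connection
  search: postgres-hybrid # lexical plus vector search in Postgres
  git:
    author: "ktx <ktx@example.com>"
```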
## `llm`
The `llm` block selects the LLM provider, lets you override the model used for
specific roles, and tunes prompt caching.
```yaml
llm:
  provider:
    backend: anthropic
    anthropic:
      api_key: env:ANTHROPIC_API_KEY
  models:
    default: claude-sonnet-4-6
    triage: claude-haiku-4-5
    candidateExtraction: claude-sonnet-4-6
    curator: claude-opus-4-7
    reconcile: claude-opus-4-7
    repair: claude-haiku-4-5
  promptCaching:
    enabled: true
    systemTtl: 1h
    toolsTtl: 1h
    historyTtl: 5m
    vertexFallbackTo5m: true
```
### Provider
| Field | Type | Default | Purpose |
|-------|------|---------|---------|
| `provider.backend` | `none` \| `anthropic` \| `vertex` \| `gateway` \| `claude-code` \| `codex` | `none` | Selected backend. `none` disables LLM features. `claude-code` uses the local Claude Code session and needs no API key. `codex` uses local Codex authentication and needs no API key. |
| `provider.anthropic.api_key` | `string` | - | Anthropic API key. Required when `backend: anthropic`. Accepts `env:` or `file:` references. |
| `provider.anthropic.base_url` | `string` | - | Override the Anthropic API base URL (proxy, self-hosted gateway). |
| `provider.gateway.api_key` / `base_url` | `string` | - | Credentials for an AI Gateway provider. Required when `backend: gateway`. |
| `provider.vertex.project` | `string` | - | Google Cloud project ID hosting the Vertex AI endpoint. |
| `provider.vertex.location` | `string` | - | Vertex AI region (for example `us-east5`). Required when the `vertex` block is present. |
Use `codex` when local Codex authentication should power **ktx** LLM work:
```yaml
llm:
  provider:
    backend: codex
  models:
    default: gpt-5.5
    triage: gpt-5.5
    candidateExtraction: gpt-5.5
    curator: gpt-5.5
    reconcile: gpt-5.5
    repair: gpt-5.5
```
### Model roles
`models` overrides the per-role model. Keys are fixed; values are
provider-specific model identifiers.
| Role | Used for |
|------|----------|
| `default` | Catch-all when no role-specific override exists. |
| `triage` | Cheap routing decisions during ingest and scan. |
| `candidateExtraction` | Extracting relationship and entity candidates from data. |
| `curator` | Reconciling proposed context against accepted files. |
| `reconcile` | Resolving conflicts between incoming and existing context. |
| `repair` | Fixing invalid generated YAML before write. |
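
As a sketch of how the `default` fallback plays out, the fragment below sets only `default` and one role override. It assumes the `claude-code` backend and the `sonnet` / `haiku` identifiers used in the full example on this page; under that assumption `triage` would use `haiku` and every other role would fall back to `sonnet`.

```yaml
llm:
  provider:
    backend: claude-code
  models:
    default: sonnet   # catch-all for every role not listed below
    triage: haiku     # cheaper model for routing decisions only
```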
### Prompt caching
| Field | Type | Default | Purpose |
|-------|------|---------|---------|
| `promptCaching.enabled` | `boolean` | backend default | Master switch for Anthropic-style prompt caching. |
| `promptCaching.systemTtl` | `5m` \| `1h` | backend default | Cache TTL for the system prompt segment. |
| `promptCaching.toolsTtl` | `5m` \| `1h` | backend default | Cache TTL for the tools/schema segment. |
| `promptCaching.historyTtl` | `5m` \| `1h` | backend default | Cache TTL for conversation-history breakpoints. |
| `promptCaching.vertexFallbackTo5m` | `boolean` | `false` | When `true`, downgrade `1h` TTLs to `5m` on Vertex, which does not support `1h` caching. |
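
A minimal sketch of a prompt-caching block. It assumes `promptCaching` nests directly under `llm`, as the relative field paths in the table suggest; the TTL choices are illustrative, not recommendations.

```yaml
llm:
  promptCaching:
    enabled: true
    systemTtl: 1h              # long TTL for the system prompt segment
    toolsTtl: 1h               # long TTL for the tools/schema segment
    historyTtl: 5m             # short TTL for conversation-history breakpoints
    vertexFallbackTo5m: true   # downgrade the 1h TTLs to 5m on Vertex
```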
## `ingest`
`ingest` controls how **ktx** builds context from your stack. It lists the
connectors to run, the embedding provider used when connectors embed documents,
and the concurrency and failure policy for work units.
```yaml
ingest:
  adapters:
    - live-database
    - dbt
    - metabase
  embeddings:
    backend: openai
    model: text-embedding-3-small
    dimensions: 1536
    openai:
      api_key: env:OPENAI_API_KEY
  workUnits:
    stepBudget: 40
    maxConcurrency: 2
    failureMode: continue
  rateLimit:
    enabled: true
    throttleThreshold: 0.8
    minConcurrencyUnderPressure: 1
    maxWaitMs: 600000
    retry:
      maxAttempts: 6
      baseDelayMs: 1000
      maxDelayMs: 60000
      jitter: true
```
### Connectors
`adapters` is a list of connector IDs that should run. Each ID matches a
connector that **ktx** ships locally:
| Connector ID | What it ingests |
|------------|-----------------|
| `live-database` | Live warehouse introspection (schemas, tables, columns, samples). |
| `historic-sql` | Query history from Postgres `pg_stat_statements`, BigQuery `INFORMATION_SCHEMA.JOBS`, or Snowflake query history. |
| `dbt` | dbt manifest models, sources, tests, and exposures. |
| `metricflow` | MetricFlow / Semantic Layer models and metrics. |
| `lookml` | LookML projects (models, explores, views, joins). |
| `looker` | Looker dashboards and looks via the API. |
| `metabase` | Metabase cards, dashboards, and database mappings. |
| `notion` | Notion pages and databases for wiki context. |
| `fake` | Test/demo connector. Useful in fixtures. |
### Embeddings
The `embeddings` block can also appear inside `scan.enrichment`; that override
wins when present.
| Field | Type | Default | Purpose |
|-------|------|---------|---------|
| `backend` | `none` \| `openai` \| `sentence-transformers` | `none` | Embedding provider. `none` disables embeddings. |
| `model` | `string` | - | Provider model ID, for example `text-embedding-3-small` or `all-MiniLM-L6-v2`. |
| `dimensions` | `int > 0` | `8` | Vector size. Default `8` is a placeholder that's only valid with `backend: none`. Set explicitly to match your model (1536 for `text-embedding-3-small`, 384 for `all-MiniLM-L6-v2`). |
| `openai.api_key` / `base_url` | `string` | - | OpenAI credentials. Required when `backend: openai`. |
| `sentenceTransformers.base_url` | `string` | `""` | URL of the sentence-transformers server. Leave empty when **ktx** manages the local daemon for you. |
| `sentenceTransformers.pathPrefix` | `string` | - | Optional URL path prefix prepended to embedding requests. |
| `batchSize` | `int > 0` | provider default | Texts per embedding API call. |
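
For comparison with the OpenAI example above, here is a sketch of a local sentence-transformers setup built from the model and dimension pairing named in the table. It assumes the managed local daemon, so `sentenceTransformers.base_url` is left at its empty default.

```yaml
ingest:
  embeddings:
    backend: sentence-transformers
    model: all-MiniLM-L6-v2
    dimensions: 384   # must match the model's vector size
    # sentenceTransformers.base_url omitted: the empty default applies
```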
### Work units
A work unit is one unit of agent-driven ingest work (for example one table or
one Metabase question). These knobs bound how long it runs and how the run
handles failures.
| Field | Type | Default | Purpose |
|-------|------|---------|---------|
| `workUnits.stepBudget` | `int > 0` | `40` | Maximum agent steps allowed per work unit before it's force-terminated. |
| `workUnits.maxConcurrency` | `int > 0` | `1` | How many work units run in parallel. |
| `workUnits.failureMode` | `abort` \| `continue` | `continue` | `abort` stops the whole ingest run on the first failure; `continue` records it and keeps going. |
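
A sketch of a stricter configuration, for example for a run where any failure should stop the ingest early. The values are illustrative; only the field names and accepted values come from the table.

```yaml
ingest:
  workUnits:
    stepBudget: 20        # end a work unit after 20 agent steps (default 40)
    maxConcurrency: 1     # run work units one at a time (the default)
    failureMode: abort    # stop the whole ingest run on the first failure
```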
### Rate limits
`rateLimit` controls provider-neutral pacing for LLM calls during ingest. When a
provider reports a subscription window, retry-after delay, or HTTP 429,
**ktx** pauses new work-unit model calls, shows a transient wait in the CLI,
and reduces work-unit concurrency while the provider is under pressure.
| Field | Type | Default | Purpose |
|-------|------|---------|---------|
| `rateLimit.enabled` | `boolean` | `true` | Master switch for ingest LLM rate-limit pacing and visible waits. |
| `rateLimit.throttleThreshold` | `number between 0 and 1` | `0.8` | Fraction of a known provider window at which **ktx** starts reducing concurrency. |
| `rateLimit.minConcurrencyUnderPressure` | `int > 0` | `1` | Effective work-unit concurrency while a provider is under rate-limit pressure. |
| `rateLimit.maxWaitMs` | `int > 0` | unset | Caps how long a single provider-reset wait can last. This bounds each wait, not the whole run: after a capped wait elapses **ktx** retries and may pause again. Omit to wait until the provider's reset time. |
| `rateLimit.retry.maxAttempts` | `int > 0` | `6` | Maximum attempts for a single rate-limited LLM call before the failure surfaces (counts the first try). Also bounds how far opaque backoff grows for responses without a reset time or retry-after value. |
| `rateLimit.retry.baseDelayMs` | `int > 0` | `1000` | Initial opaque retry delay in milliseconds. |
| `rateLimit.retry.maxDelayMs` | `int > 0` | `60000` | Maximum opaque retry delay in milliseconds. |
| `rateLimit.retry.jitter` | `boolean` | `true` | Add jitter to opaque retry delays. |
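
As an illustration of the two bounds, the sketch below caps each provider-reset wait and lowers the attempt count. The numbers are assumptions chosen for the example; fields left out keep the defaults from the table.

```yaml
ingest:
  rateLimit:
    enabled: true
    maxWaitMs: 300000   # cap each provider-reset wait at 5 minutes
    retry:
      maxAttempts: 3    # the first try plus at most two retries
```

Because `maxWaitMs` bounds each wait rather than the whole run, a run under sustained pressure can still pause several times in a row.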
## `scan`
`scan` configures how schema-level inputs become structured context:
column-level enrichment and inferred relationships between tables.
```yaml
scan:
  enrichment:
    mode: llm # none | deterministic | llm
  relationships:
    enabled: true
    llmProposals: true
    validationRequiredForManifest: true
    acceptThreshold: 0.85
    reviewThreshold: 0.55
    maxLlmTablesPerBatch: 40
    maxCandidatesPerColumn: 25
    profileSampleRows: 10000
    profileConcurrency: 4
    validationConcurrency: 4
    validationBudget: all
```
### Enrichment
| Field | Type | Default | Purpose |
|-------|------|---------|---------|
| `enrichment.mode` | `none` \| `deterministic` \| `llm` | `none` | How columns and tables get described. `deterministic` uses local heuristics; `llm` calls the configured provider. |
| `enrichment.embeddings` | embedding block | - | Optional override for enrichment-time vectorization. Falls back to `ingest.embeddings`. |
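
A sketch of the override in practice: ingest embeds with OpenAI while enrichment uses a local model. The specific combination is hypothetical; it only illustrates that `scan.enrichment.embeddings` takes precedence over `ingest.embeddings` when both are present.

```yaml
ingest:
  embeddings:
    backend: openai
    model: text-embedding-3-small
    dimensions: 1536
    openai:
      api_key: env:OPENAI_API_KEY
scan:
  enrichment:
    mode: llm
    embeddings:            # wins over ingest.embeddings during enrichment
      backend: sentence-transformers
      model: all-MiniLM-L6-v2
      dimensions: 384
```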
### Relationships
The relationship discovery step proposes joins between tables, scores them,
and optionally validates each one against the database before writing it to
the manifest.
| Field | Type | Default | Purpose |
|-------|------|---------|---------|
| `relationships.enabled` | `boolean` | `true` | Master switch for relationship discovery. |
| `relationships.llmProposals` | `boolean` | `true` | When `true`, propose relationships using the LLM in addition to deterministic candidates. |
| `relationships.validationRequiredForManifest` | `boolean` | `true` | When `true`, only proposals that pass database-side validation reach the manifest. |
| `relationships.acceptThreshold` | `number 0-1` | `0.85` | Confidence at or above which a proposal is auto-accepted. |
| `relationships.reviewThreshold` | `number 0-1` | `0.55` | Confidence at or above which a proposal is surfaced for human review (but not auto-accepted). |
| `relationships.maxLlmTablesPerBatch` | `int > 0` | `40` | Max tables included in a single LLM relationship-proposal batch. |
| `relationships.maxCandidatesPerColumn` | `int > 0` | `25` | Max join partners considered per column. |
| `relationships.profileSampleRows` | `int > 0` | `10000` | Rows sampled per table when profiling values for relationship inference. |
| `relationships.profileConcurrency` | `int > 0` | `4` | Parallel relationship-profile queries against the database. For pooled connectors, effective database concurrency is also bounded by the connection's `maxConnections`. |
| `relationships.validationConcurrency` | `int > 0` | `4` | Parallel relationship validation queries against the database. |
| `relationships.validationBudget` | `all` \| `int ≥ 0` | runtime default | Cap on validation queries per scan. `all` means unlimited. |
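
To make the two thresholds concrete, here is a sketch with tighter values than the defaults. With these illustrative numbers, a proposal scored `0.92` would be auto-accepted, while one scored `0.70` would be surfaced for review rather than accepted.

```yaml
scan:
  relationships:
    llmProposals: false      # deterministic candidates only
    acceptThreshold: 0.9     # auto-accept at confidence >= 0.9
    reviewThreshold: 0.6     # surface for review from 0.6 up to 0.9
    validationBudget: 200    # at most 200 validation queries per scan
```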
## `agent`
`agent` carries feature flags for **ktx**-side agent behavior. Today the only
block is `run_research`, which gates the research agent invoked by
`ktx mcp` and CLI research tools.
```yaml
agent:
  run_research:
    enabled: true
    max_iterations: 20
    default_toolset:
      - sl_query
      - wiki_search
      - sl_read_source
```
| Field | Type | Default | Purpose |
|-------|------|---------|---------|
| `run_research.enabled` | `boolean` | `false` | Master switch for the research agent. |
| `run_research.max_iterations` | `int ≥ 0` | `20` | Maximum tool-call iterations per research run. |
| `run_research.default_toolset` | `string[]` | `[sl_query, wiki_search, sl_read_source]` | Tool identifiers exposed to the research agent. |
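
A sketch of a narrower research agent. It assumes that a subset of the default tool identifiers is an accepted value for `default_toolset`; the iteration cap is an arbitrary example.

```yaml
agent:
  run_research:
    enabled: true
    max_iterations: 10     # half the default of 20
    default_toolset:       # assumed: any subset of the default identifiers
      - sl_query
      - wiki_search
```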
## A full example
Combining the blocks above:
```yaml
connections:
  warehouse:
    driver: postgres
    url: env:DATABASE_URL
  metabase:
    driver: metabase
    api_url: https://metabase.example.com
    api_key_ref: env:METABASE_API_KEY
    mappings:
      databaseMappings:
        "1": warehouse
      syncMode: ALL
setup:
  database_connection_ids:
    - warehouse
storage:
  state: sqlite
  search: sqlite-fts5
  git:
    author: "ktx <ktx@example.com>"
llm:
  provider:
    backend: claude-code
  models:
    default: sonnet
    triage: haiku
    candidateExtraction: sonnet
    curator: opus
    reconcile: opus
    repair: haiku
ingest:
  adapters:
    - live-database
    - metabase
  embeddings:
    backend: openai
    model: text-embedding-3-small
    dimensions: 1536
    openai:
      api_key: env:OPENAI_API_KEY
  workUnits:
    maxConcurrency: 2
scan:
  enrichment:
    mode: llm
  relationships:
    acceptThreshold: 0.85
    reviewThreshold: 0.55
agent:
  run_research:
    enabled: true
```
## Validating your config
**ktx** validates `ktx.yaml` when it loads, and treats two kinds of problems
differently:
- **An invalid value on a field ktx recognizes** (for example
`llm.provider.backend: nope`) is a hard error. Setup and CLI commands stop and
report the exact path so you can fix it.
- **An unrecognized key** — one left over from a different **ktx** version, or a
typo such as `scan.relationships.acceptThreshhold` — is tolerated, not fatal.
**ktx** ignores the key and keeps running, so a misspelled field quietly falls
back to its default instead of taking effect. `ktx status` lists each ignored
key as a warning (and exits `0`) so you can remove or correct it when
convenient.
Warehouse connections accept extra driver-specific fields, so passthrough values
like `historicSql` and `context.queryHistory` are allowed.
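
The two cases above can be seen side by side in this deliberately broken fragment, which reuses the examples from the list:

```yaml
llm:
  provider:
    backend: nope            # recognized field, invalid value: hard error
scan:
  relationships:
    acceptThreshhold: 0.9    # misspelled key: ignored, so acceptThreshold
                             # stays at its default of 0.85
```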
To re-validate without running anything else:
```bash
ktx status
```
`ktx status` parses `ktx.yaml`, surfaces validation issues, and reports which
inputs are ready.
## Related references
- [`ktx setup`](/docs/cli-reference/ktx-setup) - the guided flow that writes
most of these fields for you.
- [`ktx status`](/docs/cli-reference/ktx-status) - readiness check for the
current `ktx.yaml`.
- [LLM configuration](/docs/guides/llm-configuration) - provider-specific
setup notes.
- [Primary sources](/docs/integrations/primary-sources) and
[Context sources](/docs/integrations/context-sources) - connector-specific
details and credentials.