2026-05-11 00:45:43 -07:00
---
title: "ktx ingest"
2026-05-20 17:33:38 +02:00
description: "Build or refresh ktx context, or capture text into ktx memory."
2026-05-11 00:45:43 -07:00
---
2026-05-20 17:33:38 +02:00
`ktx ingest` builds or refreshes **ktx** context from configured connections, and
can also capture free-form text into **ktx** memory. Database connections build
2026-05-29 17:41:04 +02:00
enriched context — schema plus AI-generated descriptions, embeddings, and
relationship evidence — and require a configured model and embeddings.
Context-source connections ingest metadata from tools such as dbt, Looker,
Metabase, MetricFlow, LookML, and Notion. Pass `--text` or `--file` to capture
inline text or text files into memory instead.
2026-05-11 00:45:43 -07:00
2026-05-11 16:43:08 -07:00
## Command signature
2026-05-11 00:45:43 -07:00
```bash
2026-05-14 01:43:06 +02:00
ktx ingest [options] [connectionId]
2026-05-11 00:45:43 -07:00
```
2026-05-20 01:52:37 +02:00
- Bare `ktx ingest` (no positional, no `--all`) ingests every configured
connection.
- `ktx ingest <connectionId>` ingests one configured connection.
2026-05-20 17:33:38 +02:00
- `ktx ingest --text "..."` (or `--file <path>`) captures notes into **ktx**
2026-05-20 01:52:37 +02:00
memory instead of ingesting a connection.
2026-05-11 00:45:43 -07:00
2026-05-20 01:52:37 +02:00
Database connections run before context-source connections when more than one
connection is selected.
## Options
2026-05-11 00:45:43 -07:00
| Flag | Description | Default |
|------|-------------|---------|
2026-05-20 01:52:37 +02:00
| `--all` | Ingest all configured connections (same as bare invocation) | `false` |
2026-05-14 01:43:06 +02:00
| `--query-history` | Include database query-history usage patterns | Stored connection default |
| `--no-query-history` | Skip database query-history usage patterns for this run | Stored connection default |
2026-05-15 15:31:51 -04:00
| `--query-history-window-days <days>` | BigQuery/Snowflake query-history lookback window for this run | Stored connection default |
2026-05-20 17:33:38 +02:00
| `--text <content>` | Capture inline text into **ktx** memory; repeatable | `[]` |
| `--file <path>` | Capture a text file into **ktx** memory; use `-` for stdin; repeatable | `[]` |
| `--connection-id <connectionId>` | **ktx** connection id to tag captured text/file notes | - |
2026-05-20 01:52:37 +02:00
| `--user-id <id>` | Memory user id for text/file capture attribution | `local-cli` |
| `--fail-fast` | Stop after the first failed text/file item | `false` |
2026-05-13 12:00:08 +02:00
| `--plain` | Print plain text output | `true` |
2026-05-11 00:45:43 -07:00
| `--json` | Print JSON output | `false` |
2026-05-17 10:27:29 +02:00
| `--yes` | Install required managed runtime features without prompting | `false` |
2026-05-18 19:22:19 +02:00
| `--no-input` | Disable interactive terminal input | - |
2026-05-11 00:45:43 -07:00
2026-05-29 17:41:04 +02:00
Database ingest always builds enriched context and requires a configured model
and embeddings (run `ktx setup`); connections without that configuration fail
before any work starts. Query-history flags apply only to database connections
2026-05-15 15:31:51 -04:00
that support query history. The window flag applies to BigQuery and Snowflake;
Postgres reads the current `pg_stat_statements` aggregate data instead of a
2026-05-29 17:41:04 +02:00
time-windowed history table. Query-history ingest runs after the schema scan.
2026-05-14 12:53:55 -04:00
2026-05-20 01:52:37 +02:00
When more than one connection is selected, database ingest runs first, then
2026-05-20 17:33:38 +02:00
context-source ingest and memory updates run for context-source connections.
2026-05-14 12:53:55 -04:00
2026-05-20 17:33:38 +02:00
Some ingest paths use the managed **ktx** Python runtime. Query-history ingest uses
it for SQL analysis, and Looker context-source ingest uses it for Looker identifier
2026-05-17 10:27:29 +02:00
parsing. In an interactive terminal, `ktx ingest` prompts before installing the
required runtime features. Use `--yes` to install them without prompting, or
use `--no-input` to fail fast with install guidance.
2026-05-20 01:52:37 +02:00
`--text` and `--file` cannot be combined with a positional `connectionId` or
`--all`; pass `--connection-id <id>` instead to tag captured notes.
2026-05-11 00:45:43 -07:00
## Examples
```bash
2026-05-20 01:52:37 +02:00
# Build every configured connection (bare = --all)
ktx ingest
2026-05-20 17:33:38 +02:00
# Build one database or context-source connection
2026-05-14 01:43:06 +02:00
ktx ingest warehouse
2026-05-14 12:53:55 -04:00
# Include query-history usage patterns
2026-05-29 17:41:04 +02:00
ktx ingest warehouse --query-history
2026-05-15 15:31:51 -04:00
# Set the lookback window for BigQuery or Snowflake query history
2026-05-14 01:43:06 +02:00
ktx ingest warehouse --query-history-window-days 30
2026-05-14 12:53:55 -04:00
2026-05-20 17:33:38 +02:00
# Build a context-source connection
2026-05-14 01:43:06 +02:00
ktx ingest notion
2026-05-14 12:53:55 -04:00
2026-05-20 01:52:37 +02:00
# Capture inline text into memory
ktx ingest --text "Refunds are excluded from net revenue."
# Capture multiple text snippets in one call
ktx ingest --text "Revenue is gross receipts." --text "Orders are completed purchases."
2026-05-14 12:53:55 -04:00
2026-05-20 01:52:37 +02:00
# Capture a local Markdown file into memory and tag it to a connection
ktx ingest --file docs/revenue-notes.md --connection-id warehouse
2026-05-14 12:53:55 -04:00
# Capture one stdin item
2026-05-20 01:52:37 +02:00
printf "Refunds are excluded from net revenue." | ktx ingest --file -
2026-05-14 12:53:55 -04:00
```
## Output
Plain output summarizes each target and the operations that ran.
```text
Ingest finished
Source Database schema Query history Source ingest Memory update
warehouse done done skipped skipped
notion skipped skipped done done
2026-05-11 16:43:08 -07:00
```
2026-05-14 12:53:55 -04:00
Use `--json` when a script or agent needs the selected plan and per-target
results.
2026-05-20 17:33:38 +02:00
## Inspect context-source ingest traces
2026-05-18 13:38:06 +02:00
2026-05-20 17:33:38 +02:00
Context-source ingest writes persistent JSONL traces for postmortem debugging.
Plain ingest output prints the trace path near the report, run, and job
identifiers when a trace is available:
2026-05-18 13:38:06 +02:00
```text
Report: report-abc123
Run: run-abc123
Job: job-abc123
Trace: .ktx/ingest-traces/job-abc123/trace.jsonl
```
The trace file lives under the project directory at
`.ktx/ingest-traces/<jobId>/trace.jsonl`. Each line is a JSON event with the
job id, run id, sync id, connection id, source key, phase, event name, timing,
state snapshot, decision context, and error details. Failed runs also write a
stored ingest report with `status: "failed"`, `failure.phase`,
`failure.message`, and the same trace path.
Use `jq` or line-oriented tools to inspect a trace:
```bash
jq -c '. | {at, level, phase, event, durationMs, data, error}' \
.ktx/ingest-traces/<jobId>/trace.jsonl
```
2026-05-20 17:33:38 +02:00
**ktx** writes `debug` trace events by default. Set `KTX_INGEST_TRACE_LEVEL` to
2026-05-18 13:38:06 +02:00
`error`, `info`, `debug`, or `trace` before running ingest to change the trace
verbosity:
```bash
KTX_INGEST_TRACE_LEVEL=trace ktx ingest metabase
```
feat(cli): profile ingest runs and split model vs tool time (#249)
* feat(cli): profile ingest runs to find where wall-clock time goes
Add opt-in profiling for `ktx ingest`. Each timed phase, work unit, and
agent loop now records durationMs / step count / token usage in the
trace, and a post-run aggregator rolls them up into a "where did the
time go" report printed to stderr.
Enable per run with KTX_PROFILE_INGEST (1/true -> human table, json ->
raw structured profile) or persistently via `ingest.profile` in
ktx.yaml. The json form emits raw milliseconds, token counts, and a
summary.headline one-line diagnosis so coding agents can parse it
directly; json wins when both env and config request profiling.
- runtime-port: RunLoopMetrics (totalMs, usage, stepCount,
stepBoundariesMs) plus onMetrics callbacks on text/object generation
- ai-sdk + claude-code runtimes: capture per-loop timing and token usage
- work-unit-executor and stages 3/4: thread metrics into trace events
- ingest-bundle.runner: time worktree / triage / clustering / index /
reconcile / squash phases and emit the profile in a finally block
(best-effort; never affects the run outcome)
- ingest-profile: new trace+transcript aggregator with table/json formatters
- config: ingest.profile flag; docs: profiling section in ktx-ingest.mdx
* fix(cli): flush tool-call logs before reading ingest profile
Tool transcripts are appended fire-and-forget so the agent hot path never
blocks on logging. The ingest profiler read them before the writes settled,
so per-work-unit toolMs (and the model-vs-tool split derived from it) could
be incomplete. Track in-flight appends and expose flushToolCallLogs() —
bounded by a timeout so it can never hang — and flush before the profiler
reads the transcript.
2026-06-01 15:49:17 +02:00
### Profiling a slow ingest
Each timed phase and work unit records a `durationMs` in the trace, and each
agent loop records its step count and token usage. To see where wall-clock time
went, enable profiling and **ktx** prints a rolled-up breakdown to stderr at the
end of the run. There are two ways to turn it on, and two output formats.
Turn it on per run with the `KTX_PROFILE_INGEST` environment variable, or
persistently with `ingest.profile` in `ktx.yaml` (useful for CI or while
iterating on a slow source):
```bash
KTX_PROFILE_INGEST=1 ktx ingest metabase # human-readable table
KTX_PROFILE_INGEST=json ktx ingest metabase # raw JSON for coding agents
```
```yaml
ingest:
profile: true # human table; use "json" for the machine-readable form
```
Both formats report total wall time, time per phase, and the slowest work units,
splitting each work unit's agent-loop time into model time versus tool-execution
time. The `json` form emits the full structured profile (raw milliseconds and
token counts, stable keys) plus a `summary.headline` one-line diagnosis, so a
coding agent can parse it directly instead of scraping the table. If both the env
var and the config request profiling, `json` wins. Example headline:
```text
Slowest phase: reconciliation (2m 05s, 48% of wall time). 2 work units (1 failed), ~88% model generation vs ~12% tools.
```
Work units run serially by default (`ingest.workUnits.maxConcurrency` is `1`);
raise it in `ktx.yaml` if the profile shows the run is bound by serialized
feat(cli): add ingest LLM rate-limit governor with paced retries (#261)
* feat(cli): add ingest rate limit governor
* feat(cli): wire ingest rate-limit config
* feat(cli): report provider rate-limit signals
* feat(cli): show ingest rate-limit waits
* fix(cli): complete rate-limit event coverage
* fix(cli): abort ingest provider calls cleanly
* fix(cli): propagate ingest cancellation
* fix(cli): reject pre-aborted ingest rate-limit waits
* fix(cli): honor Claude rate-limit reset waits
* fix(cli): retry thrown Codex rate-limit failures
* fix(cli): type Claude rate-limit result details
* fix(cli): emit ingest rate-limit countdowns from rejected signals
* fix(cli): report ai sdk rate-limit header utilization
* fix(cli): gate LLM rate-limit retries on the governor budget
The AI SDK and Codex runtimes retried 429 / opaque rate-limit failures up
to 6-7 times with no backoff when constructed without a RateLimitGovernor
(scan, memory, setup) or with pacing disabled, ignoring Retry-After and
worsening the limit. The outer retry loop only cooperates with the
governor's pause, so without active pacing there is no backoff to apply.
Route the retry bound through a single source: RateLimitGovernor
.maxRetryAttempts(), which returns retry.maxAttempts when enabled and 1
(no outer retry) when absent or disabled. All three runtimes (ai-sdk,
codex, claude-code) now use it, so ingest.rateLimit.retry.maxAttempts
genuinely controls attempts and the hard-coded 6 (plus Codex's off-by-one
extra attempt) is gone. Backend-native retry (e.g. the AI SDK's maxRetries)
still handles transient 429s.
Also correct the ktx.yaml docs for maxWaitMs (caps each wait, not the whole
run) and maxAttempts, and sync uv.lock ktx-sl/ktx-daemon to 0.9.0.
2026-06-05 12:10:27 +02:00
work-unit agent loops. If the provider reports an LLM rate limit, **ktx** shows
a transient wait message and temporarily reduces effective work-unit concurrency
according to `ingest.rateLimit`.
feat(cli): profile ingest runs and split model vs tool time (#249)
* feat(cli): profile ingest runs to find where wall-clock time goes
Add opt-in profiling for `ktx ingest`. Each timed phase, work unit, and
agent loop now records durationMs / step count / token usage in the
trace, and a post-run aggregator rolls them up into a "where did the
time go" report printed to stderr.
Enable per run with KTX_PROFILE_INGEST (1/true -> human table, json ->
raw structured profile) or persistently via `ingest.profile` in
ktx.yaml. The json form emits raw milliseconds, token counts, and a
summary.headline one-line diagnosis so coding agents can parse it
directly; json wins when both env and config request profiling.
- runtime-port: RunLoopMetrics (totalMs, usage, stepCount,
stepBoundariesMs) plus onMetrics callbacks on text/object generation
- ai-sdk + claude-code runtimes: capture per-loop timing and token usage
- work-unit-executor and stages 3/4: thread metrics into trace events
- ingest-bundle.runner: time worktree / triage / clustering / index /
reconcile / squash phases and emit the profile in a finally block
(best-effort; never affects the run outcome)
- ingest-profile: new trace+transcript aggregator with table/json formatters
- config: ingest.profile flag; docs: profiling section in ktx-ingest.mdx
* fix(cli): flush tool-call logs before reading ingest profile
Tool transcripts are appended fire-and-forget so the agent hot path never
blocks on logging. The ingest profiler read them before the writes settled,
so per-work-unit toolMs (and the model-vs-tool split derived from it) could
be incomplete. Track in-flight appends and expose flushToolCallLogs() —
bounded by a timeout so it can never hang — and flush before the profiler
reads the transcript.
2026-06-01 15:49:17 +02:00
2026-05-11 16:43:08 -07:00
## Common errors
| Error | Cause | Recovery |
|-------|-------|----------|
2026-05-14 01:43:06 +02:00
| Connection not configured | The connection id is not present in `ktx.yaml` | Add the connection with `ktx setup` or update `ktx.yaml` |
2026-05-29 17:41:04 +02:00
| Enrichment is not configured | Database ingest needs a model, embeddings, and scan-enrichment configuration | Run `ktx setup` to configure a model and embeddings |
| Query history is unsupported | The selected database driver does not support query history | Run ingest without query-history flags |
2026-05-20 01:36:54 +02:00
| Python runtime is missing | The selected ingest target needs runtime-backed SQL analysis or source parsing | Accept the interactive prompt, rerun with `--yes`, or run the suggested `ktx admin runtime install` command |
2026-05-29 17:41:04 +02:00
| Context-source options were ignored | Query-history flags were supplied for a context-source connection | Omit database-only flags when ingesting context-source connections |
2026-05-14 12:53:55 -04:00
| Text ingest stops early | `--fail-fast` was used and one item failed | Fix the failed item or rerun without `--fail-fast` to collect all failures |