ktx/docs-site/content/docs/cli-reference/ktx-ingest.mdx
Matt Senick (Sigma) acd20ac248
feat(sigma): add Sigma Computing context-source adapter (#316)
* feat(sigma): add Sigma Computing context-source adapter

Closes #168

Adds a full ingest adapter for Sigma Computing so `ktx ingest` can pull
data model specs and workbook summaries into the ktx context layer. The
implementation follows the same fetch → chunk → project → LLM pattern
used by the Looker, Metabase, and MetricFlow adapters.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(sigma): address PR review comments

- Remove manifest from rawFiles; moves to peerFileIndex so fetchedAt
  changes don't mark all work units dirty every run
- Fix workbookFilter.updatedSince eviction bug: fetch full universe first,
  apply filter client-side, evict only on archived/deleted
- Remove measure projection entirely; project() writes measures: [] and
  the sigma_ingest skill surfaces Lookup/aggregation formulas as wiki prose
- Remove joins projection (v1 limitation); project() writes joins: [] and
  Lookup relationships are described in wiki prose instead
- Remove write-back dead code: createDataModel, updateDataModel,
  SigmaDataModelPushResult, mutate/post/put
- Fix emitBatches notes pluralization bug ('2 data modelss' → '2 data models')
- Add tokenInflight dedup on ensureToken to coalesce concurrent auth requests
- Retry spec fetch when existing staged spec is null (transient failure cache)
- Drop unused WorkbookFilter import from client-port.ts
- Note in docs that joins are not projected from Sigma data models in this release

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* updates

* fix(sigma): restore sigma in local adapter test + small cleanups

The gdrive↔sigma merge dropped 'sigma' from the expected adapter source
list in local-adapters.test.ts while keeping gdrive, so the slow TS suite
failed even though the source registers both. Add 'sigma' back at its
registration position (after metabase, before gdrive).

Also:
- Move the orphaned SigmaPullConfig docstring onto the schema it documents
  and drop the stale BullMQ reference (standalone ktx has no BullMQ; the
  config lives in the ingest job's bundleRef.config).
- Drop an O(n^2) find() round-trip in fetch() when building the active
  data-model list; filter once and reuse for the eviction id set.

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Andrey Avtomonov <andreybavt@gmail.com>
Co-authored-by: Luca Martial <48870843+luca-martial@users.noreply.github.com>
2026-07-01 01:14:57 +02:00


---
title: "ktx ingest"
description: "Build or refresh ktx context, or capture text into ktx memory."
---
`ktx ingest` builds or refreshes **ktx** context from configured connections, and
can also capture free-form text into **ktx** memory. Database connections build
enriched context — schema plus AI-generated descriptions, embeddings, and
relationship evidence — and require a configured model and embeddings.
Context-source connections ingest metadata from tools such as dbt, Looker,
Metabase, MetricFlow, LookML, Notion, and Sigma. Pass `--text` or `--file` to capture
inline text or text files into memory instead.
## Command signature
```bash
ktx ingest [options] [connectionId]
```
- Bare `ktx ingest` (no positional, no `--all`) ingests every configured
connection.
- `ktx ingest <connectionId>` ingests one configured connection.
- `ktx ingest --text "..."` (or `--file <path>`) captures notes into **ktx**
memory instead of ingesting a connection.

Database connections run before context-source connections when more than one
connection is selected.
## Options
| Flag | Description | Default |
|------|-------------|---------|
| `--all` | Ingest all configured connections (same as bare invocation) | `false` |
| `--query-history` | Include database query-history usage patterns | Stored connection default |
| `--no-query-history` | Skip database query-history usage patterns for this run | Stored connection default |
| `--query-history-window-days <days>` | BigQuery/Snowflake query-history lookback window for this run | Stored connection default |
| `--stages <list>` | Comma-separated enrichment stages to (re)run: `descriptions`, `embeddings`, `relationships` | All three |
| `--text <content>` | Capture inline text into **ktx** memory; repeatable | `[]` |
| `--file <path>` | Capture a text file into **ktx** memory; use `-` for stdin; repeatable | `[]` |
| `--verbatim` | Store each `--text`/`--file` document body unchanged as a `GLOBAL` wiki page; the LLM derives metadata only | `false` |
| `--connection-id <connectionId>` | **ktx** connection id to tag captured text/file notes | - |
| `--user-id <id>` | Memory user id for text/file capture attribution | `local-cli` |
| `--fail-fast` | Stop after the first failed text/file item | `false` |
| `--plain` | Print plain text output | `true` |
| `--json` | Print JSON output | `false` |
| `--yes` | Install required managed runtime features without prompting | `false` |
| `--no-input` | Disable interactive terminal input | - |

Database ingest always builds enriched context and requires a configured model
and embeddings (run `ktx setup`); connections without that configuration fail
before any work starts. Query-history flags apply only to database connections
that support query history. The window flag applies to BigQuery and Snowflake;
Postgres reads the current `pg_stat_statements` aggregate data instead of a
time-windowed history table. Query-history ingest runs after the schema scan.
When more than one connection is selected, database ingest runs first, then
context-source ingest and memory updates run for context-source connections.

Some ingest paths use the managed **ktx** Python runtime. Query-history ingest uses
it for SQL analysis, and Looker context-source ingest uses it for Looker identifier
parsing. In an interactive terminal, `ktx ingest` prompts before installing the
required runtime features. Use `--yes` to install them without prompting, or
use `--no-input` to fail fast with install guidance.

`--text` and `--file` cannot be combined with a positional `connectionId` or
`--all`; pass `--connection-id <id>` instead to tag captured notes.
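The combination rule above can be sketched as a small validation step. This is only an illustration in Python, not the ktx parser: `IngestArgs`, `resolve_mode`, and the mode names are hypothetical, and only the rejected combination comes from the text above.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class IngestArgs:
    """Hypothetical container for parsed flags; not a ktx type."""
    connection_id: Optional[str] = None        # positional connectionId
    all_connections: bool = False              # --all
    texts: List[str] = field(default_factory=list)   # --text (repeatable)
    files: List[str] = field(default_factory=list)   # --file (repeatable)
    tag_connection_id: Optional[str] = None    # --connection-id


def resolve_mode(args: IngestArgs) -> str:
    """Return an illustrative mode name, rejecting the documented conflict."""
    capturing = bool(args.texts or args.files)
    if capturing and (args.connection_id is not None or args.all_connections):
        raise ValueError(
            "--text/--file cannot be combined with a positional connectionId "
            "or --all; pass --connection-id <id> to tag captured notes"
        )
    if capturing:
        return "capture"
    if args.connection_id is not None:
        return "single-connection"
    # Bare invocation and --all both select every configured connection.
    return "all-connections"
```

Tagging with `--connection-id` stays valid in capture mode because it only labels the captured notes and does not select a connection to ingest.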
### Verbatim ingest
By default, captured text is routed through the memory agent, which decides what
to persist and may rewrite, condense, split, or re-title it. For *authoritative*
documents — metric definitions, formula specs, runbooks, compliance text — that
paraphrasing is a defect. Add `--verbatim` to store each `--text`/`--file`
document body **unchanged** as a `GLOBAL` wiki page:
- The stored body is the input document, written by code; the LLM never edits it.
The LLM is used only to derive page metadata (`summary`, `tags`, `sl_refs`), and even
that is skipped for fields the document's own frontmatter already sets.
- The page key is deterministic: a `--file` derives it from the filename, inline
`--text` from the document's leading Markdown heading (inline text without a
heading is rejected — pass it as `--file` instead).
- Ingest is idempotent. Re-running the same document is a safe no-op; a different
body at the same key fails loudly rather than overwriting.
- `--verbatim` works with `llm.provider.backend: none` — the only ingest path that
does. With no backend the `summary` is derived from the heading or first
sentence and `tags`/`sl_refs` are left empty; the full body is still stored.
- Existing frontmatter passes through untouched (including fields **ktx** does not
model, such as `effective_date` or `version`); generated metadata only fills
absent fields. `--connection-id <id>` scopes the page to that connection by
setting its `connections` frontmatter.
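The key and idempotency rules in the list above can be sketched as follows. This is a minimal Python illustration under stated assumptions: the slug scheme, the in-memory `pages` dict, and the function names are hypothetical and do not describe how ktx stores pages.

```python
import re
from pathlib import Path
from typing import Dict, Optional


def slugify(text: str) -> str:
    # Assumed slug scheme: lowercase, non-alphanumeric runs become "-".
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def page_key(body: str, file_path: Optional[str] = None) -> str:
    """Derive a deterministic key from the filename or the leading heading."""
    if file_path is not None:
        return slugify(Path(file_path).stem)
    for line in body.splitlines():
        if not line.strip():
            continue  # skip blank lines before the first content line
        match = re.match(r"#{1,6}\s+(.+)", line)
        if match:
            return slugify(match.group(1))
        break  # first content line is not a heading
    raise ValueError("inline verbatim text needs a leading Markdown heading")


def store_verbatim(pages: Dict[str, str], key: str, body: str) -> str:
    """Store the body unchanged; same body is a no-op, different body fails."""
    existing = pages.get(key)
    if existing is None:
        pages[key] = body
        return "created"
    if existing == body:
        return "unchanged"
    raise ValueError(f"a different page already exists at key {key!r}")
```

Because the key depends only on the filename or heading, re-running the same document always lands on the same page, which is what makes the no-op and the loud failure possible.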
### Selecting enrichment stages
Database enrichment runs three stages: `descriptions` (one LLM call per table),
`embeddings` (vectors over the schema and descriptions), and `relationships`
(join detection, optionally LLM-proposed). Each stage is cached on a **per-stage
hash of only its own inputs**, so changing one stage's inputs invalidates only
that stage. Switching the description LLM re-runs only `descriptions`; upgrading
the embeddings model re-runs only `embeddings`; turning on
`scan.relationships.llmProposals` re-runs only `relationships`. The expensive
per-table descriptions are never thrown away because an unrelated setting moved.
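The per-stage caching idea can be sketched as below. This is a Python illustration only: the `STAGE_INPUTS` mapping and the setting names are hypothetical simplifications, and the real inputs that ktx hashes for each stage are not listed in this page.

```python
import hashlib
import json
from typing import Dict, List

# Hypothetical mapping of each stage to the settings it depends on.
STAGE_INPUTS = {
    "descriptions": ["schema", "description_model"],
    "embeddings": ["schema", "embeddings_model"],
    "relationships": ["schema", "llm_proposals"],
}


def stage_hash(stage: str, settings: Dict[str, object]) -> str:
    """Hash only the settings this stage depends on."""
    relevant = {name: settings[name] for name in STAGE_INPUTS[stage]}
    payload = json.dumps(relevant, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def stages_to_rerun(settings: Dict[str, object],
                    cached_hashes: Dict[str, str]) -> List[str]:
    """Return the stages whose cached hash no longer matches their inputs."""
    return [
        stage for stage in STAGE_INPUTS
        if cached_hashes.get(stage) != stage_hash(stage, settings)
    ]
```

With hashes scoped this way, changing `embeddings_model` in the sketch invalidates only `embeddings`, mirroring the behavior described above.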
`--stages <list>` re-runs a chosen subset on an already-ingested connection. A
named stage is **force-recomputed** (it bypasses the completed-stage cache),
while unselected stages are left exactly as they are on disk:
- `ktx ingest warehouse --stages embeddings` — re-embed on a new model, keeping
descriptions and joins.
- `ktx ingest --all --stages relationships --no-query-history` — backfill joins
across every database after enabling `llmProposals`, without re-paying for
descriptions.
- `ktx ingest warehouse --stages descriptions` — re-run thin descriptions (for
example after raising `KTX_ENRICH_LLM_TIMEOUT_MS`). When nothing the
descriptions depend on changed, the per-table resume record means only the
tables that previously failed are re-sent to the LLM.

Stage names are validated: an unknown or empty name (`--stages foo`, `--stages
descriptions,foo`, `--stages ""`) is a hard parse error. Naming all three
(`--stages descriptions,embeddings,relationships`) forces a full enrichment
recompute, which is **not** the same as omitting the flag (omitting resumes
whatever is already done). After a selective run, **ktx** warns
(`enrichment_stage_stale`) when an unselected stage's inputs no longer match what
it was last built from — for example, re-running `descriptions` flags
`embeddings` as stale until you re-run `--stages embeddings`. The warning is
informational; **ktx** never silently cascades the extra work.
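The validation and staleness rules can be sketched as below. This Python sketch is illustrative rather than the ktx implementation: `parse_stages` and `stale_unselected` are hypothetical names, and trimming whitespace and dropping duplicate names are assumptions the text above does not confirm.

```python
from typing import Dict, List

VALID_STAGES = ("descriptions", "embeddings", "relationships")


def parse_stages(raw: str) -> List[str]:
    """Parse a --stages value; any unknown or empty name is an error."""
    selected: List[str] = []
    for part in raw.split(","):
        name = part.strip()  # assumption: surrounding whitespace is ignored
        if name not in VALID_STAGES:
            raise ValueError(f"unknown or empty stage name: {name!r}")
        if name not in selected:  # assumption: duplicates are collapsed
            selected.append(name)
    return selected


def stale_unselected(selected: List[str],
                     recorded_hashes: Dict[str, str],
                     current_hashes: Dict[str, str]) -> List[str]:
    """List unselected stages whose inputs changed since they were built."""
    return [
        stage for stage in VALID_STAGES
        if stage not in selected
        and recorded_hashes.get(stage) != current_hashes.get(stage)
    ]
```

In this sketch the stale list only feeds a warning; nothing is re-run on behalf of the caller, which matches the informational behavior described above.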
## Examples
```bash
# Build every configured connection (bare = --all)
ktx ingest
# Build one database or context-source connection
ktx ingest warehouse
# Include query-history usage patterns
ktx ingest warehouse --query-history
# Set the lookback window for BigQuery or Snowflake query history
ktx ingest warehouse --query-history-window-days 30
# Re-embed one connection on a new embeddings model (descriptions/joins untouched)
ktx ingest warehouse --stages embeddings
# Backfill LLM-proposed joins across every database without re-describing
ktx ingest --all --stages relationships --no-query-history
# Build a context-source connection
ktx ingest notion
# Capture inline text into memory
ktx ingest --text "Refunds are excluded from net revenue."
# Capture multiple text snippets in one call
ktx ingest --text "Revenue is gross receipts." --text "Orders are completed purchases."
# Capture a local Markdown file into memory and tag it to a connection
ktx ingest --file docs/revenue-notes.md --connection-id warehouse
# Capture one stdin item
printf "Refunds are excluded from net revenue." | ktx ingest --file -
# Store an authoritative document verbatim (body preserved exactly)
ktx ingest --file docs/rfm-bucket-definitions.md --verbatim
# Store it verbatim and scope it to one connection
ktx ingest --file docs/haversine-formula.md --verbatim --connection-id warehouse
```
## Output
Plain output summarizes each target and the operations that ran.
```text
Ingest finished
Source     Database schema  Query history  Source ingest  Memory update
warehouse  done             done           skipped        skipped
notion     skipped          skipped        done           done
```
Use `--json` when a script or agent needs the selected plan and per-target
results.
## Final validation pruning
At the end of a context-source ingest, **ktx** validates the composed semantic
layer and wiki before saving it. If the final validation finds dangling
references, **ktx** removes the reference instead of failing accepted work. This
can remove joins that point at missing semantic sources, wiki `refs`, wiki
`sl_refs`, and inline wiki body references. If a generated semantic source is
invalid, **ktx** drops that source from the final save.

The stored ingest report records these changes as `finalGatePrunedReferences`
and `finalGateDroppedSources`. The trace emits `final_gate_reference_pruned`,
`final_gate_source_dropped`, `final_gate_prune_committed`, and
`final_gate_prune_finished` events when pruning runs. If validation still fails
after pruning, the ingest fails and the report keeps the final validation error.
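The prune-instead-of-fail behavior can be sketched as below. This is a Python illustration with hypothetical data shapes: semantic sources and wiki pages are plain dicts, `refs` is assumed to point at wiki page keys, and `sl_refs` is assumed to point at semantic sources. Inline body references and invalid-source dropping are left out.

```python
from typing import Dict, List


def prune_dangling_references(sources: Dict[str, dict],
                              wiki_pages: Dict[str, dict]) -> List[dict]:
    """Remove references to missing targets and report what was removed."""
    pruned: List[dict] = []

    # Joins that point at a semantic source that does not exist.
    for name, source in sources.items():
        kept = []
        for target in source.get("joins", []):
            if target in sources:
                kept.append(target)
            else:
                pruned.append({"owner": name, "kind": "join", "target": target})
        source["joins"] = kept

    # Wiki refs (to pages) and sl_refs (to semantic sources).
    for key, page in wiki_pages.items():
        for field_name, universe in (("refs", wiki_pages), ("sl_refs", sources)):
            kept = []
            for target in page.get(field_name, []):
                if target in universe:
                    kept.append(target)
                else:
                    pruned.append({"owner": key, "kind": field_name, "target": target})
            page[field_name] = kept

    return pruned
```

The returned list plays the role of a `finalGatePrunedReferences`-style record in this sketch: accepted work is kept, and every removal stays visible afterward.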
## Inspect context-source ingest traces
Context-source ingest writes persistent JSONL traces for postmortem debugging.
Plain ingest output prints the trace path near the report, run, and job
identifiers when a trace is available:
```text
Report: report-abc123
Run: run-abc123
Job: job-abc123
Trace: .ktx/ingest-traces/job-abc123/trace.jsonl
```
The trace file lives under the project directory at
`.ktx/ingest-traces/<jobId>/trace.jsonl`. Each line is a JSON event with the
job id, run id, sync id, connection id, source key, phase, event name, timing,
state snapshot, decision context, and error details. Failed runs also write a
stored ingest report with `status: "failed"`, `failure.phase`,
`failure.message`, and the same trace path.

Use `jq` or line-oriented tools to inspect a trace:
```bash
jq -c '. | {at, level, phase, event, durationMs, data, error}' \
.ktx/ingest-traces/<jobId>/trace.jsonl
```
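When `jq` is not available, the same JSONL file can be read with a few lines of Python. The sketch below assumes only the `phase` and `durationMs` fields named above; `duration_by_phase` is a hypothetical helper, and because it adds up every event that carries a duration, nested timings may be counted more than once.

```python
import json
from collections import defaultdict
from typing import Dict, Iterable


def duration_by_phase(lines: Iterable[str]) -> Dict[str, float]:
    """Sum durationMs per phase, slowest phase first."""
    totals: Dict[str, float] = defaultdict(int)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        event = json.loads(line)
        duration = event.get("durationMs")
        if isinstance(duration, (int, float)):
            totals[event.get("phase", "unknown")] += duration
    return dict(sorted(totals.items(), key=lambda item: item[1], reverse=True))


# Usage, with the trace path printed by the ingest output:
#   with open(".ktx/ingest-traces/<jobId>/trace.jsonl") as handle:
#       print(duration_by_phase(handle))
```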
**ktx** writes `debug` trace events by default. Set `KTX_INGEST_TRACE_LEVEL` to
`error`, `info`, `debug`, or `trace` before running ingest to change the trace
verbosity:
```bash
KTX_INGEST_TRACE_LEVEL=trace ktx ingest metabase
```
### Profiling a slow ingest
Each timed phase and work unit records a `durationMs` in the trace, and each
agent loop records its step count and token usage. To see where wall-clock time
went, enable profiling and **ktx** prints a rolled-up breakdown to stderr at the
end of the run. There are two ways to turn it on, and two output formats.

Turn it on per run with the `KTX_PROFILE_INGEST` environment variable, or
persistently with `ingest.profile` in `ktx.yaml` (useful for CI or while
iterating on a slow source):
```bash
KTX_PROFILE_INGEST=1 ktx ingest metabase # human-readable table
KTX_PROFILE_INGEST=json ktx ingest metabase # raw JSON for coding agents
```
```yaml
ingest:
  profile: true # human table; use "json" for the machine-readable form
```
Both formats report total wall time, time per phase, and the slowest work units,
splitting each work unit's agent-loop time into model time versus tool-execution
time. The `json` form emits the full structured profile (raw milliseconds and
token counts, stable keys) plus a `summary.headline` one-line diagnosis, so a
coding agent can parse it directly instead of scraping the table. If both the env
var and the config request profiling, `json` wins. Example headline:
```text
Slowest phase: reconciliation (2m 05s, 48% of wall time). 2 work units (1 failed), ~88% model generation vs ~12% tools.
```
Work units run serially by default (`ingest.workUnits.maxConcurrency` is `1`);
raise it in `ktx.yaml` if the profile shows the run is bound by serialized
work-unit agent loops. If the provider reports an LLM rate limit, **ktx** shows
a transient wait message and temporarily reduces effective work-unit concurrency
according to `ingest.rateLimit`.
## Common errors
| Error | Cause | Recovery |
|-------|-------|----------|
| Connection not configured | The connection id is not present in `ktx.yaml` | Add the connection with `ktx setup` or update `ktx.yaml` |
| Enrichment is not configured | Database ingest needs a model, embeddings, and scan-enrichment configuration | Run `ktx setup` to configure a model and embeddings |
| Query history is unsupported | The selected database driver does not support query history | Run ingest without query-history flags |
| Python runtime is missing | The selected ingest target needs runtime-backed SQL analysis or source parsing | Accept the interactive prompt, rerun with `--yes`, or run the suggested `ktx admin runtime install` command |
| Context-source options were ignored | Query-history flags were supplied for a context-source connection | Omit database-only flags when ingesting context-source connections |
| Text ingest stops early | `--fail-fast` was used and one item failed | Fix the failed item or rerun without `--fail-fast` to collect all failures |
| `--verbatim requires --text or --file` | `--verbatim` was passed without a document to store | Add `--text` or `--file`, or drop `--verbatim` |
| Inline verbatim text needs a leading heading | `--text --verbatim` content has no `# Heading` to derive a stable key | Add a leading Markdown heading, or pass the content as `--file <path>` |
| A different page already exists at key | A verbatim re-run targeted an existing key with a different body | Use a distinct document name/key, or remove the existing page first |
| Connection scope conflict | Frontmatter `connections` disagrees with `--connection-id` | Remove one so the intended scope is unambiguous |