ktx/docs-site/content/docs/integrations/context-sources.mdx

388 lines
14 KiB
Text
Raw Normal View History

---
title: Context Sources
description: Ingest semantic context from dbt, MetricFlow, LookML, Metabase, Looker, and Notion.
---
Context sources feed your existing analytics tooling into **ktx**. During ingestion, **ktx** extracts metadata from each source and uses a reconciliation agent to reconcile it with your existing semantic layer and knowledge base - preserving accepted edits rather than overwriting.
All context sources are configured in `ktx.yaml` under `connections` with their respective `driver` value.
## Ingestion workflow
feat: merge ingest and scan * docs: add CLI component reuse guidance * docs: add unified ingest ux design * Refine unified ingest UX design after adversarial review iteration 1 * Refine unified ingest UX design after adversarial review iteration 2 * Refine unified ingest UX design after adversarial review iteration 3 * feat(cli): route public connection ingest command * feat(cli): hide standalone scan from public help * feat(cli): plan public ingest depth and query history * feat(cli): execute public database ingest facets * feat(ingest): read connection query history config * fix(cli): use public ingest wording * fix(config): stop generating ingest adapter allow lists * docs: document public ingest command * test: align ingest surface expectations * docs: add unified ingest public CLI surface plan * feat(cli): preflight deep public ingest readiness * feat(setup): store query history in connection context * feat(setup): store database context depth * feat(setup): verify context readiness by database depth * fix(setup): keep context build foreground only * fix(config): reject reserved ingest connection ids * test: close unified ingest v1 expectations * docs: add unified ingest v1 closure plan * fix(ingest): bypass adapter allow-list for public source ingest * fix(ingest): honor query history window intent * fix(ingest): hide scan internals from public database ingest * feat(ingest): use foreground view for interactive public ingest * fix(setup): use schema context and query history wording * test(cli): verify unified ingest public output * docs: add unified ingest v1 public output closure plan * fix(setup): forward query history flags * fix(setup): prompt for postgres query history * fix(status): report query history readiness * fix(ingest): remove legacy public guidance * fix(ingest): polish foreground retry copy * docs(examples): use unified query history wording * chore(ingest): finish public query history cleanup * docs: add unified ingest v1 query history status cleanup plan * test(docs): cover unified ingest public docs * docs: align ingest CLI reference with unified UX * docs: update context build guides for unified ingest * docs: update setup and primary source ingest wording * docs: stop advertising adapter-backed example ingest * docs: close unified ingest public docs gaps * docs: add unified ingest v1 docs site closure plan * fix: render unified ingest foreground warnings * fix: explain query history schema order * fix: add public ingest retry guidance * fix: align setup next steps with unified ingest * fix: remove scan wording from demo progress * test: verify unified ingest ux closure * docs: add unified ingest v1 foreground and retry closure plan * fix(cli): preserve query-history pull config in public ingest * fix(cli): omit hidden commands from docs command tree * test(cli): close unified ingest final public surface checks * docs: add unified ingest v1 final public surface closure plan * fix(cli): use public source labels in ingest reports * fix(cli): suppress low-level public ingest output * test(cli): verify unified ingest public plain output * docs: add unified ingest v1 public plain output closure plan * fix(cli): add public ingest copy sanitizers * fix(cli): sanitize public ingest progress copy * fix(cli): rename setup schema scope prompt * docs(plan): add progress copy closure; test: align setup back-nav fixture Adds the iter9 plan and updates the setup back-navigation test fixture to pass disableQueryHistory plus listSchemas/listTables stubs that the unified ingest setup step now requires. * docs(plan): add final ux labels plan with narrowed label scans * fix(cli): aggregate unsupported query-history warnings * fix(cli): align setup database labels * test(cli): fix setup database test type-check * fix(cli): remove primary-source wording from setup output * test(cli): verify unified ingest setup closure * docs(plan): add unified ingest v1 verification copy closure plan * fix(cli): remove top-level scan command * fix(cli): remove legacy ingest and wiki commands * Merge scan into ingest flow * feat(cli): split ingest progress into per-phase rows, rename work units to tasks Each database target in the unified ingest dashboard now renders one row per real subprocess (Schema, then Query history when enabled) instead of a single combined bar. Each phase has its own monotonic 0-100% bar so the progress never snaps back to zero when historic-sql starts after scan completes. Completed phases keep their final bar, summary, and elapsed time visible as an inline audit trail; queued and skipped phases are shown explicitly. Also rename user-facing "work units" / "Failed work units" to "tasks" / "Failed tasks" in ingest output and parseIngestSummary. The parser still accepts the legacy "Work units:" wording in captured output for backward compat. Internal memory-flow event names and type fields are left alone. * Fix test harness failures * Fix CI smoke checks --------- Co-authored-by: Andrey Avtomonov <7889985+andreybavt@users.noreply.github.com>
2026-05-14 01:43:06 +02:00
Agents must configure and ingest context sources in this order:
1. Add the context source connection in `ktx.yaml` or with `ktx setup`.
2. Store tokens as `env:NAME` or `file:/path/to/secret`.
feat: merge ingest and scan * docs: add CLI component reuse guidance * docs: add unified ingest ux design * Refine unified ingest UX design after adversarial review iteration 1 * Refine unified ingest UX design after adversarial review iteration 2 * Refine unified ingest UX design after adversarial review iteration 3 * feat(cli): route public connection ingest command * feat(cli): hide standalone scan from public help * feat(cli): plan public ingest depth and query history * feat(cli): execute public database ingest facets * feat(ingest): read connection query history config * fix(cli): use public ingest wording * fix(config): stop generating ingest adapter allow lists * docs: document public ingest command * test: align ingest surface expectations * docs: add unified ingest public CLI surface plan * feat(cli): preflight deep public ingest readiness * feat(setup): store query history in connection context * feat(setup): store database context depth * feat(setup): verify context readiness by database depth * fix(setup): keep context build foreground only * fix(config): reject reserved ingest connection ids * test: close unified ingest v1 expectations * docs: add unified ingest v1 closure plan * fix(ingest): bypass adapter allow-list for public source ingest * fix(ingest): honor query history window intent * fix(ingest): hide scan internals from public database ingest * feat(ingest): use foreground view for interactive public ingest * fix(setup): use schema context and query history wording * test(cli): verify unified ingest public output * docs: add unified ingest v1 public output closure plan * fix(setup): forward query history flags * fix(setup): prompt for postgres query history * fix(status): report query history readiness * fix(ingest): remove legacy public guidance * fix(ingest): polish foreground retry copy * docs(examples): use unified query history wording * chore(ingest): finish public query history cleanup * docs: add unified ingest v1 query history status cleanup plan * test(docs): cover unified ingest public docs * docs: align ingest CLI reference with unified UX * docs: update context build guides for unified ingest * docs: update setup and primary source ingest wording * docs: stop advertising adapter-backed example ingest * docs: close unified ingest public docs gaps * docs: add unified ingest v1 docs site closure plan * fix: render unified ingest foreground warnings * fix: explain query history schema order * fix: add public ingest retry guidance * fix: align setup next steps with unified ingest * fix: remove scan wording from demo progress * test: verify unified ingest ux closure * docs: add unified ingest v1 foreground and retry closure plan * fix(cli): preserve query-history pull config in public ingest * fix(cli): omit hidden commands from docs command tree * test(cli): close unified ingest final public surface checks * docs: add unified ingest v1 final public surface closure plan * fix(cli): use public source labels in ingest reports * fix(cli): suppress low-level public ingest output * test(cli): verify unified ingest public plain output * docs: add unified ingest v1 public plain output closure plan * fix(cli): add public ingest copy sanitizers * fix(cli): sanitize public ingest progress copy * fix(cli): rename setup schema scope prompt * docs(plan): add progress copy closure; test: align setup back-nav fixture Adds the iter9 plan and updates the setup back-navigation test fixture to pass disableQueryHistory plus listSchemas/listTables stubs that the unified ingest setup step now requires. * docs(plan): add final ux labels plan with narrowed label scans * fix(cli): aggregate unsupported query-history warnings * fix(cli): align setup database labels * test(cli): fix setup database test type-check * fix(cli): remove primary-source wording from setup output * test(cli): verify unified ingest setup closure * docs(plan): add unified ingest v1 verification copy closure plan * fix(cli): remove top-level scan command * fix(cli): remove legacy ingest and wiki commands * Merge scan into ingest flow * feat(cli): split ingest progress into per-phase rows, rename work units to tasks Each database target in the unified ingest dashboard now renders one row per real subprocess (Schema, then Query history when enabled) instead of a single combined bar. Each phase has its own monotonic 0-100% bar so the progress never snaps back to zero when historic-sql starts after scan completes. Completed phases keep their final bar, summary, and elapsed time visible as an inline audit trail; queued and skipped phases are shown explicitly. Also rename user-facing "work units" / "Failed work units" to "tasks" / "Failed tasks" in ingest output and parseIngestSummary. The parser still accepts the legacy "Work units:" wording in captured output for backward compat. Internal memory-flow event names and type fields are left alone. * Fix test harness failures * Fix CI smoke checks --------- Co-authored-by: Andrey Avtomonov <7889985+andreybavt@users.noreply.github.com>
2026-05-14 01:43:06 +02:00
3. Run `ktx ingest <connectionId>` for one source or `ktx ingest --all` for
every configured source.
4. Review the foreground ingest output.
5. Review generated `semantic-layer/` YAML and `wiki/` Markdown files in git.
6. Validate changed semantic sources with `ktx sl validate`.
## Common source fields
Git repository fields are source-specific. dbt uses top-level `repo_url`,
LookML uses top-level `repoUrl`, and MetricFlow uses nested
`metricflow.repoUrl`.
| Field | Required | Description |
|-------|----------|-------------|
| `driver` | Yes | Source connector: `dbt`, `metricflow`, `lookml`, `metabase`, `looker`, or `notion` |
| `source_dir` | For local file sources | Absolute or project-relative source directory |
| `repo_url` | For Git-hosted dbt sources | Git repository URL |
| `repoUrl` | For Git-hosted LookML sources | Git repository URL |
| `metricflow.repoUrl` | For Git-hosted MetricFlow sources | Git repository URL |
| `branch` | No | Git branch to read |
| `path` | No | Subdirectory inside a monorepo |
| `auth_token_ref` | For private APIs/repos | `env:NAME` or `file:/path/to/secret` token reference |
## dbt
Ingests schema definitions, model descriptions, column metadata, and column test definitions from a dbt project.
### What it provides
- Model and source definitions from `schema.yml` files
- Column names, descriptions, and data types
- Column tests, mapped to semantic facts — `not_null` / `unique` become column constraints, `accepted_values` becomes enum value lists, and `relationships` becomes join / foreign-key edges
- Model and source tags, and source freshness settings
MetricFlow `semantic_models:` and `metrics:` are ingested through the separate [MetricFlow](#metricflow) source, not the dbt driver.
### Connection config
```yaml title="ktx.yaml"
connections:
my-dbt:
driver: dbt
source_dir: /path/to/dbt/project
```
For a Git-hosted project:
```yaml title="ktx.yaml"
connections:
my-dbt:
driver: dbt
repo_url: https://github.com/org/dbt-repo
branch: main
path: analytics/dbt # For monorepos
auth_token_ref: env:GITHUB_TOKEN
```
### Authentication
| Method | Config |
|--------|--------|
| Local path | `source_dir: /absolute/path/to/dbt/project` |
| Public repo | `repo_url: https://github.com/org/repo` |
| Private repo | `repo_url` + `auth_token_ref: env:GITHUB_TOKEN` |
**Optional fields:**
| Field | Description |
|-------|-------------|
| `profiles_path` | Path to `profiles.yml` (if non-standard location) |
| `target` | dbt target name (e.g., `dev`, `prod`) |
| `project_name` | Override auto-detected project name |
### What gets ingested
- **Semantic-layer overlays** (`semantic-layer/*.yaml`): descriptions, constraints, enum values, and joins from the dbt YAML are written onto the semantic source for the matching warehouse table. Overlays land on the warehouse connection that owns the table, which is usually a different connection than the dbt source itself.
- **Wiki pages** (`wiki/`): for definitions or relationships that don't map to a confirmed physical table.
- **Work units** for parallel processing: one per schema file under `models/` when the project has more than 25 YAML files, otherwise a single combined unit.
---
## MetricFlow
Ingests MetricFlow semantic models and metric definitions. Useful when your team defines metrics in MetricFlow's YAML format.
### What it provides
- Semantic model definitions (entities, dimensions, measures)
- Cross-model metric definitions
- Entity relationships between models, inferred from matching foreign and primary entities
### Connection config
```yaml title="ktx.yaml"
connections:
my-metricflow:
driver: metricflow
metricflow:
repoUrl: https://github.com/org/metricflow-repo
branch: main
path: dbt_metrics # Subdirectory for monorepos
auth_token_ref: env:GITHUB_TOKEN
```
For a local path:
```yaml
metricflow:
repoUrl: file:///absolute/path/to/project
```
### Authentication
| Method | Config |
|--------|--------|
| Public repo | `repoUrl: https://github.com/org/repo` |
| Private repo | `repoUrl` + `auth_token_ref: env:GITHUB_TOKEN` |
| Local path | `repoUrl: file:///path/to/project` |
### What gets ingested
- Semantic models with their entities, dimensions, measures, and the join edges inferred from entity relationships
- Metric definitions with their expressions and filters
- Work units organized by connected component (metrics + related semantic models grouped together)
---
## LookML
Ingests LookML view and model definitions from a Git repository. Extracts field definitions, SQL table references, and join relationships.
### What it provides
- View definitions (dimensions, measures, derived tables)
- Model explore definitions and joins
- SQL table name references
- Field-level descriptions and labels
### Connection config
```yaml title="ktx.yaml"
connections:
my-lookml:
driver: lookml
repoUrl: https://github.com/org/lookml-repo
branch: main
path: analytics # Subdirectory for monorepos
auth_token_ref: env:GITHUB_TOKEN
```
For a local path:
```yaml
repoUrl: file:///absolute/path/to/lookml
```
### Authentication
| Method | Config |
|--------|--------|
| Public repo | `repoUrl: https://github.com/org/repo` |
| Private repo | `repoUrl` + `auth_token_ref: env:GITHUB_TOKEN` |
| Local path | `repoUrl: file:///path/to/project` |
### What gets ingested
- One work unit per model, plus a unit for orphan views and one per dashboard
- Semantic-layer sources per view — overlays for thin `sql_table_name` wrappers, standalone sources for `derived_table` views
- Measures, joins (with their Looker `relationship:`), and field types mapped to column types (`yesno` → boolean, date/timestamp → time)
- Wiki pages for relationships and descriptions, with warehouse identifiers verified before writing
### Warehouse mapping
Optionally validate that LookML references match your expected Looker connection:
```yaml
mappings:
expectedLookerConnectionName: postgres_connection
```
This compares each model's `connection:` declaration against the expected name. Mismatched models are flagged, and semantic-layer writes are disabled for them during that ingest while wiki extraction still proceeds.
---
## Metabase
Ingests collections, questions, models, and metrics — with their underlying SQL — from a Metabase instance. Maps Metabase databases to your **ktx** warehouse connections.
### What it provides
- Collections and their hierarchy, used to organize ingested context
- Questions, models, and metrics — resolved SQL for both native and structured (MBQL) queries
- Each card's output schema: column types and primary/foreign-key hints
- Database-to-warehouse relationship mapping
### Connection config
```yaml title="ktx.yaml"
connections:
my-metabase:
driver: metabase
api_url: https://metabase.company.com
api_key_ref: env:METABASE_API_KEY
mappings:
databaseMappings:
"3": postgres-main # Metabase DB ID → ktx connection
syncEnabled:
"3": true
syncMode: ONLY # Only ingest mapped databases
```
### Authentication
| Method | Config |
|--------|--------|
| API key | `api_key_ref: env:METABASE_API_KEY` |
Generate an API key in Metabase: **Admin > Settings > Authentication > API Keys**.
### What gets ingested
- Semantic-layer sources generated from each card's resolved SQL and column metadata, written to the mapped warehouse connection
- Fallback wiki notes only when a referenced table can't be mapped or an identifier can't be verified
- One work unit per Metabase collection; re-syncs reprocess only collections with changed cards
### Warehouse mapping
Metabase databases must be mapped to **ktx** connections so ingested context links to the correct warehouse:
```yaml
mappings:
databaseMappings:
"<metabase_db_id>": "<ktx_connection_id>"
syncEnabled:
"<metabase_db_id>": true
syncMode: ONLY # ONLY = restrict to mapped DBs
```
2026-05-14 12:43:14 -04:00
Find Metabase database IDs in **Admin > Databases** - the ID is in the URL when editing a database.
---
## Looker
Ingests explores, looks, and dashboards from a Looker instance via the Looker API. Maps Looker connections to your **ktx** warehouse connections.
### What it provides
- Explore definitions and field metadata
- Dashboard and look configurations
- Query patterns and usage signals
- Looker folder structure for organization context
### Connection config
```yaml title="ktx.yaml"
connections:
my-looker:
driver: looker
base_url: https://looker.company.com
client_id: your-looker-client-id
client_secret_ref: env:LOOKER_CLIENT_SECRET
mappings:
connectionMappings:
postgres_connection: postgres-main # Looker conn → ktx conn
```
### Authentication
| Method | Config |
|--------|--------|
| OAuth client credentials | `client_id` + `client_secret_ref: env:LOOKER_CLIENT_SECRET` |
Generate API credentials in Looker: **Admin > Users > Edit > API Keys**.
### What gets ingested
- Semantic-layer sources from explore fields, written to the mapped warehouse connection (mapped explores only)
- Wiki pages capturing reusable metric, segment, and domain knowledge from dashboards and Looks
- Usage and recency signals that drive a triage gate, focusing processing on high-value content
- Work units per explore, per dashboard, and per Look
### Warehouse mapping
Map Looker connection names to **ktx** connections so explores link to the correct warehouse:
```yaml
mappings:
connectionMappings:
"<looker_connection_name>": "<ktx_connection_id>"
```
Find Looker connection names in **Admin > Database > Connections**.
---
## Notion
Ingests pages and databases from a Notion workspace as wiki pages. Useful for capturing business definitions, data dictionaries, and team documentation that agents need for context.
### What it provides
- Notion pages crawled from selected roots or all accessible content
- Page bodies and blocks normalized to Markdown
- Page hierarchy and cross-page links (child pages, mentions, relations)
- Notion databases and their data-source rows as individual pages
### Connection config
```yaml title="ktx.yaml"
connections:
my-notion:
driver: notion
auth_token_ref: env:NOTION_TOKEN
crawl_mode: selected_roots
root_page_ids:
- "abc123def456..."
```
For crawling all accessible pages:
```yaml title="ktx.yaml"
connections:
my-notion:
driver: notion
auth_token_ref: env:NOTION_TOKEN
crawl_mode: all_accessible
```
### Authentication
| Method | Config |
|--------|--------|
| Internal integration token | `auth_token_ref: env:NOTION_TOKEN` |
Create an integration at [notion.so/my-integrations](https://www.notion.so/my-integrations), then share target pages with the integration.
### Configuration options
| Field | Description | Default |
|-------|-------------|---------|
2026-05-14 12:43:14 -04:00
| `crawl_mode` | `all_accessible` or `selected_roots` | - |
| `root_page_ids` | Page IDs to crawl from (for `selected_roots`) | `[]` |
| `root_database_ids` | Database IDs to include | `[]` |
| `root_data_source_ids` | Data-source IDs to include (for `selected_roots`) | `[]` |
| `max_pages_per_run` | Pages processed per sync | `1000` |
| `max_knowledge_creates_per_run` | New pages created per sync | `25` |
| `max_knowledge_updates_per_run` | Pages updated per sync | `20` |
### What gets ingested
- Wiki pages synthesized from Notion content (not raw copies)
- Semantic-layer sources when a page defines a reusable dataset or metric mapped to a confirmed non-Notion target; otherwise the fact stays wiki-only
- Page-relevance triage that skips transient content (task lists, status updates, date-titled snapshots)
- Work units clustered by embedding similarity for efficient synthesis
### Notes
- Notion is wiki-first: it writes durable wiki pages by default and only emits semantic-layer sources for content mapped to a confirmed non-Notion target; unmapped facts stay wiki-only
- Rate limits apply; large workspaces may require multiple ingestion runs
- Incremental sync cursors are stored in `.ktx/db.sqlite`; don't add
`last_successful_cursor` to `ktx.yaml`
## Common errors
| Error or symptom | Likely cause | Recovery |
|------------------|--------------|----------|
| Connector cannot read source files | `source_dir`, `repo_url`, `repoUrl`, `metricflow.repoUrl`, `branch`, or `path` is wrong | Verify the path locally or clone the repo manually with the same credentials |
| Private repo/API authentication fails | Token env var or secret file is missing | Export the env var or update `auth_token_ref` to a readable file |
| Ingest creates duplicate context | Existing source names or wiki pages do not match imported terminology | Review the diff, rename duplicates, and add wiki pages with canonical names |
| Notion ingest skips pages | Integration lacks access or root ids are missing | Share pages with the Notion integration and set `root_page_ids` or use `all_accessible` carefully |
| Generated semantic sources fail validation | Tool metadata does not match the live warehouse schema | Map BI/source databases to primary warehouse connections and rerun validation |