---
title: Context Sources
description: Ingest semantic context from dbt, MetricFlow, LookML, Metabase, Looker, and Notion.
---
Context sources feed your existing analytics tooling into **ktx**. During ingestion, **ktx** extracts metadata from each source, then a reconciliation agent merges it with your existing semantic layer and knowledge base, preserving accepted edits rather than overwriting them.
All context sources are configured in `ktx.yaml` under `connections`, each with the `driver` value for its source type.
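
As an illustration, a single `ktx.yaml` might declare several context sources side by side. The connection IDs (`analytics-dbt`, `team-notion`) and the project path are placeholders, and the fields are the ones documented in the per-source sections of this page; treat this as a sketch rather than a complete configuration.

```yaml title="ktx.yaml"
connections:
  # Placeholder connection ID for a local dbt project
  analytics-dbt:
    driver: dbt
    source_dir: /path/to/dbt/project
  # Placeholder connection ID for a Notion workspace
  team-notion:
    driver: notion
    auth_token_ref: env:NOTION_TOKEN
    crawl_mode: all_accessible
```

The key under `connections` is assumed here to be the ID passed to `ktx ingest <connectionId>`.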
## Ingestion workflow
Agents must configure and ingest context sources in this order:
1. Add the context source connection in `ktx.yaml` or with `ktx setup`.
2. Store tokens as `env:NAME` or `file:/path/to/secret`.
3. Run `ktx ingest <connectionId>` for one source or `ktx ingest --all` for
every configured source.
4. Review the foreground ingest output.
5. Review generated `semantic-layer/` YAML and `wiki/` Markdown files in git.
6. Validate changed semantic sources with `ktx sl validate`.
## Common source fields
Git repository fields are source-specific. dbt uses top-level `repo_url`,
LookML uses top-level `repoUrl`, and MetricFlow uses nested
`metricflow.repoUrl`.
| Field | Required | Description |
|-------|----------|-------------|
| `driver` | Yes | Source connector: `dbt`, `metricflow`, `lookml`, `metabase`, `looker`, or `notion` |
| `source_dir` | For local file sources | Absolute or project-relative source directory |
| `repo_url` | For Git-hosted dbt sources | Git repository URL |
| `repoUrl` | For Git-hosted LookML sources | Git repository URL |
| `metricflow.repoUrl` | For Git-hosted MetricFlow sources | Git repository URL |
| `branch` | No | Git branch to read |
| `path` | No | Subdirectory inside a monorepo |
| `auth_token_ref` | For private APIs/repos | `env:NAME` or `file:/path/to/secret` token reference |
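
The two token reference forms can be sketched as follows. The environment variable name and the secret file path are placeholders, and the assumption is that a `file:` reference points at a file containing only the token.

```yaml title="ktx.yaml"
connections:
  my-dbt:
    driver: dbt
    repo_url: https://github.com/org/dbt-repo
    # Token read from an environment variable
    auth_token_ref: env:GITHUB_TOKEN
    # Alternative form: token read from a file (placeholder path)
    # auth_token_ref: file:/path/to/secret
```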
## dbt
Ingests schema definitions, model descriptions, column metadata, and test coverage from a dbt project.
### What it provides
- Model and source definitions from `schema.yml` files
- Column descriptions and types
- Test coverage signals
- Semantic model references (if using dbt semantic layer)
- Data lineage between models
### Connection config
```yaml title="ktx.yaml"
connections:
  my-dbt:
    driver: dbt
    source_dir: /path/to/dbt/project
```
For a Git-hosted project:
```yaml title="ktx.yaml"
connections:
  my-dbt:
    driver: dbt
    repo_url: https://github.com/org/dbt-repo
    branch: main
    path: analytics/dbt # For monorepos
    auth_token_ref: env:GITHUB_TOKEN
```
### Authentication
| Method | Config |
|--------|--------|
| Local path | `source_dir: /absolute/path/to/dbt/project` |
| Public repo | `repo_url: https://github.com/org/repo` |
| Private repo | `repo_url` + `auth_token_ref: env:GITHUB_TOKEN` |
**Optional fields:**
| Field | Description |
|-------|-------------|
| `profiles_path` | Path to `profiles.yml` (if non-standard location) |
| `target` | dbt target name (e.g., `dev`, `prod`) |
| `project_name` | Override auto-detected project name |
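
A sketch of a local dbt connection that sets all three optional fields, assuming they sit at the same level as `driver`. The `profiles.yml` path and the project name are placeholders.

```yaml title="ktx.yaml"
connections:
  my-dbt:
    driver: dbt
    source_dir: /path/to/dbt/project
    profiles_path: /path/to/profiles.yml # Placeholder path
    target: prod
    project_name: analytics # Placeholder name
```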
### What gets ingested
- YAML semantic sources generated from dbt schema files
- For projects with more than 25 YAML files, one work unit per semantic source; smaller projects are ingested all at once
- Column descriptions, tests, and relationships are preserved
---
## MetricFlow
Ingests MetricFlow semantic models and metric definitions. Useful when your team defines metrics in MetricFlow's YAML format.
### What it provides
- Semantic model definitions (entities, dimensions, measures)
- Cross-model metric definitions
- Dimension and entity relationships between models
### Connection config
```yaml title="ktx.yaml"
connections:
  my-metricflow:
    driver: metricflow
    metricflow:
      repoUrl: https://github.com/org/metricflow-repo
    branch: main
    path: dbt_metrics # Subdirectory for monorepos
    auth_token_ref: env:GITHUB_TOKEN
```
For a local path:
```yaml
metricflow:
  repoUrl: file:///absolute/path/to/project
```
### Authentication
| Method | Config |
|--------|--------|
| Public repo | `repoUrl: https://github.com/org/repo` |
| Private repo | `repoUrl` + `auth_token_ref: env:GITHUB_TOKEN` |
| Local path | `repoUrl: file:///path/to/project` |
### What gets ingested
- Semantic models with their entities, dimensions, and measures
- Metric definitions with their expressions and filters
- Work units organized by connected component (metrics + related semantic models grouped together)
---
## LookML
Ingests LookML view and model definitions from a Git repository. Extracts field definitions, SQL table references, and join relationships.
### What it provides
- View definitions (dimensions, measures, derived tables)
- Model explore definitions and joins
- SQL table name references
- Field-level descriptions and labels
### Connection config
```yaml title="ktx.yaml"
connections:
  my-lookml:
    driver: lookml
    repoUrl: https://github.com/org/lookml-repo
    branch: main
    path: analytics # Subdirectory for monorepos
    auth_token_ref: env:GITHUB_TOKEN
```
For a local path:
```yaml
repoUrl: file:///absolute/path/to/lookml
```
### Authentication
| Method | Config |
|--------|--------|
| Public repo | `repoUrl: https://github.com/org/repo` |
| Private repo | `repoUrl` + `auth_token_ref: env:GITHUB_TOKEN` |
| Local path | `repoUrl: file:///path/to/project` |
### What gets ingested
- View and model definitions organized by connected component
- LookML field types mapped to semantic layer column types
- Join definitions and relationship cardinalities
- SQL table references for warehouse mapping validation
### Warehouse mapping
Optionally validate that LookML references match your expected Looker connection:
```yaml
mappings:
  expectedLookerConnectionName: postgres_connection
```
This validates that LookML model `connection:` declarations match expectations, flagging mismatches during ingestion.
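
A sketch of where the `mappings` block might sit, assuming it belongs to the LookML connection in the same way the Metabase and Looker examples on this page place `mappings` alongside their other connection fields. `postgres_connection` is the placeholder Looker connection name from the snippet above.

```yaml title="ktx.yaml"
connections:
  my-lookml:
    driver: lookml
    repoUrl: https://github.com/org/lookml-repo
    mappings:
      # Assumed nesting; the value expected in LookML model `connection:` declarations
      expectedLookerConnectionName: postgres_connection
```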
---
## Metabase
Ingests dashboards, questions, and their underlying SQL queries from a Metabase instance. Maps Metabase databases to your **ktx** warehouse connections.
### What it provides
- Dashboard metadata and organization
- Question/query definitions (native SQL and structured queries)
- Table and column usage patterns from queries
- Database-to-warehouse relationship mapping
### Connection config
```yaml title="ktx.yaml"
connections:
  my-metabase:
    driver: metabase
    api_url: https://metabase.company.com
    api_key_ref: env:METABASE_API_KEY
    mappings:
      databaseMappings:
        "3": postgres-main # Metabase DB ID → ktx connection
      syncEnabled:
        "3": true
      syncMode: ONLY # Only ingest mapped databases
```
### Authentication
| Method | Config |
|--------|--------|
| API key | `api_key_ref: env:METABASE_API_KEY` |
Generate an API key in Metabase: **Admin > Settings > Authentication > API Keys**.
### What gets ingested
- Semantic sources generated from SQL queries in questions
- Wiki pages for dashboards (purpose, key metrics, relationships)
- Work units per dashboard and per question
### Warehouse mapping
Metabase databases must be mapped to **ktx** connections so ingested context links to the correct warehouse:
```yaml
mappings:
  databaseMappings:
    "<metabase_db_id>": "<ktx_connection_id>"
  syncEnabled:
    "<metabase_db_id>": true
  syncMode: ONLY # ONLY = restrict to mapped DBs
```
Find Metabase database IDs in **Admin > Databases**; the ID appears in the URL when you edit a database.
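
For instance, a Metabase instance with two databases could be mapped as sketched below. The database IDs `"3"` and `"7"` and the **ktx** connection IDs are placeholders; substitute the IDs from your own Metabase instance and `ktx.yaml`.

```yaml
mappings:
  databaseMappings:
    "3": postgres-main # Placeholder Metabase DB ID and ktx connection
    "7": snowflake-analytics # Placeholder Metabase DB ID and ktx connection
  syncEnabled:
    "3": true
    "7": true
  syncMode: ONLY # Restrict ingestion to the two mapped databases
```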
---
## Looker
Ingests explores, looks, and dashboards from a Looker instance via the Looker API. Maps Looker connections to your **ktx** warehouse connections.
### What it provides
- Explore definitions and field metadata
- Dashboard and look configurations
- Query patterns and usage signals
- Looker folder structure for organization context
### Connection config
```yaml title="ktx.yaml"
connections:
  my-looker:
    driver: looker
    base_url: https://looker.company.com
    client_id: your-looker-client-id
    client_secret_ref: env:LOOKER_CLIENT_SECRET
    mappings:
      connectionMappings:
        postgres_connection: postgres-main # Looker conn → ktx conn
```
### Authentication
| Method | Config |
|--------|--------|
| OAuth client credentials | `client_id` + `client_secret_ref: env:LOOKER_CLIENT_SECRET` |
Generate API credentials in Looker: **Admin > Users > Edit > API Keys**.
### What gets ingested
- Semantic sources from explore field definitions
- Wiki pages for dashboards (purpose, audience, key metrics)
- Triage signals for automated content classification
- Work units per explore and per dashboard
### Warehouse mapping
Map Looker connection names to **ktx** connections so explores link to the correct warehouse:
```yaml
mappings:
  connectionMappings:
    "<looker_connection_name>": "<ktx_connection_id>"
```
Find Looker connection names in **Admin > Database > Connections**.
---
## Notion
Ingests pages and databases from a Notion workspace as wiki pages. Useful for capturing business definitions, data dictionaries, and team documentation that agents need for context.
### What it provides
- Wiki pages synthesized from Notion content
- Page hierarchy and relationships
- Database schemas (when Notion databases describe primary sources)
- Semantic clustering for organized ingestion
### Connection config
```yaml title="ktx.yaml"
connections:
  my-notion:
    driver: notion
    auth_token_ref: env:NOTION_TOKEN
    crawl_mode: selected_roots
    root_page_ids:
      - "abc123def456..."
```
For crawling all accessible pages:
```yaml title="ktx.yaml"
connections:
  my-notion:
    driver: notion
    auth_token_ref: env:NOTION_TOKEN
    crawl_mode: all_accessible
```
### Authentication
| Method | Config |
|--------|--------|
| Internal integration token | `auth_token_ref: env:NOTION_TOKEN` |
Create an integration at [notion.so/my-integrations](https://www.notion.so/my-integrations), then share target pages with the integration.
### Configuration options
| Field | Description | Default |
|-------|-------------|---------|
| `crawl_mode` | `all_accessible` or `selected_roots` | - |
| `root_page_ids` | Page IDs to crawl from (for `selected_roots`) | `[]` |
| `root_database_ids` | Database IDs to include | `[]` |
| `max_pages_per_run` | Pages processed per sync | `1000` |
| `max_knowledge_creates_per_run` | New pages created per sync | `25` |
| `max_knowledge_updates_per_run` | Pages updated per sync | `20` |
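
A sketch of a Notion connection that sets these options, assuming they sit at the connection level next to `crawl_mode`. The page and database IDs are placeholders, and the per-run limits are arbitrary values below the defaults, chosen only to illustrate a cautious first run.

```yaml title="ktx.yaml"
connections:
  my-notion:
    driver: notion
    auth_token_ref: env:NOTION_TOKEN
    crawl_mode: selected_roots
    root_page_ids:
      - "<notion_page_id>" # Placeholder
    root_database_ids:
      - "<notion_database_id>" # Placeholder
    max_pages_per_run: 200 # Illustrative; default is 1000
    max_knowledge_creates_per_run: 10 # Illustrative; default is 25
    max_knowledge_updates_per_run: 10 # Illustrative; default is 20
```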
### What gets ingested
- Wiki pages synthesized from Notion content (not raw copies)
- Domain context extracted and organized by topic
- Triage signals for classifying page relevance
- Work units clustered by semantic similarity for efficient processing
### Notes
- Notion is knowledge-only: it does not produce semantic layer sources
- Rate limits apply; large workspaces may require multiple ingestion runs
- Incremental sync cursors are stored in `.ktx/db.sqlite`; don't add
`last_successful_cursor` to `ktx.yaml`
## Common errors
| Error or symptom | Likely cause | Recovery |
|------------------|--------------|----------|
| Connector cannot read source files | `source_dir`, `repo_url`, `repoUrl`, `metricflow.repoUrl`, `branch`, or `path` is wrong | Verify the path locally or clone the repo manually with the same credentials |
| Private repo/API authentication fails | Token env var or secret file is missing | Export the env var or update `auth_token_ref` to a readable file |
| Ingest creates duplicate context | Existing source names or wiki pages do not match imported terminology | Review the diff, rename duplicates, and add wiki pages with canonical names |
| Notion ingest skips pages | Integration lacks access or root IDs are missing | Share pages with the Notion integration and set `root_page_ids`, or use `all_accessible` carefully |
| Generated semantic sources fail validation | Tool metadata does not match the live warehouse schema | Map BI/source databases to primary warehouse connections and rerun validation |