mirror of
https://github.com/Kaelio/ktx.git
synced 2026-06-07 07:55:13 +02:00
---
title: Context Sources
description: Ingest semantic context from dbt, MetricFlow, LookML, Metabase, Looker, and Notion.
---

Context sources feed your existing analytics tooling into **ktx**. During ingestion, **ktx** extracts metadata from each source, and a reconciliation agent aligns it with your existing semantic layer and knowledge base, preserving accepted edits rather than overwriting them.

All context sources are configured in `ktx.yaml` under `connections`, each with its own `driver` value.

## Ingestion workflow

Agents must configure and ingest context sources in this order:

1. Add the context source connection in `ktx.yaml` or with `ktx setup`.
2. Store tokens as `env:NAME` or `file:/path/to/secret`.
3. Run `ktx ingest <connectionId>` for one source or `ktx ingest --all` for
   every configured source.
4. Review the foreground ingest output.
5. Review generated `semantic-layer/` YAML and `wiki/` Markdown files in git.
6. Validate changed semantic sources with `ktx sl validate`.

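
The command-line part of the workflow above (steps 3 and 6) can be sketched as a small planner. This is a minimal illustration, not part of **ktx**: `ingest_plan` is a hypothetical helper that only builds argument lists from the commands named in the steps and does not execute them.

```python
from typing import List, Optional


def ingest_plan(connection_ids: Optional[List[str]] = None) -> List[List[str]]:
    """Return the CLI invocations for workflow steps 3 and 6, in order.

    With no IDs, plan `ktx ingest --all`; otherwise plan one
    `ktx ingest <connectionId>` call per ID.
    """
    if connection_ids:
        commands = [["ktx", "ingest", cid] for cid in connection_ids]
    else:
        commands = [["ktx", "ingest", "--all"]]
    # Validation comes last, after the generated files have been reviewed.
    commands.append(["ktx", "sl", "validate"])
    return commands


if __name__ == "__main__":
    for argv in ingest_plan(["my-dbt", "my-lookml"]):
        print(" ".join(argv))
```

Steps 4 and 5 are manual reviews, so they are deliberately left out of the plan.
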
## Common source fields

Git repository fields are source-specific. dbt uses top-level `repo_url`,
LookML uses top-level `repoUrl`, and MetricFlow uses nested
`metricflow.repoUrl`.

| Field | Required | Description |
|-------|----------|-------------|
| `driver` | Yes | Source connector: `dbt`, `metricflow`, `lookml`, `metabase`, `looker`, or `notion` |
| `source_dir` | For local file sources | Absolute or project-relative source directory |
| `repo_url` | For Git-hosted dbt sources | Git repository URL |
| `repoUrl` | For Git-hosted LookML sources | Git repository URL |
| `metricflow.repoUrl` | For Git-hosted MetricFlow sources | Git repository URL |
| `branch` | No | Git branch to read |
| `path` | No | Subdirectory inside a monorepo |
| `auth_token_ref` | For private APIs/repos | `env:NAME` or `file:/path/to/secret` token reference |

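
Because the repository URL lives in a different place for each Git-backed driver, a quick lookup can catch a misnamed key before ingestion. The sketch below is illustrative only: `REPO_URL_PATHS` and `repo_url_for` are hypothetical names, and the connection is assumed to be an already-parsed `ktx.yaml` entry represented as a plain dictionary.

```python
from typing import Any, Dict, Optional

# Where each Git-backed driver keeps its repository URL, per the text above.
REPO_URL_PATHS = {
    "dbt": ("repo_url",),
    "lookml": ("repoUrl",),
    "metricflow": ("metricflow", "repoUrl"),
}


def repo_url_for(connection: Dict[str, Any]) -> Optional[str]:
    """Look up the repository URL at the location the connection's driver uses."""
    path = REPO_URL_PATHS.get(connection.get("driver", ""))
    if path is None:
        return None  # API-based drivers such as metabase have no repo URL field
    node: Any = connection
    for key in path:
        if not isinstance(node, dict) or key not in node:
            return None  # key missing, e.g. repoUrl written where repo_url is expected
        node = node[key]
    return node


if __name__ == "__main__":
    print(repo_url_for({"driver": "dbt", "repo_url": "https://github.com/org/dbt-repo"}))
```

A `None` result for a Git-hosted source suggests the URL was written under the wrong key for that driver.
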
## dbt

Ingests schema definitions, model descriptions, column metadata, and test coverage from a dbt project.

### What it provides

- Model and source definitions from `schema.yml` files
- Column descriptions and types
- Test coverage signals
- Semantic model references (if using dbt semantic layer)
- Data lineage between models

### Connection config

```yaml title="ktx.yaml"
connections:
  my-dbt:
    driver: dbt
    source_dir: /path/to/dbt/project
```

For a Git-hosted project:

```yaml title="ktx.yaml"
connections:
  my-dbt:
    driver: dbt
    repo_url: https://github.com/org/dbt-repo
    branch: main
    path: analytics/dbt  # For monorepos
    auth_token_ref: env:GITHUB_TOKEN
```

### Authentication

| Method | Config |
|--------|--------|
| Local path | `source_dir: /absolute/path/to/dbt/project` |
| Public repo | `repo_url: https://github.com/org/repo` |
| Private repo | `repo_url` + `auth_token_ref: env:GITHUB_TOKEN` |

**Optional fields:**

| Field | Description |
|-------|-------------|
| `profiles_path` | Path to `profiles.yml` (if non-standard location) |
| `target` | dbt target name (e.g., `dev`, `prod`) |
| `project_name` | Override auto-detected project name |

### What gets ingested

- YAML semantic sources generated from dbt schema files
- One work unit per semantic source for projects with more than 25 YAML files; smaller projects are ingested all at once
- Column descriptions, tests, and relationships are preserved

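
The work-unit rule in the list above can be sketched as follows. How **ktx** batches work internally is not documented here, so `plan_work_units` and `WORK_UNIT_THRESHOLD` are hypothetical names that only illustrate the stated threshold.

```python
from typing import List

WORK_UNIT_THRESHOLD = 25  # taken from the "more than 25 YAML files" rule above


def plan_work_units(sources: List[str], yaml_file_count: int) -> List[List[str]]:
    """Group semantic sources into work units according to project size."""
    if yaml_file_count > WORK_UNIT_THRESHOLD:
        # Large project: one work unit per semantic source.
        return [[source] for source in sources]
    # Small project: everything is handled in a single work unit.
    return [list(sources)] if sources else []


if __name__ == "__main__":
    print(plan_work_units(["orders", "customers"], yaml_file_count=40))
    print(plan_work_units(["orders", "customers"], yaml_file_count=10))
```
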

---

## MetricFlow

Ingests MetricFlow semantic models and metric definitions. Useful when your team defines metrics in MetricFlow's YAML format.

### What it provides

- Semantic model definitions (entities, dimensions, measures)
- Cross-model metric definitions
- Dimension and entity relationships between models

### Connection config

```yaml title="ktx.yaml"
connections:
  my-metricflow:
    driver: metricflow
    metricflow:
      repoUrl: https://github.com/org/metricflow-repo
    branch: main
    path: dbt_metrics  # Subdirectory for monorepos
    auth_token_ref: env:GITHUB_TOKEN
```

For a local path:

```yaml
metricflow:
  repoUrl: file:///absolute/path/to/project
```

### Authentication

| Method | Config |
|--------|--------|
| Public repo | `repoUrl: https://github.com/org/repo` |
| Private repo | `repoUrl` + `auth_token_ref: env:GITHUB_TOKEN` |
| Local path | `repoUrl: file:///path/to/project` |

### What gets ingested

- Semantic models with their entities, dimensions, and measures
- Metric definitions with their expressions and filters
- Work units organized by connected component (metrics + related semantic models grouped together)

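
Grouping by connected component, as in the last bullet above, means that any metric and semantic model linked directly or indirectly end up in the same work unit. The sketch below shows the general graph technique with a depth-first traversal; the node names and the `connected_components` helper are hypothetical, and the actual grouping performed by **ktx** may differ.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def connected_components(
    nodes: Iterable[str], edges: Iterable[Tuple[str, str]]
) -> List[List[str]]:
    """Group nodes so that every pair joined by a chain of edges shares a group."""
    adjacency: Dict[str, Set[str]] = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen: Set[str] = set()
    components: List[List[str]] = []
    # Sorting gives a deterministic order for the resulting groups.
    for start in sorted(set(nodes) | set(adjacency)):
        if start in seen:
            continue
        stack, component = [start], []
        seen.add(start)
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        components.append(sorted(component))
    return components


if __name__ == "__main__":
    nodes = ["metric:revenue", "model:orders", "model:customers",
             "metric:signups", "model:users"]
    edges = [("metric:revenue", "model:orders"),
             ("model:orders", "model:customers"),
             ("metric:signups", "model:users")]
    for unit in connected_components(nodes, edges):
        print(unit)
```
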

---

## LookML

Ingests LookML view and model definitions from a Git repository. Extracts field definitions, SQL table references, and join relationships.

### What it provides

- View definitions (dimensions, measures, derived tables)
- Model explore definitions and joins
- SQL table name references
- Field-level descriptions and labels

### Connection config

```yaml title="ktx.yaml"
connections:
  my-lookml:
    driver: lookml
    repoUrl: https://github.com/org/lookml-repo
    branch: main
    path: analytics  # Subdirectory for monorepos
    auth_token_ref: env:GITHUB_TOKEN
```

For a local path:

```yaml
repoUrl: file:///absolute/path/to/lookml
```

### Authentication

| Method | Config |
|--------|--------|
| Public repo | `repoUrl: https://github.com/org/repo` |
| Private repo | `repoUrl` + `auth_token_ref: env:GITHUB_TOKEN` |
| Local path | `repoUrl: file:///path/to/project` |

### What gets ingested

- View and model definitions organized by connected component
- LookML field types mapped to semantic layer column types
- Join definitions and relationship cardinalities
- SQL table references for warehouse mapping validation

### Warehouse mapping

Optionally validate that LookML references match your expected Looker connection:

```yaml
mappings:
  expectedLookerConnectionName: postgres_connection
```

This validates that LookML model `connection:` declarations match expectations, flagging mismatches during ingestion.

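
To preview which model files would be flagged, the same comparison can be approximated locally. This is a rough sketch under stated assumptions: `connection_mismatches` is a hypothetical helper, model files are passed in as a name-to-text dictionary, and a simple regular expression stands in for a real LookML parser.

```python
import re
from typing import Dict, List

# Matches a LookML `connection: "name"` declaration; quotes are treated as optional.
CONNECTION_RE = re.compile(r'^\s*connection:\s*"?([^"\s]+)"?', re.MULTILINE)


def connection_mismatches(model_files: Dict[str, str], expected: str) -> List[str]:
    """Return the names of model files whose declared connection differs from `expected`."""
    mismatched = []
    for name, text in sorted(model_files.items()):
        declared = CONNECTION_RE.findall(text)
        # Files without any connection declaration are not flagged.
        if any(value != expected for value in declared):
            mismatched.append(name)
    return mismatched


if __name__ == "__main__":
    models = {
        "sales.model.lkml": 'connection: "postgres_connection"\ninclude: "*.view.lkml"\n',
        "ops.model.lkml": 'connection: "snowflake_prod"\n',
    }
    print(connection_mismatches(models, "postgres_connection"))
```
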

---

## Metabase

Ingests dashboards, questions, and their underlying SQL queries from a Metabase instance. Maps Metabase databases to your **ktx** warehouse connections.

### What it provides

- Dashboard metadata and organization
- Question/query definitions (native SQL and structured queries)
- Table and column usage patterns from queries
- Database-to-warehouse relationship mapping

### Connection config

```yaml title="ktx.yaml"
connections:
  my-metabase:
    driver: metabase
    api_url: https://metabase.company.com
    api_key_ref: env:METABASE_API_KEY
    mappings:
      databaseMappings:
        "3": postgres-main  # Metabase DB ID → ktx connection
      syncEnabled:
        "3": true
      syncMode: ONLY  # Only ingest mapped databases
```

### Authentication

| Method | Config |
|--------|--------|
| API key | `api_key_ref: env:METABASE_API_KEY` |

Generate an API key in Metabase: **Admin > Settings > Authentication > API Keys**.

### What gets ingested

- Semantic sources generated from SQL queries in questions
- Wiki pages for dashboards (purpose, key metrics, relationships)
- Work units per dashboard and per question

### Warehouse mapping

Metabase databases must be mapped to **ktx** connections so ingested context links to the correct warehouse:

```yaml
mappings:
  databaseMappings:
    "<metabase_db_id>": "<ktx_connection_id>"
  syncEnabled:
    "<metabase_db_id>": true
  syncMode: ONLY  # ONLY = restrict to mapped DBs
```

Find Metabase database IDs in **Admin > Databases**; the ID appears in the URL when you edit a database.

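
A mapping block like the one above is easy to get subtly wrong, for example by enabling sync for a database ID that has no mapping. The following sketch shows one way to lint it before ingesting. It assumes the `mappings` block has been parsed into a dictionary; `mapping_problems` is a hypothetical helper, and the checks **ktx** itself performs may differ.

```python
from typing import Any, Dict, List


def mapping_problems(mappings: Dict[str, Any], ktx_connection_ids: List[str]) -> List[str]:
    """Report mapped targets that do not exist and sync flags without a mapping."""
    problems = []
    db_map = mappings.get("databaseMappings", {})
    for db_id, target in sorted(db_map.items()):
        if target not in ktx_connection_ids:
            problems.append(f"database {db_id} maps to unknown connection {target!r}")
    for db_id, enabled in sorted(mappings.get("syncEnabled", {}).items()):
        if enabled and db_id not in db_map:
            problems.append(f"database {db_id} has sync enabled but no mapping")
    return problems


if __name__ == "__main__":
    mappings = {
        "databaseMappings": {"3": "postgres-main"},
        "syncEnabled": {"3": True, "7": True},
        "syncMode": "ONLY",
    }
    for problem in mapping_problems(mappings, ["postgres-main"]):
        print(problem)
```
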

---

## Looker

Ingests explores, looks, and dashboards from a Looker instance via the Looker API. Maps Looker connections to your **ktx** warehouse connections.

### What it provides

- Explore definitions and field metadata
- Dashboard and look configurations
- Query patterns and usage signals
- Looker folder structure for organization context

### Connection config

```yaml title="ktx.yaml"
connections:
  my-looker:
    driver: looker
    base_url: https://looker.company.com
    client_id: your-looker-client-id
    client_secret_ref: env:LOOKER_CLIENT_SECRET
    mappings:
      connectionMappings:
        postgres_connection: postgres-main  # Looker conn → ktx conn
```

### Authentication

| Method | Config |
|--------|--------|
| OAuth client credentials | `client_id` + `client_secret_ref: env:LOOKER_CLIENT_SECRET` |

Generate API credentials in Looker: **Admin > Users > Edit > API Keys**.

### What gets ingested

- Semantic sources from explore field definitions
- Wiki pages for dashboards (purpose, audience, key metrics)
- Triage signals for automated content classification
- Work units per explore and per dashboard

### Warehouse mapping

Map Looker connection names to **ktx** connections so explores link to the correct warehouse:

```yaml
mappings:
  connectionMappings:
    "<looker_connection_name>": "<ktx_connection_id>"
```

Find Looker connection names in **Admin > Database > Connections**.


---

## Notion

Ingests pages and databases from a Notion workspace as wiki pages. Useful for capturing business definitions, data dictionaries, and team documentation that agents need for context.

### What it provides

- Wiki pages synthesized from Notion content
- Page hierarchy and relationships
- Database schemas (when Notion databases describe primary sources)
- Semantic clustering for organized ingestion

### Connection config

```yaml title="ktx.yaml"
connections:
  my-notion:
    driver: notion
    auth_token_ref: env:NOTION_TOKEN
    crawl_mode: selected_roots
    root_page_ids:
      - "abc123def456..."
```

For crawling all accessible pages:

```yaml title="ktx.yaml"
connections:
  my-notion:
    driver: notion
    auth_token_ref: env:NOTION_TOKEN
    crawl_mode: all_accessible
```

### Authentication

| Method | Config |
|--------|--------|
| Internal integration token | `auth_token_ref: env:NOTION_TOKEN` |

Create an integration at [notion.so/my-integrations](https://www.notion.so/my-integrations), then share target pages with the integration.

### Configuration options

| Field | Description | Default |
|-------|-------------|---------|
| `crawl_mode` | `all_accessible` or `selected_roots` | - |
| `root_page_ids` | Page IDs to crawl from (for `selected_roots`) | `[]` |
| `root_database_ids` | Database IDs to include | `[]` |
| `max_pages_per_run` | Pages processed per sync | `1000` |
| `max_knowledge_creates_per_run` | New pages created per sync | `25` |
| `max_knowledge_updates_per_run` | Pages updated per sync | `20` |

### What gets ingested

- Wiki pages synthesized from Notion content (not raw copies)
- Domain context extracted and organized by topic
- Triage signals for classifying page relevance
- Work units clustered by semantic similarity for efficient processing

### Notes

- Notion is knowledge-only: it does not produce semantic layer sources
- Rate limits apply; large workspaces may require multiple ingestion runs
- Incremental sync cursors are stored in `.ktx/db.sqlite`; don't add
  `last_successful_cursor` to `ktx.yaml`

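
For a rough sense of how many ingestion runs a large workspace needs, the page count can be divided by `max_pages_per_run`. The sketch below is a back-of-the-envelope estimate only: `runs_needed` is a hypothetical helper, and it assumes the page limit is the sole constraint, ignoring rate limits and the per-run create and update caps.

```python
import math


def runs_needed(total_pages: int, max_pages_per_run: int = 1000) -> int:
    """Estimate ingestion runs from the page limit alone (default taken from the table)."""
    if max_pages_per_run <= 0:
        raise ValueError("max_pages_per_run must be positive")
    return math.ceil(total_pages / max_pages_per_run)


if __name__ == "__main__":
    # 2,500 pages at 1,000 pages per run need three runs.
    print(runs_needed(2500))
```
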
## Common errors

| Error or symptom | Likely cause | Recovery |
|------------------|--------------|----------|
| Connector cannot read source files | `source_dir`, `repo_url`, `repoUrl`, `metricflow.repoUrl`, `branch`, or `path` is wrong | Verify the path locally or clone the repo manually with the same credentials |
| Private repo/API authentication fails | Token env var or secret file is missing | Export the env var or update `auth_token_ref` to a readable file |
| Ingest creates duplicate context | Existing source names or wiki pages do not match imported terminology | Review the diff, rename duplicates, and add wiki pages with canonical names |
| Notion ingest skips pages | Integration lacks access or root IDs are missing | Share pages with the Notion integration and set `root_page_ids` or use `all_accessible` carefully |
| Generated semantic sources fail validation | Tool metadata does not match the live warehouse schema | Map BI/source databases to primary warehouse connections and rerun validation |

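
Authentication failures caused by a missing token can be caught before running an ingest by resolving each reference yourself. The sketch below handles the two documented forms, `env:NAME` and `file:/path/to/secret`. `resolve_secret_ref` is a hypothetical helper and not the resolver **ktx** uses; stripping surrounding whitespace from file contents is an assumption.

```python
import os
from pathlib import Path


def resolve_secret_ref(ref: str) -> str:
    """Resolve an `env:NAME` or `file:/path/to/secret` reference to its value."""
    scheme, _, target = ref.partition(":")
    if scheme == "env":
        value = os.environ.get(target)
        if not value:
            raise LookupError(f"environment variable {target} is not set")
        return value
    if scheme == "file":
        path = Path(target)
        if not path.is_file():
            raise LookupError(f"secret file {target} does not exist")
        # Assumption: trailing newlines in the secret file are not part of the token.
        return path.read_text().strip()
    raise ValueError(f"unsupported secret reference: {ref!r}")


if __name__ == "__main__":
    os.environ["EXAMPLE_TOKEN"] = "example-value"
    print(resolve_secret_ref("env:EXAMPLE_TOKEN"))
```

A `LookupError` here corresponds to the "Token env var or secret file is missing" row in the table above.
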