Merge origin/main into merge-scan-into-ingest-v1

This commit is contained in:
Andrey Avtomonov 2026-05-14 01:40:11 +02:00
commit e501d1d81c
28 changed files with 432 additions and 71 deletions

View file

@ -19,13 +19,19 @@ Agents must configure and ingest context sources in this order:
5. Review generated `semantic-layer/` YAML and `wiki/` Markdown files in git.
6. Validate changed semantic sources with `ktx sl validate`.
## Shared source fields
## Common source fields
Git repository fields are source-specific. dbt uses top-level `repo_url`,
LookML uses top-level `repoUrl`, and MetricFlow uses nested
`metricflow.repoUrl`.
| Field | Required | Description |
|-------|----------|-------------|
| `driver` | Yes | Source adapter: `dbt`, `metricflow`, `lookml`, `metabase`, `looker`, or `notion` |
| `source_dir` | For local file sources | Absolute or project-relative source directory |
| `repo_url` | For Git-hosted sources | Git repository URL |
| `repo_url` | For Git-hosted dbt sources | Git repository URL |
| `repoUrl` | For Git-hosted LookML sources | Git repository URL |
| `metricflow.repoUrl` | For Git-hosted MetricFlow sources | Git repository URL |
| `branch` | No | Git branch to read |
| `path` | No | Subdirectory inside a monorepo |
| `auth_token_ref` | For private APIs/repos | `env:NAME` or `file:/path/to/secret` token reference |
@ -351,7 +357,7 @@ Create an integration at [notion.so/my-integrations](https://www.notion.so/my-in
| `root_page_ids` | Page IDs to crawl from (for `selected_roots`) | `[]` |
| `root_database_ids` | Database IDs to include | `[]` |
| `max_pages_per_run` | Pages processed per sync | `1000` |
| `max_knowledge_creates_per_run` | New pages created per sync | `5` |
| `max_knowledge_creates_per_run` | New pages created per sync | `25` |
| `max_knowledge_updates_per_run` | Pages updated per sync | `20` |
### What gets ingested
@ -365,13 +371,14 @@ Create an integration at [notion.so/my-integrations](https://www.notion.so/my-in
- Notion is knowledge-only — it does not produce semantic layer sources
- Rate limits apply; large workspaces may require multiple ingestion runs
- `last_successful_cursor` is auto-managed for incremental sync
- Incremental sync cursors are stored in `.ktx/db.sqlite`; don't add
`last_successful_cursor` to `ktx.yaml`
## Common errors
| Error or symptom | Likely cause | Recovery |
|------------------|--------------|----------|
| Adapter cannot read source files | `source_dir`, `repo_url`, `branch`, or `path` is wrong | Verify the path locally or clone the repo manually with the same credentials |
| Adapter cannot read source files | `source_dir`, `repo_url`, `repoUrl`, `metricflow.repoUrl`, `branch`, or `path` is wrong | Verify the path locally or clone the repo manually with the same credentials |
| Private repo/API authentication fails | Token env var or secret file is missing | Export the env var or update `auth_token_ref` to a readable file |
| Ingest creates duplicate context | Existing source names or wiki pages do not match imported terminology | Review the diff, rename duplicates, and add wiki pages with canonical names |
| Notion ingest skips pages | Integration lacks access or root ids are missing | Share pages with the Notion integration and set `root_page_ids` or use `all_accessible` carefully |

View file

@ -27,6 +27,9 @@ Agents should prefer environment or file references over literal secrets.
| `schema` or `schemas` | No | schema-aware warehouses | Single schema or list of schemas to scan |
| `context.queryHistory` | No | PostgreSQL, Snowflake, BigQuery | Enables query-history ingestion when the warehouse supports it |
| `path` | Yes for path-style SQLite | SQLite | Local SQLite database path or `env:NAME` reference |
| `max_bytes_billed` | No | BigQuery | Maximum bytes billed per query job |
| `job_timeout_ms` | No | BigQuery | BigQuery query job timeout in milliseconds |
| `project_id` | No | BigQuery | Optional local descriptor and mapping metadata; not used for BigQuery authentication |
## PostgreSQL
@ -216,6 +219,9 @@ For multiple datasets:
| Environment variable | `credentials_json: env:BIGQUERY_CREDENTIALS_JSON` |
The project ID is extracted automatically from the service account JSON file.
If you set `project_id` in `ktx.yaml`, KTX treats it as local descriptor and
mapping metadata. The BigQuery connector still authenticates with the
`project_id` inside `credentials_json`.
### Features
@ -254,7 +260,7 @@ staged artifact shape as Postgres and Snowflake.
- Parameter binding uses named `@param` syntax
- Arrays flattened to comma-separated strings in results
- Location specified at query execution time
- Supports `maxBytesBilled` and `jobTimeoutMs` limits
- Supports `max_bytes_billed` and `job_timeout_ms` limits from `ktx.yaml`
---