apunkt/ktx

mirror of https://github.com/Kaelio/ktx.git synced 2026-06-16 08:25:14 +02:00

Andrey Avtomonov b759a4a286

* docs(specs): design research-agent MCP tools and ktx mcp daemon

Adds the 2026-05-14 design spec for exposing four new MCP tools
(discover_data, entity_details, dictionary_search, sql_execution),
shipping a ktx-research skill, and introducing an HTTP-only ktx mcp
daemon so external agents can use KTX as a research-capable context
layer.

* Refine research-agent MCP tools spec after adversarial review iteration 1

* Refine research-agent MCP tools spec after adversarial review iteration 2

* Refine research-agent MCP tools spec after adversarial review iteration 3

* Refine spec: drop connectionName compat carve-out and ground summary/snippet provenance per kind

* feat(daemon): validate read-only SQL with sqlglot

* feat(context): expose read-only SQL validation port

* feat(context): register MCP sql execution tool

* feat(context): execute MCP SQL through validated connector path

* test(context): update SQL analysis port fixtures

* docs: add research-agent MCP sql execution foundation plan

* feat(context): add scan-backed entity details service

* feat(context): register MCP entity details tool

* feat(context): expose local MCP entity details

* test(context): align entity details scan fixtures

* docs: add research-agent MCP entity_details plan

* feat(context): add dictionary search service

* feat(context): register MCP dictionary search tool

* feat(context): expose local MCP dictionary search

* docs: add research-agent MCP dictionary_search plan

* feat: add MCP discover data service

* feat: expose discover data MCP tool

* feat: wire local discover data MCP port

* docs: add research-agent MCP discover_data plan

* feat(cli): add mcp http security helpers

* feat(cli): host mcp over streamable http

* feat(cli): manage mcp daemon lifecycle

* feat(cli): add ktx mcp commands

* fix(cli): stabilize mcp daemon verification

* docs: add research-agent MCP http daemon plan

* feat(cli): install KTX research skill

* feat(cli): configure MCP clients in setup agents

* feat(cli): support Claude local MCP setup scope

* docs: add research-agent MCP setup-agents plan

* refactor(context): use connectionId in warehouse verification tools

* docs(context): update ingest verification prompts for connectionId

* docs: add research-agent MCP ingest contract convergence plan

* chore: build runtime artifacts in conductor setup

---------

Co-authored-by: Andrey Avtomonov <7889985+andreybavt@users.noreply.github.com>

2026-05-15 02:35:09 +02:00

6.6 KiB

Raw Permalink Blame History

name

description

callers

dbt_ingest

Map dbt `schema.yml` / `properties.yml` models and sources into KTX semantic-layer overlays and column notes. Covers `sources:` vs `models:`, column `data_tests` (not_null, unique, accepted_values, relationships), and how bundle-time writes complement manifest backfill from git sync. Load when the WorkUnit's `skillNames` includes `dbt_ingest` or when raw files are dbt YAML under `models/` / `sources/`.

memory_agent

dbt → KTX (bundle ingest)

Use this skill for uploaded dbt projects (dbt_project.yml at stage root, models/**, sources/**, schema.yml). There is no fetch() in v1 - scheduled dbt parse / manifest.json pulls are out of scope; host-provided dbt sync may still backfill structured test metadata into _schema on the next sync.

Mapping (models / sources → SL)

dbt	KTX	Notes
`models:` entry with `columns:`	Overlay on the manifest table with the same name (after `discover_data` / `entity_details`)	One SL source per physical table; model name may differ from DB name - resolve with `read_raw_file` + warehouse context.
`sources:` → `tables:`	Same as models; use `identifier` when present instead of logical `name`.	Schema + name must match how the connection sees tables.
Column `description`	`column_overrides[].descriptions.user` on the overlay	Do not overwrite `dbt` description keys from sync.
`data_tests: not_null` / `unique`	Short hint in column `descriptions` or notes: “dbt: not null”, “dbt: unique”	Full structured metadata lands in manifest via sync; the skill keeps bundle-time SL text useful for the agent.
`accepted_values`	Add a brief line in the column description: allowed values (truncate long lists)	Also mention enum-like use in `discover_data` / filters.
`relationships`	Add or confirm `joins:` on the overlay only when `to` resolves to a real table via `read_raw_file` + `discover_data` / `entity_details`	If the ref cannot be resolved, capture the intent in a wiki page instead.

Physical schema grounding

dbt YAML is documentation and test metadata; it is not permission to invent physical columns. Before writing any table-backed SL source, confirm the real warehouse shape with discover_data, sl_discover, or entity_details and use only confirmed column names in column_overrides:, computed-only columns:, grain:, joins:, segments:, and measures[].expr.

For dbt context-source ingest, the dbt connection is usually not the warehouse connection. Call sl_discover without connectionId first, then write overlays to the connection that owns the matching manifest-backed source (for example postgres-warehouse), not to the dbt connection (for example dbt-main). If no matching manifest-backed source is visible on any warehouse connection, do not call sl_write_source; record emit_unmapped_fallback and keep the fact wiki-only.

If a models: entry has no columns: block, or the available raw files do not confirm the physical column names, do not synthesize a full standalone source. Write a wiki note or a description-only overlay for the resolved manifest table instead. If a business metric is described but its referenced column is not confirmed in the warehouse schema, omit the measure and capture the unresolved intent in the wiki.

Include rawPaths on every wiki_write, sl_write_source, and sl_edit_source call with only the dbt YAML files that directly support the action.

After every sl_write_source, call sl_validate. A validation error saying a declared column or measure reference is absent from the physical table is a hard stop: re-read the warehouse-backed source and rewrite with confirmed names, or remove the invalid SL fields.

Identifier Verification Protocol

Before writing a wiki page or SL source on any topic:

discover_data({query: "<topic>"}) - see what wikis, SL sources, and raw tables already exist. Prefer updating existing pages over creating new ones.

Before emitting any schema.table or schema.table.column into a wiki body, SL source, tables: frontmatter, sl_refs, or emit_unmapped_fallback:

entity_details({connectionId, targets: [{display: "<identifier>"}]}) - confirm the identifier resolves; inspect native types, FK/PK, and sampleValues.
For literal values from the source, such as status codes or plan tiers, check whether they appear in entity_details sampleValues for the relevant column. If sampleValues is short or the sample may have missed real values, run a sql_execution probe with the same warehouse connection id: sql_execution({connectionId, sql: "SELECT DISTINCT <col> FROM <ref> LIMIT 50"}).
If the candidate identifier still does not resolve, do one of:
- Use sql_execution({connectionId, sql: "SELECT 1 FROM <ref> LIMIT 0"}). If it errors, the identifier is fictional.
- Wrap the identifier in [unverified - from <rawPath>] in the wiki body, citing the exact raw path that mentioned it.
- When recording emit_unmapped_fallback with no_physical_table, include the failing probe error in clarification.
Never copy <schema>.<table> placeholder strings from these instructions into output.

1.1 test hints (descriptions / meta)

When YAML shows accepted_values or not_null, add short hints into column_overrides[].descriptions (for example under user) or freeform column notes so chat and validation see intent before the next git sync refreshes constraints / enum_values in _schema. Keep hints under a few words when possible.

Overlap with MetricFlow

If the same bundle also has MetricFlow semantic_models: / metrics:, the metricflow_ingest skill owns semantic/metric shapes. This skill focuses on raw dbt schema YAML (models, sources, tests). If both apply, load metricflow_ingest first when the file is clearly MetricFlow; otherwise use dbt_ingest for schema.yml without semantic_models.

Do not

Do not run dbt CLI or assume target/ / manifest.json exists in the upload.
Do not invent column names, grain keys, or measure expressions from dbt model names, descriptions, tests, or common naming patterns.
Do not write computed columns:, column_overrides:, grain:, or measures: for a dbt model unless those exact column names are confirmed by dbt YAML columns or warehouse schema discovery.
Do not invent joins from relationships tests if the target model/table is not found in SL or the warehouse.
Do not read peerFileIndex paths - use read_raw_file only on rawFiles and dependencyPaths from the WorkUnit.

6.6 KiB Raw Permalink Blame History