feat(context): add warehouse verification tools (#46)

* feat(context): add warehouse dialect dispatch

* feat(context): read warehouse scan catalog

* feat(context): add entity details verification tool

* feat(context): add ingest SQL verification tool

* feat(context): add raw warehouse discovery tool

* feat(context): expose warehouse verification tools to ingest

* docs(context): add ingest identifier verification protocol

* test(context): guard ingest identifier verification prompts

* chore(context): verify warehouse verification tools

* docs: add warehouse verification tools plan and spec

* fix(context): expose target warehouses to Notion ingest

* fix(context): update ingest prompts for warehouse verification tools

* fix(context): scope raw schema discovery to allowed connections

* fix(context): verify warehouse column display targets

* docs: add notion warehouse verification gap closure plan

* fix(context): include raw discovery connection names

* fix(context): expose warehouse targets for LookML and MetricFlow

* fix(context): pass connection config to ingest query executors

* fix(cli): enable read-only SQL probes for local ingest

* docs: add warehouse verification final v1 closure plan

* fix(context): align warehouse sql probe prompt shape

* docs: add warehouse verification prompt shape closure plan

* test(context): catch connectionless sql execution prompt examples

* fix(context): include connection name in sl capture sql example

* docs: add warehouse verification sql example closure plan

* fix(context): report structured entity detail misses

* docs: add warehouse verification structured target miss closure plan

* fix: report untracked squash merge conflicts

* feat: require ingest verification ledger

* fix: stabilize ingest wiki references
Andrey Avtomonov 2026-05-13 13:43:23 +02:00 committed by GitHub
parent bcb0d2f8f7
commit c22248dabf
89 changed files with 7818 additions and 191 deletions

@ -0,0 +1,785 @@
# Notion Warehouse Verification Gap Closure Implementation Plan
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
**Goal:** Close the remaining v1 gaps that prevent ingest agents, especially
Notion WorkUnits, from reliably verifying warehouse table and column
identifiers before writing wiki or semantic-layer output.
**Architecture:** Keep the existing warehouse verification tool module and
runner wiring. Add Notion target-warehouse scoping through the local adapter
factory, make the active WorkUnit prompt name the shipped tools, enforce
`allowedConnectionNames` in `discover_data`, and teach `entity_details` to
resolve column-level display targets and report missing columns.
**Tech Stack:** TypeScript, Node 22, Vitest, AI SDK v6 tools, Zod, KTX local
ingest adapters, KTX file store.
---
## Audit summary
The previous implementation plan landed the main tool module and prompt
protocol, but four v1-blocking gaps remain:
- Notion ingest sessions still allow only the Notion connection unless a
specific adapter supplies target IDs. `NotionSourceAdapter` does not supply
target warehouse IDs, so the original Notion hallucination case cannot use
`entity_details` or raw-schema `discover_data` for the warehouse connection.
- The active WorkUnit framing prompt still tells agents to call
`wiki_sl_search` and `sl_describe_table`, which are not shipped KTX tools.
- `discover_data` accepts an explicit out-of-scope `connectionName` and still
searches raw schema for that connection.
- `entity_details({ targets: [{ display: "schema.table.column" }] })` does not
resolve column display strings and does not fail explicit missing-column
targets.
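The first gap comes down to how a session's allowed set is assembled. A minimal sketch, assuming the runner unions the source connection with whatever an adapter's `listTargetConnectionIds()` returns; the `buildAllowedConnectionNames` helper and its signature are hypothetical and are not the runner's actual code:
```ts
// Hypothetical helper for illustration; the real runner wiring may differ.
function buildAllowedConnectionNames(
  sourceConnectionName: string,
  targetConnectionIds: readonly string[] | undefined,
): ReadonlySet<string> {
  // The source connection is always in scope. Target warehouses are added
  // only when the adapter supplies them.
  const names = new Set<string>([sourceConnectionName]);
  for (const id of targetConnectionIds ?? []) {
    names.add(id);
  }
  return names;
}

// Without target IDs, a Notion session can only see the Notion connection.
const notionOnly = buildAllowedConnectionNames('notion', undefined);

// With target IDs, the warehouse connection becomes reachable as well.
const notionWithWarehouse = buildAllowedConnectionNames('notion', ['warehouse']);
```
Under this assumption, Task 1 only needs to change what the adapter reports; the scoping logic itself stays untouched.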
Non-blocking gaps remain out of scope for this plan:
- Full DDL-style `entity_details` formatting with FK and profile summaries.
- AST-backed SQL read-only validation for data-modifying CTEs.
- Search over `enrichment/descriptions.json` for generated descriptions.
- Lexicographic latest-sync edge cases for non-timestamp sync IDs.
- Hard write-time validation in `wiki_write` and `emit_unmapped_fallback`.
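To make the latest-sync item concrete: if the latest sync is picked by string comparison, IDs that are not fixed-width timestamps can sort in a surprising order. A small illustration with made-up sync IDs; `latestLexicographic` is a hypothetical helper, not the catalog's actual selection code:
```ts
// Hypothetical helper: picks the "latest" ID by plain string ordering.
function latestLexicographic(syncIds: readonly string[]): string | undefined {
  // The default sort compares strings by UTF-16 code units.
  const sorted = [...syncIds].sort();
  return sorted[sorted.length - 1];
}

// Fixed-width timestamp IDs sort chronologically as strings.
const timestampLatest = latestLexicographic(['2026-05-11T09-00-00Z', '2026-05-12T09-00-00Z']);

// Counter-style IDs do not: 'sync-10' sorts before 'sync-9',
// so the older sync wins.
const counterLatest = latestLexicographic(['sync-9', 'sync-10']);
```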
## File structure
Modify these files:
- `packages/context/src/ingest/adapters/notion/notion.adapter.ts`: add
configured target warehouse IDs and implement `listTargetConnectionIds()`.
- `packages/context/src/ingest/adapters/notion/notion.adapter.test.ts`: cover
Notion target connection ID fan-out.
- `packages/context/src/ingest/local-adapters.ts`: pass primary warehouse IDs
into `NotionSourceAdapter`.
- `packages/context/src/ingest/local-adapters.test.ts`: cover local Notion
adapter target IDs.
- `packages/context/src/ingest/adapters/notion/chunk.ts`: update Notion
WorkUnit notes to prefer the warehouse verification tools.
- `packages/context/src/ingest/adapters/notion/notion.adapter.test.ts`: update
Notion note expectations.
- `packages/context/prompts/memory_agent_bundle_ingest_work_unit.md`: replace
stale tool names in the active WorkUnit prompt.
- `packages/context/src/ingest/ingest-prompts.test.ts`: guard the WorkUnit
prompt against stale tool names.
- `packages/context/src/ingest/tools/warehouse-verification/discover-data.tool.ts`:
refuse explicit out-of-scope connection names.
- `packages/context/src/ingest/tools/warehouse-verification/discover-data.tool.test.ts`:
cover `discover_data` scoping.
- `packages/context/src/ingest/tools/warehouse-verification/warehouse-catalog.service.ts`:
add column-aware display-target resolution.
- `packages/context/src/ingest/tools/warehouse-verification/warehouse-catalog.service.test.ts`:
cover column display resolution.
- `packages/context/src/ingest/tools/warehouse-verification/entity-details.tool.ts`:
use column-aware resolution and report missing columns.
- `packages/context/src/ingest/tools/warehouse-verification/entity-details.tool.test.ts`:
cover column display and missing-column behavior.
### Task 1: Give Notion ingest access to target warehouses
**Files:**
- Modify: `packages/context/src/ingest/adapters/notion/notion.adapter.ts`
- Modify: `packages/context/src/ingest/adapters/notion/notion.adapter.test.ts`
- Modify: `packages/context/src/ingest/local-adapters.ts`
- Modify: `packages/context/src/ingest/local-adapters.test.ts`
- [ ] **Step 1: Write the failing Notion adapter test**
Add this test inside `describe('NotionSourceAdapter', ...)` in
`packages/context/src/ingest/adapters/notion/notion.adapter.test.ts`:
```ts
it('returns configured target warehouse connection ids', async () => {
const adapter = new NotionSourceAdapter({
targetConnectionIds: ['warehouse', 'warehouse', 'analytics'],
});
await expect(adapter.listTargetConnectionIds?.(stagedDir)).resolves.toEqual([
'analytics',
'warehouse',
]);
});
```
- [ ] **Step 2: Run the failing Notion adapter test**
Run:
```bash
pnpm --filter @ktx/context exec vitest run src/ingest/adapters/notion/notion.adapter.test.ts -t "target warehouse connection ids"
```
Expected: FAIL because `NotionSourceAdapterDeps` has no
`targetConnectionIds` option and `NotionSourceAdapter` does not implement
`listTargetConnectionIds()`.
- [ ] **Step 3: Implement Notion target connection IDs**
Modify `packages/context/src/ingest/adapters/notion/notion.adapter.ts`:
```ts
export interface NotionSourceAdapterDeps {
onPullSucceeded?: (ctx: NotionPullSucceededContext) => Promise<void>;
logger?: NotionFetchLogger;
targetConnectionIds?: string[];
}
function uniqueSorted(values: readonly string[] | undefined): string[] {
return [...new Set(values ?? [])].sort((left, right) =>
left.localeCompare(right),
);
}
```
Add this method to `NotionSourceAdapter`:
```ts
async listTargetConnectionIds(_stagedDir: string): Promise<string[]> {
return uniqueSorted(this.deps.targetConnectionIds);
}
```
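As a quick check of the de-duplication and ordering that the Step 1 test expects, here is the `uniqueSorted` helper exercised on its own. The sample inputs are illustrative, and `Array.from` stands in for the spread only to keep the snippet independent of compiler settings:
```ts
function uniqueSorted(values: readonly string[] | undefined): string[] {
  // A Set drops duplicates; localeCompare gives a stable alphabetical order.
  return Array.from(new Set(values ?? [])).sort((left, right) =>
    left.localeCompare(right),
  );
}

// Duplicates collapse and the result is sorted.
const configuredIds = uniqueSorted(['warehouse', 'warehouse', 'analytics']);

// A missing option yields an empty list rather than throwing.
const unconfiguredIds = uniqueSorted(undefined);
```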
- [ ] **Step 4: Pass primary warehouses into the local Notion adapter**
Modify the Notion adapter construction in
`packages/context/src/ingest/local-adapters.ts`:
```ts
new NotionSourceAdapter({
targetConnectionIds: primaryWarehouseConnectionIds(project),
...(options.logger ? { logger: options.logger } : {}),
}),
```
- [ ] **Step 5: Write the local adapter fan-out test**
Add this test to `packages/context/src/ingest/local-adapters.test.ts`:
```ts
it('passes primary warehouse connection ids to the local Notion adapter', async () => {
const adapters = createDefaultLocalIngestAdapters(
projectWithConnections({
notion: {
driver: 'notion',
auth_token: 'secret',
crawl_mode: 'selected_roots',
root_page_ids: ['page-1'],
},
warehouse: {
driver: 'postgres',
url: 'postgresql://readonly@db.example.test/analytics',
},
docs: {
driver: 'dbt',
source_dir: './dbt',
},
} as never),
);
const notion = adapters.find((adapter) => adapter.source === 'notion');
await expect(notion?.listTargetConnectionIds?.('/tmp/staged-notion')).resolves.toEqual([
'warehouse',
]);
});
```
- [ ] **Step 6: Run the Notion target tests**
Run:
```bash
pnpm --filter @ktx/context exec vitest run \
src/ingest/adapters/notion/notion.adapter.test.ts \
src/ingest/local-adapters.test.ts \
-t "target warehouse connection ids|local Notion adapter"
```
Expected: PASS.
- [ ] **Step 7: Commit**
Run:
```bash
git add \
packages/context/src/ingest/adapters/notion/notion.adapter.ts \
packages/context/src/ingest/adapters/notion/notion.adapter.test.ts \
packages/context/src/ingest/local-adapters.ts \
packages/context/src/ingest/local-adapters.test.ts
git commit -m "fix(context): expose target warehouses to Notion ingest"
```
### Task 2: Remove stale tool names from active ingest prompts
**Files:**
- Modify: `packages/context/prompts/memory_agent_bundle_ingest_work_unit.md`
- Modify: `packages/context/src/ingest/ingest-prompts.test.ts`
- Modify: `packages/context/src/ingest/adapters/notion/chunk.ts`
- Modify: `packages/context/src/ingest/adapters/notion/notion.adapter.test.ts`
- [ ] **Step 1: Add failing prompt guards**
Add this test to `packages/context/src/ingest/ingest-prompts.test.ts`:
```ts
it('uses shipped warehouse verification tools in the WorkUnit prompt', async () => {
const prompt = await readFile(
new URL('../../prompts/memory_agent_bundle_ingest_work_unit.md', import.meta.url),
'utf-8',
);
expect(prompt).toContain('discover_data');
expect(prompt).toContain('entity_details');
expect(prompt).not.toContain('wiki_sl_search');
expect(prompt).not.toContain('sl_describe_table');
});
```
- [ ] **Step 2: Run the failing prompt guard**
Run:
```bash
pnpm --filter @ktx/context exec vitest run src/ingest/ingest-prompts.test.ts -t "warehouse verification tools"
```
Expected: FAIL because the WorkUnit prompt still contains `wiki_sl_search` and
`sl_describe_table`.
- [ ] **Step 3: Update the WorkUnit framing prompt**
In `packages/context/prompts/memory_agent_bundle_ingest_work_unit.md`, replace
the first `<role>` paragraph with:
```md
You are processing ONE WorkUnit of a multi-file ingest bundle. The WorkUnit gives you a slice of raw source files (LookML views, dbt/MetricFlow YAMLs, Metabase card JSONs, Notion pages, or similar) and you must translate that slice into KTX semantic-layer sources and/or knowledge wiki pages, in one pass. Prior WorkUnits in this same job may have already written SL sources and wiki pages; their writes are visible on the working branch and discoverable with `discover_data`.
```
In workflow step 2, replace the final sentence with:
```md
The triage skill tells you how to react when `discover_data` reveals that a prior WU already wrote something overlapping.
```
In workflow step 4, replace the sentence that starts
`For each raw file:` with:
```md
4. For each raw file: call `read_raw_file` (or `read_raw_span` for slicing large files) to load content. Before writing a new SL source or wiki page, call `discover_data` for each candidate source, table, metric, or topic name to find prior-WU writes, existing wiki pages, SL sources, and raw warehouse matches; apply `ingest_triage` when you hit one, and apply any matching canonical pin before deciding whether to edit, rename, or skip.
```
In the `<do_not>` block, replace the physical-column rule with:
```md
- Do not invent physical column names or grain keys. For table-backed SL sources, every `columns:`, `grain:`, `joins:`, `segments:`, and `measures[].expr` column must come from raw-file column declarations or warehouse-backed discovery (`discover_data`, `sl_discover`, `entity_details`). If column names are not confirmed, capture the business context in wiki instead of writing a full SL source.
```
- [ ] **Step 4: Update Notion WorkUnit notes**
In `packages/context/src/ingest/adapters/notion/chunk.ts`, replace
`NOTION_SL_WRITE_GUIDANCE` with:
```ts
const NOTION_SL_WRITE_GUIDANCE =
'Write wiki entries with wiki_write. Wiki keys must be flat slugs like orbit-company-overview, not orbit/company-overview. Search existing wiki pages, SL sources, and raw warehouse schema for the same tables or sl_refs with discover_data before creating a new page. Only write or edit SL sources after discover_data plus sl_discover/sl_read_source or entity_details confirms a mapped non-Notion target source; if no mapped target exists, emit_unmapped_fallback and keep the fact wiki-only. Notion dataSourceCount counts Notion databases/data sources only, not warehouse/dbt mappings. If a warehouse/dbt connection exists but the named table or source is absent, use reason no_physical_table rather than no_connection_mapping. Do not create SL sources under the Notion connection just because a page mentions a warehouse table.';
```
In the `reconcileNotes` array in the same file, replace:
```ts
'Notion dataSourceCount is Notion-only; use sl_discover for warehouse/dbt mapping decisions.',
```
with:
```ts
'Notion dataSourceCount is Notion-only; use discover_data/entity_details for warehouse/dbt mapping decisions.',
```
- [ ] **Step 5: Update Notion note expectations**
In `packages/context/src/ingest/adapters/notion/notion.adapter.test.ts`,
update the note expectations in `it('chunks changed Notion pages...')`:
```ts
expect(result.workUnits[0].notes).toContain('discover_data');
expect(result.workUnits[0].notes).toContain('entity_details');
```
Update the exact `reconcileNotes` expectation to:
```ts
expect(result.reconcileNotes).toEqual([
'Notion maxKnowledgeCreatesPerRun=25',
'Notion maxKnowledgeUpdatesPerRun=20',
'Notion dataSourceCount is Notion-only; use discover_data/entity_details for warehouse/dbt mapping decisions.',
'Reconcile Notion wiki pages sharing tables/sl_refs before creating distinct artifacts.',
]);
```
- [ ] **Step 6: Run prompt and Notion note tests**
Run:
```bash
pnpm --filter @ktx/context exec vitest run \
src/ingest/ingest-prompts.test.ts \
src/ingest/adapters/notion/notion.adapter.test.ts
```
Expected: PASS.
- [ ] **Step 7: Commit**
Run:
```bash
git add \
packages/context/prompts/memory_agent_bundle_ingest_work_unit.md \
packages/context/src/ingest/ingest-prompts.test.ts \
packages/context/src/ingest/adapters/notion/chunk.ts \
packages/context/src/ingest/adapters/notion/notion.adapter.test.ts
git commit -m "fix(context): update ingest prompts for warehouse verification tools"
```
### Task 3: Enforce allowed connection scope in discover_data
**Files:**
- Modify: `packages/context/src/ingest/tools/warehouse-verification/discover-data.tool.ts`
- Modify: `packages/context/src/ingest/tools/warehouse-verification/discover-data.tool.test.ts`
- [ ] **Step 1: Write the failing scoping test**
Add this test to
`packages/context/src/ingest/tools/warehouse-verification/discover-data.tool.test.ts`:
```ts
it('refuses explicit out-of-scope connection names', async () => {
const result = await tool.call({ query: 'orders', connectionName: 'billing' }, context);
expect(result.markdown).toContain('Connection "billing" is not available to this ingest stage.');
expect(result.structured).toEqual({ wiki: null, sl: null, raw: null });
expect(wikiSearchTool.call).not.toHaveBeenCalled();
expect(slDiscoverTool.call).not.toHaveBeenCalled();
expect(catalog.searchByName).not.toHaveBeenCalled();
});
```
- [ ] **Step 2: Run the failing scoping test**
Run:
```bash
pnpm --filter @ktx/context exec vitest run src/ingest/tools/warehouse-verification/discover-data.tool.test.ts -t "out-of-scope"
```
Expected: FAIL because `discover_data` currently searches raw schema for an
explicit `connectionName` even when it is not in `allowedConnectionNames`.
- [ ] **Step 3: Add the scope guard**
In
`packages/context/src/ingest/tools/warehouse-verification/discover-data.tool.ts`,
add this helper near `totalSources()`:
```ts
function allowedConnectionNames(context: ToolContext): ReadonlySet<string> | null {
return context.session?.allowedConnectionNames ?? null;
}
```
At the top of `DiscoverDataTool.call()`, before the `sourceName` branch and
before calling any child tool, add:
```ts
const allowed = allowedConnectionNames(context);
if (input.connectionName && allowed && !allowed.has(input.connectionName)) {
return {
markdown: `Connection "${input.connectionName}" is not available to this ingest stage.`,
structured: { wiki: null, sl: null, raw: null },
};
}
```
Then replace the raw connection-list construction with:
```ts
const connections = input.connectionName ? [input.connectionName] : [...(allowed ?? [])].sort();
```
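The guard and the connection-list construction can be read together as one decision. A self-contained sketch of that decision as a pure function, assuming the same semantics as the snippets above; `decideRawSearchScope` and `ScopeDecision` are hypothetical names and are not part of `discover-data.tool.ts`:
```ts
type ScopeDecision =
  | { kind: 'refused'; message: string }
  | { kind: 'search'; connections: string[] };

// Hypothetical pure-function version of the scope guard.
function decideRawSearchScope(
  connectionName: string | undefined,
  allowed: ReadonlySet<string> | null,
): ScopeDecision {
  // An explicit name outside the allowed set is refused before any search.
  if (connectionName && allowed && !allowed.has(connectionName)) {
    return {
      kind: 'refused',
      message: `Connection "${connectionName}" is not available to this ingest stage.`,
    };
  }
  // An explicit in-scope name narrows the search to that connection;
  // otherwise every allowed connection is searched in sorted order.
  const connections = connectionName ? [connectionName] : Array.from(allowed ?? []).sort();
  return { kind: 'search', connections };
}

const allowedNames = new Set(['warehouse', 'analytics']);
const refused = decideRawSearchScope('billing', allowedNames);
const narrowed = decideRawSearchScope('warehouse', allowedNames);
const fanOut = decideRawSearchScope(undefined, allowedNames);
```
Note that when no session scope exists (`allowed` is `null`), an explicit name still passes through, which mirrors the behavior of the guard as written in this step.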
- [ ] **Step 4: Run discover_data tests**
Run:
```bash
pnpm --filter @ktx/context exec vitest run src/ingest/tools/warehouse-verification/discover-data.tool.test.ts
```
Expected: PASS.
- [ ] **Step 5: Commit**
Run:
```bash
git add \
packages/context/src/ingest/tools/warehouse-verification/discover-data.tool.ts \
packages/context/src/ingest/tools/warehouse-verification/discover-data.tool.test.ts
git commit -m "fix(context): scope raw schema discovery to allowed connections"
```
### Task 4: Fix column-level entity_details verification
**Files:**
- Modify: `packages/context/src/ingest/tools/warehouse-verification/warehouse-catalog.service.ts`
- Modify: `packages/context/src/ingest/tools/warehouse-verification/warehouse-catalog.service.test.ts`
- Modify: `packages/context/src/ingest/tools/warehouse-verification/entity-details.tool.ts`
- Modify: `packages/context/src/ingest/tools/warehouse-verification/entity-details.tool.test.ts`
- [ ] **Step 1: Write failing catalog column-target tests**
First update `seedLiveDatabaseScan()` in `warehouse-catalog.service.test.ts` so
BigQuery tables have a project/catalog. Replace the repeated inline table refs with:
```ts
const tableRef = {
catalog: driver === 'bigquery' ? 'analytics' : null,
db: driver === 'sqlite' ? null : 'public',
name: 'orders',
};
```
Use `tableRef.catalog`, `tableRef.db`, and `tableRef.name` for the seeded
table and profile table references.
Then add these tests to
`packages/context/src/ingest/tools/warehouse-verification/warehouse-catalog.service.test.ts`:
```ts
it('resolves postgres column display strings without treating the column as a table', async () => {
await seedLiveDatabaseScan();
const catalog = new WarehouseCatalogService({ fileStore: project.fileStore });
await expect(catalog.resolveDisplayTarget('warehouse', 'public.orders.status')).resolves.toMatchObject({
resolved: { catalog: null, db: 'public', name: 'orders', column: 'status' },
candidates: [],
dialect: 'postgres',
});
});
it('resolves BigQuery column display strings with four parts', async () => {
await seedLiveDatabaseScan('warehouse', 'sync-bigquery', 'bigquery');
const catalog = new WarehouseCatalogService({ fileStore: project.fileStore });
await expect(catalog.resolveDisplayTarget('warehouse', 'analytics.public.orders.status')).resolves.toMatchObject({
resolved: { catalog: 'analytics', db: 'public', name: 'orders', column: 'status' },
candidates: [],
dialect: 'bigquery',
});
});
```
- [ ] **Step 2: Run the failing catalog tests**
Run:
```bash
pnpm --filter @ktx/context exec vitest run src/ingest/tools/warehouse-verification/warehouse-catalog.service.test.ts -t "column display"
```
Expected: FAIL because `resolveDisplayTarget()` does not exist.
- [ ] **Step 3: Implement column-aware display resolution**
In
`packages/context/src/ingest/tools/warehouse-verification/warehouse-catalog.service.ts`,
add this exported interface near `RawSchemaHit`:
```ts
export interface DisplayTargetResolution {
resolved: (KtxTableRef & { column?: string }) | null;
candidates: KtxTableRef[];
dialect: string;
}
```
Add these helpers near `parseDisplay()`:
```ts
function expectedDisplayPartCount(driver: CatalogDriver): number {
if (driver === 'sqlite' || driver === 'sqlite3') {
return 1;
}
if (driver === 'bigquery' || driver === 'snowflake' || driver === 'sqlserver') {
return 3;
}
return 2;
}
function parseColumnDisplay(driver: CatalogDriver, display: string): (KtxTableRef & { column: string }) | null {
const parts = splitDisplay(display);
const tablePartCount = expectedDisplayPartCount(driver);
if (parts.length !== tablePartCount + 1) {
return null;
}
const column = parts.at(-1);
if (!column) {
return null;
}
const table = parseDisplay(driver, parts.slice(0, -1).join('.'));
return table ? { ...table, column } : null;
}
```
Add this method to `WarehouseCatalogService` after `resolveDisplay()`:
```ts
async resolveDisplayTarget(connectionName: string, display: string): Promise<DisplayTargetResolution> {
const catalog = await this.loadCatalog(connectionName);
if (!catalog) {
return { resolved: null, candidates: [], dialect: 'unknown' };
}
const dialect = getDialectForDriver(catalog.driver).type;
const tableResolution = await this.resolveDisplay(connectionName, display);
if (tableResolution.resolved) {
return tableResolution;
}
const parsedColumn = parseColumnDisplay(catalog.driver, display);
if (!parsedColumn) {
return { resolved: null, candidates: bestCandidates(catalog.tables, display), dialect };
}
const table = catalog.tables.find((candidate) => refsEqual(candidate, parsedColumn));
if (!table) {
return { resolved: null, candidates: bestCandidates(catalog.tables, display), dialect };
}
return {
resolved: {
catalog: table.catalog,
db: table.db,
name: table.name,
column: parsedColumn.column,
},
candidates: [],
dialect,
};
}
```
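The part-count rule is the core of the column parsing: a column display has exactly one more dotted part than a table display for the same driver. A standalone sketch of that rule, assuming a naive split on `.` that ignores quoted identifiers; the `Driver`, `ColumnRefSketch`, and `parseColumnDisplaySketch` names are hypothetical and do not replace the real `splitDisplay()` and `parseDisplay()` helpers:
```ts
type Driver = 'sqlite' | 'postgres' | 'bigquery';

interface ColumnRefSketch {
  catalog: string | null;
  db: string | null;
  name: string;
  column: string;
}

// Number of dotted parts in a table display for each driver.
function tablePartCount(driver: Driver): number {
  if (driver === 'sqlite') return 1;
  if (driver === 'bigquery') return 3;
  return 2;
}

function parseColumnDisplaySketch(driver: Driver, display: string): ColumnRefSketch | null {
  // Assumption: identifiers contain no quoted dots.
  const parts = display.split('.');
  if (parts.length !== tablePartCount(driver) + 1) return null;
  if (parts.some((part) => part.length === 0)) return null;
  const column = parts[parts.length - 1];
  const name = parts[parts.length - 2];
  if (column === undefined || name === undefined) return null;
  // Remaining leading parts are the schema and, for three-part drivers, the catalog.
  const db = parts.length >= 3 ? (parts[parts.length - 3] ?? null) : null;
  const catalog = parts.length >= 4 ? (parts[0] ?? null) : null;
  return { catalog, db, name, column };
}

const postgresColumn = parseColumnDisplaySketch('postgres', 'public.orders.status');
const bigqueryColumn = parseColumnDisplaySketch('bigquery', 'analytics.public.orders.status');
const tableOnly = parseColumnDisplaySketch('postgres', 'public.orders');
```
A table-only display returns `null` here, which is why `resolveDisplayTarget()` tries plain table resolution first and falls back to column parsing only when that fails.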
- [ ] **Step 4: Write failing entity_details column tests**
Add these tests to
`packages/context/src/ingest/tools/warehouse-verification/entity-details.tool.test.ts`:
```ts
it('resolves display targets that include a column name', async () => {
const result = await tool.call(
{ connectionName: 'warehouse', targets: [{ display: 'public.orders.status' }] },
context,
);
expect(result.markdown).toContain('### public.orders');
expect(result.markdown).toContain('- status (text, nullable=false)');
expect(result.markdown).not.toContain('- id (integer');
expect(result.structured.resolved).toHaveLength(1);
expect(result.structured.resolved[0]?.columns.map((column) => column.name)).toEqual(['status']);
});
it('reports missing explicit columns instead of returning an empty column list', async () => {
const result = await tool.call(
{ connectionName: 'warehouse', targets: [{ display: 'public.orders.plan_tier' }] },
context,
);
expect(result.markdown).toContain('Column not found in scan: public.orders.plan_tier');
expect(result.markdown).toContain('Available columns: id, status');
expect(result.structured.resolved).toHaveLength(0);
expect(result.structured.missing).toHaveLength(1);
});
```
- [ ] **Step 5: Run the failing entity_details tests**
Run:
```bash
pnpm --filter @ktx/context exec vitest run src/ingest/tools/warehouse-verification/entity-details.tool.test.ts -t "column"
```
Expected: FAIL because display column targets are treated as table names and
missing columns are not reported.
- [ ] **Step 6: Use column-aware resolution in entity_details**
In
`packages/context/src/ingest/tools/warehouse-verification/entity-details.tool.ts`,
add this helper near `appendTableMarkdown()`:
```ts
function findColumn(detail: TableDetail, columnName: string): TableDetail['columns'][number] | null {
const normalized = columnName.toLowerCase();
return detail.columns.find((column) => column.name.toLowerCase() === normalized) ?? null;
}
```
Replace the display resolution block inside the `for (const target of
input.targets)` loop with:
```ts
const resolution =
'display' in target
? await catalog.resolveDisplayTarget(input.connectionName, target.display)
: {
resolved: { catalog: target.catalog, db: target.db, name: target.name, column: target.column },
candidates: [],
dialect: '',
};
```
After `const detail = await catalog.getTable(...)`, replace the existing
`resolved.push(detail); appendTableMarkdown(...)` lines with:
```ts
const requestedColumn = resolution.resolved.column;
if (requestedColumn) {
const column = findColumn(detail, requestedColumn);
if (!column) {
missing.push({
target,
candidates: [{ catalog: detail.catalog, db: detail.db, name: detail.name }],
});
parts.push(`Column not found in scan: ${detail.display}.${requestedColumn}`);
parts.push(`Available columns: ${detail.columns.map((candidate) => candidate.name).join(', ')}`);
continue;
}
const scopedDetail = { ...detail, columns: [column] };
resolved.push(scopedDetail);
appendTableMarkdown(parts, scopedDetail, column.name);
continue;
}
resolved.push(detail);
appendTableMarkdown(parts, detail);
```
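The branch above either narrows the table detail to one column or records a miss with the available column names. A self-contained sketch of that outcome, assuming simplified stand-ins for `TableDetail`; `TableDetailSketch`, `ColumnCheck`, and `checkRequestedColumn` are hypothetical names used only for illustration:
```ts
interface ColumnSketch {
  name: string;
  type: string;
}

interface TableDetailSketch {
  display: string;
  columns: ColumnSketch[];
}

type ColumnCheck =
  | { ok: true; detail: TableDetailSketch }
  | { ok: false; lines: string[] };

function checkRequestedColumn(detail: TableDetailSketch, requested: string): ColumnCheck {
  // Match case-insensitively, as findColumn() does.
  const normalized = requested.toLowerCase();
  const column = detail.columns.find((candidate) => candidate.name.toLowerCase() === normalized);
  if (!column) {
    // A miss reports the requested name plus what the scan actually contains.
    return {
      ok: false,
      lines: [
        `Column not found in scan: ${detail.display}.${requested}`,
        `Available columns: ${detail.columns.map((candidate) => candidate.name).join(', ')}`,
      ],
    };
  }
  // A hit scopes the detail down to the single requested column.
  return { ok: true, detail: { display: detail.display, columns: [column] } };
}

const ordersDetail: TableDetailSketch = {
  display: 'public.orders',
  columns: [
    { name: 'id', type: 'integer' },
    { name: 'status', type: 'text' },
  ],
};
const columnHit = checkRequestedColumn(ordersDetail, 'STATUS');
const columnMiss = checkRequestedColumn(ordersDetail, 'plan_tier');
```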
- [ ] **Step 7: Run warehouse verification tests**
Run:
```bash
pnpm --filter @ktx/context exec vitest run \
src/ingest/tools/warehouse-verification/warehouse-catalog.service.test.ts \
src/ingest/tools/warehouse-verification/entity-details.tool.test.ts
```
Expected: PASS.
- [ ] **Step 8: Commit**
Run:
```bash
git add \
packages/context/src/ingest/tools/warehouse-verification/warehouse-catalog.service.ts \
packages/context/src/ingest/tools/warehouse-verification/warehouse-catalog.service.test.ts \
packages/context/src/ingest/tools/warehouse-verification/entity-details.tool.ts \
packages/context/src/ingest/tools/warehouse-verification/entity-details.tool.test.ts
git commit -m "fix(context): verify warehouse column display targets"
```
### Task 5: Verify the v1 gap closure
**Files:**
- Verify all files changed by Tasks 1-4.
- [ ] **Step 1: Run focused tests**
Run:
```bash
pnpm --filter @ktx/context exec vitest run \
src/ingest/adapters/notion/notion.adapter.test.ts \
src/ingest/local-adapters.test.ts \
src/ingest/ingest-prompts.test.ts \
src/ingest/tools/warehouse-verification/discover-data.tool.test.ts \
src/ingest/tools/warehouse-verification/warehouse-catalog.service.test.ts \
src/ingest/tools/warehouse-verification/entity-details.tool.test.ts
```
Expected: PASS.
- [ ] **Step 2: Run package type-check**
Run:
```bash
pnpm --filter @ktx/context run type-check
```
Expected: PASS.
- [ ] **Step 3: Run package tests**
Run:
```bash
pnpm --filter @ktx/context run test
```
Expected: PASS.
- [ ] **Step 4: Run pre-commit on changed files when configured**
Run:
```bash
uv run pre-commit run --files \
packages/context/src/ingest/adapters/notion/notion.adapter.ts \
packages/context/src/ingest/adapters/notion/notion.adapter.test.ts \
packages/context/src/ingest/local-adapters.ts \
packages/context/src/ingest/local-adapters.test.ts \
packages/context/src/ingest/adapters/notion/chunk.ts \
packages/context/prompts/memory_agent_bundle_ingest_work_unit.md \
packages/context/src/ingest/ingest-prompts.test.ts \
packages/context/src/ingest/tools/warehouse-verification/discover-data.tool.ts \
packages/context/src/ingest/tools/warehouse-verification/discover-data.tool.test.ts \
packages/context/src/ingest/tools/warehouse-verification/warehouse-catalog.service.ts \
packages/context/src/ingest/tools/warehouse-verification/warehouse-catalog.service.test.ts \
packages/context/src/ingest/tools/warehouse-verification/entity-details.tool.ts \
packages/context/src/ingest/tools/warehouse-verification/entity-details.tool.test.ts
```
Expected: PASS. If the repo has no pre-commit config or the local `uv` version
cannot satisfy the project pin, record the exact error and rely on focused
tests plus type-check.
- [ ] **Step 5: Inspect final git status**
Run:
```bash
git status --short
```
Expected: only intentional files are modified. Commit any formatter-driven
changes with:
```bash
git add packages/context
git commit -m "chore(context): verify warehouse verification v1 gaps"
```
## Self-review checklist
- Spec coverage: this plan closes the remaining v1 paths for Notion warehouse
verification, active WorkUnit prompt correctness, raw discovery scoping, and
column-level identifier verification.
- Placeholder scan: no task relies on future-work markers, unnamed edge-case
handling, or cross-task shorthand.
- Type consistency: `discover_data` continues to use `connectionName`,
`sl_discover` still receives `connectionId` internally, and
`resolveDisplayTarget()` returns the same table identity plus an optional
`column`.

@ -0,0 +1,957 @@
# Warehouse Verification Final V1 Closure Implementation Plan
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
**Goal:** Close the remaining v1 gaps that still prevent ingest agents from
reliably following warehouse verification results through to `entity_details`
and `sql_execution`.
**Architecture:** Keep the existing warehouse verification module and runner
session scoping. Add connection names to raw discovery hits, expose primary
warehouse targets from the remaining source adapters, and make local ingest
SQL probes use the same scan connector read-only execution path as schema scan.
**Tech Stack:** TypeScript, Node 22, Vitest, AI SDK v6 tools, Zod, KTX local
ingest runtime, KTX scan connectors.
---
## Audit summary
The first two implementation plans landed the warehouse verification tools,
prompt protocol, Notion warehouse scoping, and stale prompt-name cleanup. The
focused audit on May 12, 2026, found three remaining v1-blocking gaps:
- `discover_data` searches multiple allowed raw warehouse scans, but raw hits do
not carry or render `connectionName`. The tool tells the agent to call
`entity_details({connectionName, targets: [...]})`, then omits the required
`connectionName` from the follow-up evidence.
- Local LookML and MetricFlow adapters do not expose primary warehouse target
IDs. The runner only adds adapter-provided targets to `allowedConnectionNames`,
so those WorkUnits cannot use raw warehouse verification unless their source
connection is itself the warehouse.
- `sql_execution` calls the local ingest connection catalog, but the catalog
either has no query executor in normal CLI ingest or calls an injected
executor without `projectDir` and connection config. The default local query
executor cannot dispatch without that config.
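The first gap is easiest to see from the agent's side: a raw hit has to carry enough information to build the follow-up call. A minimal sketch, assuming a raw hit is reduced to a connection name and a display string; the `RawHitSketch` shape and the rendered text are illustrative and are not the tool's actual output format:
```ts
// Hypothetical, simplified raw hit: the real record has more fields.
interface RawHitSketch {
  connectionName: string;
  display: string;
}

// Renders the follow-up call an agent could issue for a given hit.
function renderFollowUp(hit: RawHitSketch): string {
  const connection = JSON.stringify(hit.connectionName);
  const display = JSON.stringify(hit.display);
  return `entity_details({ connectionName: ${connection}, targets: [{ display: ${display} }] })`;
}

// Two hits with the same table name in different connections stay distinguishable.
const warehouseFollowUp = renderFollowUp({ connectionName: 'warehouse', display: 'public.orders' });
const analyticsFollowUp = renderFollowUp({ connectionName: 'analytics', display: 'public.orders' });
```
Without `connectionName` on the hit, both follow-ups would collapse into the same ambiguous call.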
Non-blocking gaps remain out of scope for this v1 plan:
- Full DDL-style `entity_details` formatting with FK and profile summaries.
- AST-backed SQL read-only validation for data-modifying CTE bodies.
- Search over generated `enrichment/descriptions.json`.
- Lexicographic latest-sync edge cases for non-timestamp sync IDs.
- Hard write-time validation in `wiki_write` and `emit_unmapped_fallback`.
## File structure
Modify these files:
- `packages/context/src/ingest/tools/warehouse-verification/warehouse-catalog.service.ts`:
add `connectionName` to raw schema hit records.
- `packages/context/src/ingest/tools/warehouse-verification/discover-data.tool.ts`:
render raw hit connection names and preserve them in structured output.
- `packages/context/src/ingest/tools/warehouse-verification/discover-data.tool.test.ts`:
cover multi-connection raw discovery follow-up data.
- `packages/context/src/ingest/adapters/lookml/lookml.adapter.ts`:
accept and return configured target warehouse connection IDs.
- `packages/context/src/ingest/adapters/lookml/lookml.adapter.test.ts`:
cover LookML target warehouse IDs.
- `packages/context/src/ingest/adapters/metricflow/metricflow.adapter.ts`:
accept and return configured target warehouse connection IDs.
- `packages/context/src/ingest/adapters/metricflow/metricflow.adapter.test.ts`:
cover MetricFlow target warehouse IDs.
- `packages/context/src/ingest/local-adapters.ts`:
pass primary warehouse IDs into LookML and MetricFlow adapters.
- `packages/context/src/ingest/local-adapters.test.ts`:
cover local adapter warehouse target fan-out.
- `packages/context/src/ingest/local-bundle-runtime.ts`:
pass full project connection config to local ingest query executors.
- `packages/context/src/ingest/local-bundle-runtime.test.ts`:
cover the local ingest query executor call shape.
- `packages/context/src/ingest/local-ingest.ts`:
use the shared query executor port type.
- `packages/context/src/mcp/local-project-ports.ts`:
no behavior change expected, but type-checks against the updated local ingest
query executor type.
- `packages/cli/src/ingest.ts`:
provide a read-only scan-connector-backed query executor for normal local
ingest runs.
- `packages/cli/src/ingest.test.ts`:
cover the CLI query executor wiring.
Create these files:
- `packages/cli/src/ingest-query-executor.ts`: CLI query executor that adapts
scan connectors' `executeReadOnly()` method to `KtxSqlQueryExecutorPort`.
- `packages/cli/src/ingest-query-executor.test.ts`: unit coverage for the CLI
ingest query executor.
### Task 1: Preserve raw discovery connection names
**Files:**
- Modify: `packages/context/src/ingest/tools/warehouse-verification/warehouse-catalog.service.ts`
- Modify: `packages/context/src/ingest/tools/warehouse-verification/discover-data.tool.ts`
- Modify: `packages/context/src/ingest/tools/warehouse-verification/discover-data.tool.test.ts`
- [ ] **Step 1: Write the failing multi-connection discovery test**
Add this test to
`packages/context/src/ingest/tools/warehouse-verification/discover-data.tool.test.ts`:
```ts
it('includes connectionName on raw schema hits so entity_details can follow up', async () => {
const multiConnectionContext: ToolContext = {
...context,
session: { allowedConnectionNames: new Set(['warehouse', 'analytics']) } as any,
};
catalog.searchByName.mockImplementation(async (connectionName: string, query: string) => [
{
kind: 'table',
connectionName,
ref: { catalog: null, db: 'public', name: `${connectionName}_${query}` },
display: `public.${connectionName}_${query}`,
matchedOn: 'name',
},
]);
const result = await tool.call({ query: 'orders', limit: 10 }, multiConnectionContext);
expect(catalog.searchByName).toHaveBeenCalledWith('analytics', 'orders', 10);
expect(catalog.searchByName).toHaveBeenCalledWith('warehouse', 'orders', 10);
expect(result.markdown).toContain('connectionName=analytics');
expect(result.markdown).toContain('connectionName=warehouse');
expect(result.markdown).toContain(
'entity_details({connectionName: "analytics", targets: [{display: "public.analytics_orders"}]})',
);
expect(result.structured.raw?.hits.map((hit) => hit.connectionName)).toEqual([
'analytics',
'warehouse',
]);
});
```
- [ ] **Step 2: Run the failing discovery test**
Run:
```bash
pnpm --filter @ktx/context exec vitest run src/ingest/tools/warehouse-verification/discover-data.tool.test.ts -t "connectionName on raw schema hits"
```
Expected: FAIL because `RawSchemaHit` has no `connectionName` property and the
markdown only renders the display string.
- [ ] **Step 3: Add `connectionName` to raw schema hits**
Modify the raw hit type and hit construction in
`packages/context/src/ingest/tools/warehouse-verification/warehouse-catalog.service.ts`:
```ts
export type RawSchemaHit =
| {
kind: 'table';
connectionName: string;
ref: KtxTableRef;
display: string;
matchedOn: 'name' | 'db' | 'comment' | 'description';
}
| {
kind: 'column';
connectionName: string;
ref: KtxTableRef & { column: string };
display: string;
matchedOn: 'name' | 'comment' | 'description';
};
```
In the table hit block, add `connectionName`:
```ts
hits.push({
kind: 'table',
connectionName,
ref: { catalog: table.catalog, db: table.db, name: table.name },
display: formatDisplay(catalog.driver, table),
matchedOn: tableMatch,
});
```
In the column hit block, add `connectionName`:
```ts
hits.push({
kind: 'column',
connectionName,
ref: { catalog: table.catalog, db: table.db, name: table.name, column: column.name },
display: `${formatDisplay(catalog.driver, table)}.${column.name}`,
matchedOn: columnMatch,
});
```
- [ ] **Step 4: Render follow-up-ready raw hits**
Modify the raw schema markdown in
`packages/context/src/ingest/tools/warehouse-verification/discover-data.tool.ts`:
```ts
parts.push('## Raw Warehouse Schema', '> use `entity_details({connectionName, targets: [{display}]})` for full DDL + sample values');
parts.push(
rawHits
.slice(0, limit)
.map(
(hit) =>
`- ${hit.kind}: ${hit.display} [connectionName=${hit.connectionName}] (matched on ${hit.matchedOn}) — ` +
`follow up with \`entity_details({connectionName: "${hit.connectionName}", targets: [{display: "${hit.display}"}]})\``,
)
.join('\n'),
);
```
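As a self-contained illustration of what the follow-up text is meant to look like, the sketch below builds only the `entity_details` call string for one hit. `RawHitSketch` and `followUpCall` are hypothetical, trimmed-down names for this example and are not part of `discover-data.tool.ts`.

```ts
// Hypothetical, reduced hit shape for illustration only.
type RawHitSketch = {
  kind: 'table' | 'column';
  connectionName: string;
  display: string;
};

// Builds the follow-up call text an agent can copy verbatim.
function followUpCall(hit: RawHitSketch): string {
  return `entity_details({connectionName: "${hit.connectionName}", targets: [{display: "${hit.display}"}]})`;
}

const sampleHit: RawHitSketch = {
  kind: 'table',
  connectionName: 'analytics',
  display: 'public.analytics_orders',
};

console.log(followUpCall(sampleHit));
// entity_details({connectionName: "analytics", targets: [{display: "public.analytics_orders"}]})
```

The printed string is the same one the Step 1 test expects to find in `result.markdown`.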
- [ ] **Step 5: Run the discovery test**
Run:
```bash
pnpm --filter @ktx/context exec vitest run src/ingest/tools/warehouse-verification/discover-data.tool.test.ts
```
Expected: PASS.
- [ ] **Step 6: Commit**
Run:
```bash
git add \
packages/context/src/ingest/tools/warehouse-verification/warehouse-catalog.service.ts \
packages/context/src/ingest/tools/warehouse-verification/discover-data.tool.ts \
packages/context/src/ingest/tools/warehouse-verification/discover-data.tool.test.ts
git commit -m "fix(context): include raw discovery connection names"
```
### Task 2: Expose LookML and MetricFlow warehouse targets
**Files:**
- Modify: `packages/context/src/ingest/adapters/lookml/lookml.adapter.ts`
- Modify: `packages/context/src/ingest/adapters/lookml/lookml.adapter.test.ts`
- Modify: `packages/context/src/ingest/adapters/metricflow/metricflow.adapter.ts`
- Modify: `packages/context/src/ingest/adapters/metricflow/metricflow.adapter.test.ts`
- Modify: `packages/context/src/ingest/local-adapters.ts`
- Modify: `packages/context/src/ingest/local-adapters.test.ts`
- [ ] **Step 1: Write failing adapter target tests**
Add this test to
`packages/context/src/ingest/adapters/lookml/lookml.adapter.test.ts`:
```ts
it('returns configured target warehouse connection ids', async () => {
const adapter = new LookmlSourceAdapter({
homeDir: join(tmpRoot, 'home'),
targetConnectionIds: ['warehouse', 'analytics', 'warehouse'],
});
await expect(adapter.listTargetConnectionIds?.(join(tmpRoot, 'staged'))).resolves.toEqual([
'analytics',
'warehouse',
]);
});
```
Add this test to
`packages/context/src/ingest/adapters/metricflow/metricflow.adapter.test.ts`:
```ts
it('returns configured target warehouse connection ids', async () => {
const metricflow = new MetricflowSourceAdapter({
homeDir: join(tmpRoot, 'cache-home'),
targetConnectionIds: ['warehouse', 'analytics', 'warehouse'],
});
await expect(metricflow.listTargetConnectionIds?.(stagedDir)).resolves.toEqual([
'analytics',
'warehouse',
]);
});
```
- [ ] **Step 2: Run the failing adapter tests**
Run:
```bash
pnpm --filter @ktx/context exec vitest run \
src/ingest/adapters/lookml/lookml.adapter.test.ts \
src/ingest/adapters/metricflow/metricflow.adapter.test.ts \
-t "target warehouse connection ids"
```
Expected: FAIL because neither adapter accepts `targetConnectionIds` or
implements `listTargetConnectionIds()`.
- [ ] **Step 3: Implement target ID support in LookML**
Modify `packages/context/src/ingest/adapters/lookml/lookml.adapter.ts`:
```ts
export interface LookmlSourceAdapterDeps {
homeDir: string;
targetConnectionIds?: string[];
}
function uniqueSorted(values: readonly string[] | undefined): string[] {
return [...new Set(values ?? [])].sort((left, right) => left.localeCompare(right));
}
```
Add this method to `LookmlSourceAdapter`:
```ts
async listTargetConnectionIds(_stagedDir: string): Promise<string[]> {
return uniqueSorted(this.deps.targetConnectionIds);
}
```
- [ ] **Step 4: Implement target ID support in MetricFlow**
Modify `packages/context/src/ingest/adapters/metricflow/metricflow.adapter.ts`:
```ts
export interface MetricflowSourceAdapterDeps {
homeDir: string;
targetConnectionIds?: string[];
}
function uniqueSorted(values: readonly string[] | undefined): string[] {
return [...new Set(values ?? [])].sort((left, right) => left.localeCompare(right));
}
```
Add this method to `MetricflowSourceAdapter`:
```ts
async listTargetConnectionIds(_stagedDir: string): Promise<string[]> {
return uniqueSorted(this.deps.targetConnectionIds);
}
```
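To make the expected test output concrete, here is a standalone restatement of the `uniqueSorted` helper from Steps 3 and 4, run against the same input the adapter tests use. It uses `Array.from` instead of spread purely so the sketch compiles under older targets; the behavior is intended to be the same.

```ts
// Standalone copy of the helper: drop duplicates, then sort by locale order.
function uniqueSorted(values: readonly string[] | undefined): string[] {
  return Array.from(new Set(values ?? [])).sort((left, right) => left.localeCompare(right));
}

// Same input as the adapter tests: one duplicate, unsorted.
console.log(uniqueSorted(['warehouse', 'analytics', 'warehouse']));
// [ 'analytics', 'warehouse' ]

// A missing option yields an empty list rather than throwing.
console.log(uniqueSorted(undefined));
// []
```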
- [ ] **Step 5: Pass primary warehouses from the local adapter factory**
Modify the LookML and MetricFlow adapter construction in
`packages/context/src/ingest/local-adapters.ts`:
```ts
new LookmlSourceAdapter({
homeDir: join(project.projectDir, '.ktx/cache'),
targetConnectionIds: primaryWarehouseConnectionIds(project),
}),
```
```ts
new MetricflowSourceAdapter({
homeDir: join(project.projectDir, '.ktx/cache'),
targetConnectionIds: primaryWarehouseConnectionIds(project),
}),
```
- [ ] **Step 6: Write the local adapter fan-out test**
Add this test to `packages/context/src/ingest/local-adapters.test.ts`:
```ts
it('passes primary warehouse connection ids to local LookML and MetricFlow adapters', async () => {
const adapters = createDefaultLocalIngestAdapters(
projectWithConnections({
warehouse: {
driver: 'postgres',
url: 'postgresql://readonly@db.example.test/analytics',
},
lookml_docs: {
driver: 'lookml',
lookml: {
repoUrl: 'https://github.com/acme/lookml.git',
},
},
metrics_repo: {
driver: 'metricflow',
metricflow: {
repoUrl: 'https://github.com/acme/metrics.git',
},
},
} as never),
);
const lookml = adapters.find((adapter) => adapter.source === 'lookml');
const metricflow = adapters.find((adapter) => adapter.source === 'metricflow');
await expect(lookml?.listTargetConnectionIds?.('/tmp/staged-lookml')).resolves.toEqual([
'warehouse',
]);
await expect(metricflow?.listTargetConnectionIds?.('/tmp/staged-metricflow')).resolves.toEqual([
'warehouse',
]);
});
```
- [ ] **Step 7: Run the target fan-out tests**
Run:
```bash
pnpm --filter @ktx/context exec vitest run \
src/ingest/adapters/lookml/lookml.adapter.test.ts \
src/ingest/adapters/metricflow/metricflow.adapter.test.ts \
src/ingest/local-adapters.test.ts
```
Expected: PASS.
- [ ] **Step 8: Commit**
Run:
```bash
git add \
packages/context/src/ingest/adapters/lookml/lookml.adapter.ts \
packages/context/src/ingest/adapters/lookml/lookml.adapter.test.ts \
packages/context/src/ingest/adapters/metricflow/metricflow.adapter.ts \
packages/context/src/ingest/adapters/metricflow/metricflow.adapter.test.ts \
packages/context/src/ingest/local-adapters.ts \
packages/context/src/ingest/local-adapters.test.ts
git commit -m "fix(context): expose warehouse targets for LookML and MetricFlow"
```
### Task 3: Pass full connection config to local ingest SQL execution
**Files:**
- Modify: `packages/context/src/ingest/local-bundle-runtime.ts`
- Modify: `packages/context/src/ingest/local-bundle-runtime.test.ts`
- Modify: `packages/context/src/ingest/local-ingest.ts`
- [ ] **Step 1: Write the failing local connection catalog test**
In `packages/context/src/ingest/local-bundle-runtime.test.ts`, change the
Vitest import to include `vi`:
```ts
import { afterEach, beforeEach, describe, expect, it, vi } from 'vitest';
```
Extend `RuntimeWithConnectionDeps`:
```ts
type RuntimeWithConnectionDeps = {
deps: {
connections: {
listEnabledConnections(ids: string[]): Promise<Array<{ id: string; name: string; connectionType: string }>>;
getConnectionById(connectionId: string): Promise<{ id: string; name: string; connectionType: string } | null>;
executeQuery(connectionId: string, sql: string): Promise<unknown>;
};
};
};
```
Add this test:
```ts
it('passes project connection config to local ingest query executors', async () => {
const agentRunner = new AgentRunnerService({ llmProvider: { getModel: () => ({}) as never } as any });
const queryExecutor = {
execute: vi.fn(async () => ({
headers: ['answer'],
rows: [[1]],
totalRows: 1,
command: 'SELECT',
rowCount: 1,
})),
};
const runtime = createLocalBundleIngestRuntime({
project,
adapters: [new FakeSourceAdapter()],
agentRunner,
queryExecutor,
});
const connections = (runtime.runner as unknown as RuntimeWithConnectionDeps).deps.connections;
await expect(connections.executeQuery('warehouse', 'select 1')).resolves.toMatchObject({
headers: ['answer'],
});
expect(queryExecutor.execute).toHaveBeenCalledWith({
connectionId: 'warehouse',
projectDir: project.projectDir,
connection: project.config.connections.warehouse,
sql: 'select 1',
});
});
```
- [ ] **Step 2: Run the failing local runtime test**
Run:
```bash
pnpm --filter @ktx/context exec vitest run src/ingest/local-bundle-runtime.test.ts -t "project connection config"
```
Expected: FAIL because `LocalConnectionCatalog.executeQuery()` only passes
`connectionId` and `sql`.
- [ ] **Step 3: Update local ingest query executor types**
In `packages/context/src/ingest/local-bundle-runtime.ts`, import the shared
query executor type:
```ts
import { localConnectionInfoFromConfig, type KtxSqlQueryExecutorPort } from '../connections/index.js';
```
Change `CreateLocalBundleIngestRuntimeOptions.queryExecutor` to:
```ts
queryExecutor?: KtxSqlQueryExecutorPort;
```
Change `LocalConnectionCatalog` to store that type:
```ts
class LocalConnectionCatalog implements SlConnectionCatalogPort {
constructor(
private readonly project: KtxLocalProject,
private readonly queryExecutor?: KtxSqlQueryExecutorPort,
) {}
```
Change `executeQuery()`:
```ts
async executeQuery(connectionId: string, sql: string): Promise<KtxQueryResult> {
if (!this.queryExecutor) {
throw new Error('Local ingest has no query executor configured');
}
return this.queryExecutor.execute({
connectionId,
projectDir: this.project.projectDir,
connection: this.project.config.connections[connectionId],
sql,
});
}
```
In `packages/context/src/ingest/local-ingest.ts`, replace the local query
executor object type with the shared port:
```ts
import type { KtxSqlQueryExecutorPort } from '../connections/index.js';
```
```ts
queryExecutor?: KtxSqlQueryExecutorPort;
```
- [ ] **Step 4: Run the local runtime test**
Run:
```bash
pnpm --filter @ktx/context exec vitest run src/ingest/local-bundle-runtime.test.ts -t "project connection config"
```
Expected: PASS.
- [ ] **Step 5: Commit**
Run:
```bash
git add \
packages/context/src/ingest/local-bundle-runtime.ts \
packages/context/src/ingest/local-bundle-runtime.test.ts \
packages/context/src/ingest/local-ingest.ts
git commit -m "fix(context): pass connection config to ingest query executors"
```
### Task 4: Supply a scan-connector query executor to CLI ingest
**Files:**
- Create: `packages/cli/src/ingest-query-executor.ts`
- Create: `packages/cli/src/ingest-query-executor.test.ts`
- Modify: `packages/cli/src/ingest.ts`
- Modify: `packages/cli/src/ingest.test.ts`
- [ ] **Step 1: Write the CLI query executor tests**
Create `packages/cli/src/ingest-query-executor.test.ts`:
```ts
import type { KtxLocalProject } from '@ktx/context/project';
import { createKtxConnectorCapabilities, type KtxScanConnector } from '@ktx/context/scan';
import { describe, expect, it, vi } from 'vitest';
import { createKtxCliIngestQueryExecutor } from './ingest-query-executor.js';
function project(): KtxLocalProject {
return {
projectDir: '/tmp/ktx-query-project',
config: {
project: 'warehouse',
connections: {
warehouse: { driver: 'postgres', url: 'postgresql://readonly@example.test/db' },
},
},
} as unknown as KtxLocalProject;
}
function connector(overrides: Partial<KtxScanConnector> = {}): KtxScanConnector {
return {
id: 'warehouse',
driver: 'postgres',
capabilities: createKtxConnectorCapabilities({ readOnlySql: true }),
async introspect() {
throw new Error('introspect is not used by this test');
},
executeReadOnly: vi.fn(async () => ({
headers: ['answer'],
rows: [[1]],
totalRows: 1,
rowCount: 1,
})),
cleanup: vi.fn(async () => {}),
...overrides,
};
}
describe('createKtxCliIngestQueryExecutor', () => {
it('executes read-only SQL through the scan connector and cleans it up', async () => {
const scanConnector = connector();
const createConnector = vi.fn(async () => scanConnector);
const executor = createKtxCliIngestQueryExecutor(project(), { createConnector });
await expect(
executor.execute({
connectionId: 'warehouse',
connection: { driver: 'postgres', url: 'postgresql://readonly@example.test/db' },
projectDir: '/tmp/ktx-query-project',
sql: 'select 1',
maxRows: 5,
}),
).resolves.toMatchObject({
headers: ['answer'],
rows: [[1]],
totalRows: 1,
command: 'SELECT',
rowCount: 1,
});
expect(createConnector).toHaveBeenCalledWith(project(), 'warehouse');
expect(scanConnector.executeReadOnly).toHaveBeenCalledWith(
{ connectionId: 'warehouse', sql: 'select 1', maxRows: 5 },
{ runId: 'ingest-sql-execution' },
);
expect(scanConnector.cleanup).toHaveBeenCalledTimes(1);
});
it('rejects connectors without read-only SQL support', async () => {
const scanConnector = connector({
capabilities: createKtxConnectorCapabilities({ readOnlySql: false }),
executeReadOnly: undefined,
});
const executor = createKtxCliIngestQueryExecutor(project(), {
createConnector: vi.fn(async () => scanConnector),
});
await expect(
executor.execute({
connectionId: 'warehouse',
connection: { driver: 'postgres' },
projectDir: '/tmp/ktx-query-project',
sql: 'select 1',
}),
).rejects.toThrow('Connection "warehouse" driver "postgres" does not support read-only SQL execution.');
expect(scanConnector.cleanup).toHaveBeenCalledTimes(1);
});
});
```
- [ ] **Step 2: Run the failing CLI query executor test**
Run:
```bash
pnpm --filter @ktx/cli exec vitest run src/ingest-query-executor.test.ts
```
Expected: FAIL because `ingest-query-executor.ts` does not exist.
- [ ] **Step 3: Add the scan-connector-backed query executor**
Create `packages/cli/src/ingest-query-executor.ts`:
```ts
import type { KtxSqlQueryExecutionInput, KtxSqlQueryExecutorPort } from '@ktx/context/connections';
import type { KtxLocalProject } from '@ktx/context/project';
import type { KtxScanConnector, KtxScanContext } from '@ktx/context/scan';
import { createKtxCliScanConnector } from './local-scan-connectors.js';
type CreateConnector = typeof createKtxCliScanConnector;
export interface KtxCliIngestQueryExecutorDeps {
createConnector?: CreateConnector;
}
async function cleanupConnector(connector: KtxScanConnector | null): Promise<void> {
await connector?.cleanup?.();
}
export function createKtxCliIngestQueryExecutor(
project: KtxLocalProject,
deps: KtxCliIngestQueryExecutorDeps = {},
): KtxSqlQueryExecutorPort {
const createConnector = deps.createConnector ?? createKtxCliScanConnector;
return {
async execute(input: KtxSqlQueryExecutionInput) {
let connector: KtxScanConnector | null = null;
try {
connector = await createConnector(project, input.connectionId);
if (!connector.capabilities.readOnlySql || !connector.executeReadOnly) {
throw new Error(
`Connection "${input.connectionId}" driver "${connector.driver}" does not support read-only SQL execution.`,
);
}
const ctx: KtxScanContext = { runId: 'ingest-sql-execution' };
const result = await connector.executeReadOnly(
{ connectionId: input.connectionId, sql: input.sql, maxRows: input.maxRows },
ctx,
);
return {
headers: result.headers,
rows: result.rows,
totalRows: result.totalRows,
command: 'SELECT',
rowCount: result.rowCount,
};
} finally {
await cleanupConnector(connector);
}
},
};
}
```
- [ ] **Step 4: Wire the CLI executor into local ingest runs**
In `packages/cli/src/ingest.ts`, import the executor and type:
```ts
import type { KtxSqlQueryExecutorPort } from '@ktx/context/connections';
import type { KtxLocalProject } from '@ktx/context/project';
import { createKtxCliIngestQueryExecutor } from './ingest-query-executor.js';
```
Extend `KtxIngestDeps`:
```ts
createQueryExecutor?: (project: KtxLocalProject) => KtxSqlQueryExecutorPort;
```
Inside the `args.command === 'run'` branch, after `localIngestOptions` is
defined, add:
```ts
const queryExecutor =
localIngestOptions.queryExecutor ??
(deps.createQueryExecutor ?? createKtxCliIngestQueryExecutor)(project);
```
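The precedence in that expression can be sketched in isolation: an executor already present in the options wins, then an injected factory, then the default factory. `ExecutorSketch`, `ProjectSketch`, and `resolveExecutor` are hypothetical stand-ins for this example, not the real KTX types.

```ts
// Hypothetical stand-ins for the executor and project types.
type ExecutorSketch = { label: string };
type ProjectSketch = { projectDir: string };
type FactorySketch = (project: ProjectSketch) => ExecutorSketch;

// Mirrors: options.queryExecutor ?? (deps.createQueryExecutor ?? defaultFactory)(project)
function resolveExecutor(
  project: ProjectSketch,
  explicit: ExecutorSketch | undefined,
  injectedFactory: FactorySketch | undefined,
  defaultFactory: FactorySketch,
): ExecutorSketch {
  return explicit ?? (injectedFactory ?? defaultFactory)(project);
}

const demoProject: ProjectSketch = { projectDir: '/tmp/demo' };
const defaultFactory: FactorySketch = () => ({ label: 'default' });
const injectedFactory: FactorySketch = () => ({ label: 'injected' });

console.log(resolveExecutor(demoProject, { label: 'explicit' }, injectedFactory, defaultFactory).label); // explicit
console.log(resolveExecutor(demoProject, undefined, injectedFactory, defaultFactory).label); // injected
console.log(resolveExecutor(demoProject, undefined, undefined, defaultFactory).label); // default
```

Because `??` short-circuits, neither factory is called when an explicit executor is supplied.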
Pass `queryExecutor` to both local ingest execution paths. In the Metabase
fan-out call:
```ts
...localIngestOptions,
queryExecutor,
trigger: 'manual_resync',
```
In the normal local ingest call:
```ts
...localIngestOptions,
queryExecutor,
pullConfigOptions: adapterOptions,
```
- [ ] **Step 5: Add CLI wiring coverage**
Add this test to `packages/cli/src/ingest.test.ts`:
```ts
it('supplies a scan-connector query executor to local ingest runs', async () => {
const io = makeIo();
const projectDir = join(tempDir, 'query-executor-project');
await writeWarehouseConfig(projectDir);
const queryExecutor = {
execute: vi.fn(async () => ({
headers: [],
rows: [],
totalRows: 0,
command: 'SELECT',
rowCount: 0,
})),
};
const runLocalIngest = vi.fn(async (input: RunLocalIngestOptions): Promise<LocalIngestResult> =>
completedLocalBundleRun(input, 'query-executor-run'),
);
await expect(
runKtxIngest(
{
command: 'run',
projectDir,
connectionId: 'warehouse',
adapter: 'fake',
outputMode: 'json',
},
io.io,
{
runLocalIngest,
createAdapters: () => [],
createQueryExecutor: () => queryExecutor,
},
),
).resolves.toBe(0);
expect(runLocalIngest).toHaveBeenCalledWith(expect.objectContaining({ queryExecutor }));
});
```
- [ ] **Step 6: Run CLI query executor tests**
Run:
```bash
pnpm --filter @ktx/cli exec vitest run src/ingest-query-executor.test.ts
pnpm --filter @ktx/cli exec vitest run src/ingest.test.ts -t "query executor"
```
Expected: PASS.
- [ ] **Step 7: Commit**
Run:
```bash
git add \
packages/cli/src/ingest-query-executor.ts \
packages/cli/src/ingest-query-executor.test.ts \
packages/cli/src/ingest.ts \
packages/cli/src/ingest.test.ts
git commit -m "fix(cli): enable read-only SQL probes for local ingest"
```
### Task 5: Final verification
**Files:**
- Verify: all files changed by Tasks 1-4.
- [ ] **Step 1: Run focused context tests**
Run:
```bash
pnpm --filter @ktx/context exec vitest run \
src/ingest/tools/warehouse-verification/warehouse-catalog.service.test.ts \
src/ingest/tools/warehouse-verification/entity-details.tool.test.ts \
src/ingest/tools/warehouse-verification/discover-data.tool.test.ts \
src/ingest/tools/warehouse-verification/sql-execution.tool.test.ts \
src/ingest/local-bundle-runtime.test.ts \
src/ingest/local-adapters.test.ts \
src/ingest/adapters/lookml/lookml.adapter.test.ts \
src/ingest/adapters/metricflow/metricflow.adapter.test.ts \
src/ingest/ingest-bundle.runner.test.ts
```
Expected: PASS.
- [ ] **Step 2: Run focused CLI tests**
Run:
```bash
pnpm --filter @ktx/cli exec vitest run src/ingest-query-executor.test.ts src/ingest.test.ts
```
Expected: PASS.
- [ ] **Step 3: Run type checks**
Run:
```bash
pnpm --filter @ktx/context run type-check
pnpm --filter @ktx/cli run type-check
```
Expected: both commands pass.
- [ ] **Step 4: Run pre-commit on changed files if configured**
Run:
```bash
uv run pre-commit run --files \
packages/context/src/ingest/tools/warehouse-verification/warehouse-catalog.service.ts \
packages/context/src/ingest/tools/warehouse-verification/discover-data.tool.ts \
packages/context/src/ingest/tools/warehouse-verification/discover-data.tool.test.ts \
packages/context/src/ingest/adapters/lookml/lookml.adapter.ts \
packages/context/src/ingest/adapters/lookml/lookml.adapter.test.ts \
packages/context/src/ingest/adapters/metricflow/metricflow.adapter.ts \
packages/context/src/ingest/adapters/metricflow/metricflow.adapter.test.ts \
packages/context/src/ingest/local-adapters.ts \
packages/context/src/ingest/local-adapters.test.ts \
packages/context/src/ingest/local-bundle-runtime.ts \
packages/context/src/ingest/local-bundle-runtime.test.ts \
packages/context/src/ingest/local-ingest.ts \
packages/cli/src/ingest-query-executor.ts \
packages/cli/src/ingest-query-executor.test.ts \
packages/cli/src/ingest.ts \
packages/cli/src/ingest.test.ts \
docs/superpowers/plans/2026-05-12-warehouse-verification-final-v1-closure.md
```
Expected: PASS. If the repository has no pre-commit config or the local `uv`
version cannot satisfy the configured toolchain, record the exact error and use
the focused test and type-check results as the closest verification.
- [ ] **Step 5: Commit final verification fixes if any were needed**
If verification required edits, run:
```bash
git add <changed-files>
git commit -m "test: cover warehouse verification v1 closure"
```
If verification required no edits, do not create an empty commit.
## Self-review
Spec coverage:
- Raw warehouse discovery still covers wiki, semantic-layer, and raw schema
results, and now raw hits include the connection name needed by the required
`entity_details` follow-up.
- Every local synthesis adapter with an external source connection now has a
path to target warehouse IDs: dbt and Notion already had it, Looker resolves
staged mappings, Metabase fan-out runs under target warehouse IDs, and this
plan adds LookML and MetricFlow.
- `sql_execution` remains scoped by `allowedConnectionNames`, retains the
read-only SQL wrapper, and gains a normal local ingest execution backend.
Placeholder scan:
- This plan contains no deferred implementation placeholders.
- Every code-changing step includes the exact test or implementation snippet to
add.
Type consistency:
- `connectionName` is added to `RawSchemaHit` and used by `DiscoverDataTool`.
- `targetConnectionIds` and `listTargetConnectionIds()` match the existing dbt
and Notion adapter pattern.
- Local ingest uses `KtxSqlQueryExecutorPort` consistently from CLI to context.


@ -0,0 +1,345 @@
# Warehouse Verification Prompt Shape Closure Implementation Plan
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
**Goal:** Make every warehouse-verification prompt use KTX's shipped
`sql_execution` input shape so ingest agents include `connectionName` when they
probe warehouse identifiers.
**Architecture:** Keep the warehouse verification tool code unchanged. Add
prompt-asset tests that reject Kaelio's old session-only SQL examples, then
update the shared identifier protocol and the three remaining per-skill SQL
probe examples that still show the legacy shape.
**Tech Stack:** Markdown skill prompts, TypeScript, Vitest, pnpm workspace
commands.
---
## Audit Summary
The warehouse verification tools, runner wiring, adapter target fan-out, and
focused tests are present. Focused verification passed:
```bash
pnpm --filter @ktx/context exec vitest run src/connections/dialects.test.ts src/connections/read-only-sql.test.ts src/ingest/tools/warehouse-verification/warehouse-catalog.service.test.ts src/ingest/tools/warehouse-verification/entity-details.tool.test.ts src/ingest/tools/warehouse-verification/sql-execution.tool.test.ts src/ingest/tools/warehouse-verification/discover-data.tool.test.ts src/ingest/ingest-prompts.test.ts src/ingest/ingest-runtime-assets.test.ts src/memory/memory-runtime-assets.test.ts src/ingest/local-adapters.test.ts src/ingest/adapters/notion/notion.adapter.test.ts src/ingest/adapters/lookml/lookml.adapter.test.ts src/ingest/adapters/metricflow/metricflow.adapter.test.ts
pnpm --filter @ktx/cli exec vitest run src/ingest-query-executor.test.ts src/ingest.test.ts -t "supplies a scan-connector query executor"
```
Remaining v1-blocking gap:
- `packages/context/skills/lookml_ingest/SKILL.md`,
`packages/context/skills/metricflow_ingest/SKILL.md`, and
`packages/context/skills/sl_capture/SKILL.md` still contain
`sql_execution({ sql ... })` / "session shape" guidance inherited from
Kaelio. KTX's tool contract is
`sql_execution({connectionName, sql, rowLimit?})`, so these examples can make
agents call the shipped tool with invalid input.
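The difference between the two call shapes can be sketched as a type guard. `SqlExecutionInputSketch` and `isSqlExecutionInput` are hypothetical names written for this example; the only thing taken from the plan is the contract `sql_execution({connectionName, sql, rowLimit?})`.

```ts
// Hypothetical mirror of the documented contract: connectionName and sql are required.
type SqlExecutionInputSketch = {
  connectionName: string;
  sql: string;
  rowLimit?: number;
};

// Returns true only when the value matches the contract shape above.
function isSqlExecutionInput(value: unknown): value is SqlExecutionInputSketch {
  if (typeof value !== 'object' || value === null) return false;
  const candidate = value as Record<string, unknown>;
  const connectionName = candidate['connectionName'];
  const sql = candidate['sql'];
  const rowLimit = candidate['rowLimit'];
  return (
    typeof connectionName === 'string' &&
    connectionName.length > 0 &&
    typeof sql === 'string' &&
    (rowLimit === undefined || typeof rowLimit === 'number')
  );
}

// Legacy session-only shape: no connectionName, so it is rejected.
console.log(isSqlExecutionInput({ sql: 'SELECT 1' })); // false

// KTX shape: accepted, with or without rowLimit.
console.log(isSqlExecutionInput({ connectionName: 'warehouse', sql: 'SELECT 1' })); // true
console.log(isSqlExecutionInput({ connectionName: 'warehouse', sql: 'SELECT 1', rowLimit: 50 })); // true
```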
Non-blocking gaps remain out of scope for this v1 plan:
- Full DDL-style `entity_details` formatting with FK profile summaries.
- AST-backed SQL validation for data-modifying CTE bodies.
- Search over generated `enrichment/descriptions.json`.
- Per-WorkUnit reuse of a single `WarehouseCatalogService` instance for cache
hits across separate tool calls.
- A deterministic fake-LLM end-to-end Notion hallucination regression. Prompt
guards and tool contract tests cover the v1 contract; a broader behavior
regression can land as follow-up.
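Why data-modifying CTE bodies call for AST-backed validation can be shown with a deliberately naive check. `naivePrefixLooksReadOnly` is a hypothetical function written for this sketch and is not KTX's read-only SQL wrapper. The example assumes a PostgreSQL-style dialect, where a `WITH` clause may contain `DELETE ... RETURNING`.

```ts
// Hypothetical naive check: treats any statement starting with SELECT or WITH as read-only.
function naivePrefixLooksReadOnly(sql: string): boolean {
  return /^\s*(select|with)\b/i.test(sql);
}

// A plain query passes, as expected.
console.log(naivePrefixLooksReadOnly('SELECT 1')); // true

// An obvious write is rejected.
console.log(naivePrefixLooksReadOnly('DELETE FROM public.orders')); // false

// A data-modifying CTE also passes the prefix check, although it deletes rows.
const dataModifyingCte =
  'WITH removed AS (DELETE FROM public.orders RETURNING *) SELECT count(*) FROM removed';
console.log(naivePrefixLooksReadOnly(dataModifyingCte)); // true
```

Catching the last case requires inspecting the statement nodes inside the `WITH` clause, which a prefix check cannot do.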
## File Structure
Modify these files:
- `packages/context/src/memory/memory-runtime-assets.test.ts`: add a prompt
guard that rejects the legacy session-only `sql_execution` shape.
- `packages/context/src/ingest/ingest-runtime-assets.test.ts`: strengthen the
shared prompt asset assertion for the KTX `connectionName` SQL shape.
- `packages/context/skills/_shared/identifier-verification.md`: make both SQL
probe instructions show the KTX `connectionName` argument.
- `packages/context/skills/notion_synthesize/SKILL.md`: inline the updated
protocol block.
- `packages/context/skills/dbt_ingest/SKILL.md`: inline the updated protocol
block.
- `packages/context/skills/lookml_ingest/SKILL.md`: inline the updated protocol
block and fix the legacy SQL fallback example.
- `packages/context/skills/looker_ingest/SKILL.md`: inline the updated
protocol block.
- `packages/context/skills/metabase_ingest/SKILL.md`: inline the updated
protocol block.
- `packages/context/skills/metricflow_ingest/SKILL.md`: inline the updated
protocol block and fix the legacy SQL fallback example.
- `packages/context/skills/live_database_ingest/SKILL.md`: inline the updated
protocol block.
- `packages/context/skills/historic_sql_table_digest/SKILL.md`: inline the
updated protocol block.
- `packages/context/skills/historic_sql_patterns/SKILL.md`: inline the updated
protocol block.
- `packages/context/skills/knowledge_capture/SKILL.md`: inline the updated
protocol block.
- `packages/context/skills/sl_capture/SKILL.md`: inline the updated protocol
block and fix the join-discovery SQL example.
### Task 1: Add Prompt Guards For The KTX SQL Tool Shape
**Files:**
- Modify: `packages/context/src/memory/memory-runtime-assets.test.ts`
- Modify: `packages/context/src/ingest/ingest-runtime-assets.test.ts`
- [ ] **Step 1: Add the failing memory asset guard**
In `packages/context/src/memory/memory-runtime-assets.test.ts`, add this test
after `does not ship stale warehouse verification tool names or fictional
identifiers`:
```ts
it('ships only the KTX connectionName sql_execution call shape in writer guidance', async () => {
const shared = await readFile(join(skillsDir, '_shared', 'identifier-verification.md'), 'utf-8');
expect(shared).toContain('sql_execution({connectionName, sql: "SELECT DISTINCT');
expect(shared).toContain('sql_execution({connectionName, sql: "SELECT 1 FROM');
for (const skillName of verificationWriterSkills) {
const body = await readFile(join(skillsDir, skillName, 'SKILL.md'), 'utf-8');
expect(body).toContain('sql_execution({connectionName');
expect(body).not.toContain('sql_execution({ sql');
expect(body).not.toContain('session shape');
expect(body).not.toContain('connection is already pinned by the ingest session');
}
});
```
- [ ] **Step 2: Strengthen the shared ingest asset guard**
In `packages/context/src/ingest/ingest-runtime-assets.test.ts`, update
`packages identifier verification prompt assets` so the final assertions are:
```ts
expect(shared).toContain('discover_data');
expect(shared).toContain('entity_details');
expect(shared).toContain('sql_execution');
expect(shared).toContain('sql_execution({connectionName, sql: "SELECT DISTINCT');
expect(shared).toContain('sql_execution({connectionName, sql: "SELECT 1 FROM');
```
- [ ] **Step 3: Run the failing prompt guards**
Run:
```bash
pnpm --filter @ktx/context exec vitest run src/memory/memory-runtime-assets.test.ts src/ingest/ingest-runtime-assets.test.ts
```
Expected: FAIL. The failure must mention at least one current legacy string:
`sql_execution({ sql`, `session shape`, or missing
`sql_execution({connectionName`.
### Task 2: Update The Shared Identifier Verification Protocol
**Files:**
- Modify: `packages/context/skills/_shared/identifier-verification.md`
- Modify: `packages/context/skills/notion_synthesize/SKILL.md`
- Modify: `packages/context/skills/dbt_ingest/SKILL.md`
- Modify: `packages/context/skills/lookml_ingest/SKILL.md`
- Modify: `packages/context/skills/looker_ingest/SKILL.md`
- Modify: `packages/context/skills/metabase_ingest/SKILL.md`
- Modify: `packages/context/skills/metricflow_ingest/SKILL.md`
- Modify: `packages/context/skills/live_database_ingest/SKILL.md`
- Modify: `packages/context/skills/historic_sql_table_digest/SKILL.md`
- Modify: `packages/context/skills/historic_sql_patterns/SKILL.md`
- Modify: `packages/context/skills/knowledge_capture/SKILL.md`
- Modify: `packages/context/skills/sl_capture/SKILL.md`
- [ ] **Step 1: Replace the shared protocol text**
Replace the full `## Identifier Verification Protocol` block in
`packages/context/skills/_shared/identifier-verification.md` with:
```md
## Identifier Verification Protocol
Before writing a wiki page or SL source on any topic:
1. `discover_data({query: "<topic>"})` - see what wikis, SL sources, and raw
tables already exist. Prefer updating existing pages over creating new ones.
Before emitting any `schema.table` or `schema.table.column` into a wiki body,
SL source, `tables:` frontmatter, `sl_refs`, or `emit_unmapped_fallback`:
2. `entity_details({connectionName, targets: [{display: "<identifier>"}]})` -
confirm the identifier resolves; inspect native types, FK/PK, and
sampleValues.
3. For literal values from the source, such as status codes or plan tiers,
check whether they appear in `entity_details` sampleValues for the relevant
column. If sampleValues is short or the sample may have missed real values,
run a `sql_execution` probe with the same warehouse connection name:
`sql_execution({connectionName, sql: "SELECT DISTINCT <col> FROM <ref> LIMIT 50"})`.
4. If the candidate identifier still does not resolve, do one of:
- Use `sql_execution({connectionName, sql: "SELECT 1 FROM <ref> LIMIT 0"})`.
If it errors, the identifier is fictional.
- Wrap the identifier in `[unverified - from <rawPath>]` in the wiki body,
citing the exact raw path that mentioned it.
- When recording `emit_unmapped_fallback` with `no_physical_table`, include
the failing probe error in `clarification`.
5. Never copy `<schema>.<table>` placeholder strings from these instructions
into output.
```
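The two probe shapes named in steps 3 and 4 of the protocol can be sketched as small builders. This is a minimal illustration, not KTX code: `SqlProbeCall`, `distinctValuesProbe`, `existenceProbe`, and `literalNeedsProbe` are hypothetical names, identifiers are interpolated without quoting, and the `LIMIT` syntax is taken from the protocol text as-is, so a dialect without `LIMIT` would need a different probe.

```typescript
// Hypothetical helpers for illustration only; none of these names exist in KTX.
interface SqlProbeCall {
  connectionName: string;
  sql: string;
}

// Step 3 probe: list distinct values when sampleValues may be incomplete.
function distinctValuesProbe(connectionName: string, ref: string, column: string): SqlProbeCall {
  return { connectionName, sql: `SELECT DISTINCT ${column} FROM ${ref} LIMIT 50` };
}

// Step 4 probe: LIMIT 0 returns no rows, so only success or error matters.
function existenceProbe(connectionName: string, ref: string): SqlProbeCall {
  return { connectionName, sql: `SELECT 1 FROM ${ref} LIMIT 0` };
}

// Step 3 check: a literal counts as confirmed only if it appears in sampleValues.
function literalNeedsProbe(literal: string, sampleValues: readonly string[]): boolean {
  return sampleValues.indexOf(literal) === -1;
}

const exampleProbe = existenceProbe('warehouse', 'analytics.orders');
console.log(exampleProbe.connectionName, exampleProbe.sql);
```

Both builders always carry `connectionName`, which is the property the prompt guards in Task 1 assert on.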
- [ ] **Step 2: Inline the same protocol in every writer skill**
Replace the existing `## Identifier Verification Protocol` block in each writer
skill with the exact block from Step 1:
```text
packages/context/skills/notion_synthesize/SKILL.md
packages/context/skills/dbt_ingest/SKILL.md
packages/context/skills/lookml_ingest/SKILL.md
packages/context/skills/looker_ingest/SKILL.md
packages/context/skills/metabase_ingest/SKILL.md
packages/context/skills/metricflow_ingest/SKILL.md
packages/context/skills/live_database_ingest/SKILL.md
packages/context/skills/historic_sql_table_digest/SKILL.md
packages/context/skills/historic_sql_patterns/SKILL.md
packages/context/skills/knowledge_capture/SKILL.md
packages/context/skills/sl_capture/SKILL.md
```
- [ ] **Step 3: Run the shared prompt asset tests**
Run:
```bash
pnpm --filter @ktx/context exec vitest run src/memory/memory-runtime-assets.test.ts src/ingest/ingest-runtime-assets.test.ts
```
Expected: still FAIL because the per-skill legacy SQL examples in LookML,
MetricFlow, and `sl_capture` have not been fixed yet.
### Task 3: Fix Legacy Per-Skill SQL Examples
**Files:**
- Modify: `packages/context/skills/lookml_ingest/SKILL.md`
- Modify: `packages/context/skills/metricflow_ingest/SKILL.md`
- Modify: `packages/context/skills/sl_capture/SKILL.md`
- [ ] **Step 1: Fix the LookML fallback probe example**
In `packages/context/skills/lookml_ingest/SKILL.md`, replace the current
Required flow item 2 with:
```md
2. If the table isn't in the manifest, use the warehouse `connectionName`
returned by `discover_data` or the target connection chosen from
`sl_discover`, then call a dialect-appropriate SQL probe with that
connection name, for example:
`sql_execution({connectionName: "warehouse", sql: "SELECT 1 FROM analytics.orders LIMIT 0"})`.
Replace `warehouse`, `analytics`, and `orders` with the verified connection,
schema or dataset, and table from the WorkUnit evidence.
```
- [ ] **Step 2: Fix the MetricFlow fallback probe example**
In `packages/context/skills/metricflow_ingest/SKILL.md`, replace the paragraph
that begins `If \`sl_discover\` errors` with:
```md
If `sl_discover` errors because no such table exists, use `discover_data` and
`entity_details` to find the warehouse target. If a SQL probe is still needed,
call `sql_execution` with the same warehouse connection name, for example:
`sql_execution({connectionName: "warehouse", sql: "SELECT 1 FROM analytics.orders LIMIT 0"})`.
**Never invent column names** - every column in `columns:`, `grain:`, and
`sql:` must be sourced from raw files, `entity_details`, or a successful SQL
probe.
```
- [ ] **Step 3: Fix the `sl_capture` join probe example**
In `packages/context/skills/sl_capture/SKILL.md`, replace Tool sequence item 6
with:
```md
6. For join discovery: use `sql_execution({connectionName: "warehouse", sql: "SELECT count(*) FROM public.orders o JOIN public.customers c ON c.id = o.customer_id LIMIT 20"})` with the target warehouse connection name and dialect-correct table names to verify the join key exists in both tables and assess cardinality before declaring the join.
```
- [ ] **Step 4: Run the prompt asset tests**
Run:
```bash
pnpm --filter @ktx/context exec vitest run src/memory/memory-runtime-assets.test.ts src/ingest/ingest-runtime-assets.test.ts
```
Expected: PASS. The tests must report 2 files passed.
### Task 4: Final Verification
**Files:**
- No new files.
- [ ] **Step 1: Run focused warehouse prompt and tool tests**
Run:
```bash
pnpm --filter @ktx/context exec vitest run src/connections/dialects.test.ts src/connections/read-only-sql.test.ts src/ingest/tools/warehouse-verification/warehouse-catalog.service.test.ts src/ingest/tools/warehouse-verification/entity-details.tool.test.ts src/ingest/tools/warehouse-verification/sql-execution.tool.test.ts src/ingest/tools/warehouse-verification/discover-data.tool.test.ts src/ingest/ingest-prompts.test.ts src/ingest/ingest-runtime-assets.test.ts src/memory/memory-runtime-assets.test.ts
```
Expected: PASS.
- [ ] **Step 2: Run package type-check**
Run:
```bash
pnpm --filter @ktx/context run type-check
```
Expected: PASS.
- [ ] **Step 3: Inspect final diff**
Run:
```bash
git diff -- packages/context/src/memory/memory-runtime-assets.test.ts packages/context/src/ingest/ingest-runtime-assets.test.ts packages/context/skills/_shared/identifier-verification.md packages/context/skills/notion_synthesize/SKILL.md packages/context/skills/dbt_ingest/SKILL.md packages/context/skills/lookml_ingest/SKILL.md packages/context/skills/looker_ingest/SKILL.md packages/context/skills/metabase_ingest/SKILL.md packages/context/skills/metricflow_ingest/SKILL.md packages/context/skills/live_database_ingest/SKILL.md packages/context/skills/historic_sql_table_digest/SKILL.md packages/context/skills/historic_sql_patterns/SKILL.md packages/context/skills/knowledge_capture/SKILL.md packages/context/skills/sl_capture/SKILL.md
```
Expected: only prompt wording and prompt-asset guards changed. No tool
implementation files changed.
- [ ] **Step 4: Commit**
Run:
```bash
git add packages/context/src/memory/memory-runtime-assets.test.ts packages/context/src/ingest/ingest-runtime-assets.test.ts packages/context/skills/_shared/identifier-verification.md packages/context/skills/notion_synthesize/SKILL.md packages/context/skills/dbt_ingest/SKILL.md packages/context/skills/lookml_ingest/SKILL.md packages/context/skills/looker_ingest/SKILL.md packages/context/skills/metabase_ingest/SKILL.md packages/context/skills/metricflow_ingest/SKILL.md packages/context/skills/live_database_ingest/SKILL.md packages/context/skills/historic_sql_table_digest/SKILL.md packages/context/skills/historic_sql_patterns/SKILL.md packages/context/skills/knowledge_capture/SKILL.md packages/context/skills/sl_capture/SKILL.md
git commit -m "fix(context): align warehouse sql probe prompt shape"
```
Expected: one focused commit.
## Self-Review
Spec coverage:
- The original spec requires `sql_execution` inputs to include
`connectionName`; this plan removes contradictory session-only examples from
all active writer guidance.
- The shared protocol remains in `_shared` and inlined in every synthesis
writer skill named by the original spec.
- The tool implementation remains unchanged because the shipped schema already
enforces the v1 contract.
Placeholder scan:
- The plan has no deferred implementation markers.
- Prompt examples use concrete `warehouse`, `analytics`, and `orders` example
names only to demonstrate JSON shape, and each example tells the worker to
replace them with discovered evidence.
Type consistency:
- Tests assert the exact KTX tool call shape:
`sql_execution({connectionName, sql: ...})`.
- Prompt wording consistently uses `connectionName`, matching
`packages/context/src/ingest/tools/warehouse-verification/sql-execution.tool.ts`.


@ -0,0 +1,215 @@
# Warehouse Verification SQL Example Closure Implementation Plan
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
**Goal:** Remove the last connectionless `sql_execution` prompt example so
warehouse-verification writer guidance always matches KTX's shipped tool
contract.
**Architecture:** Keep the warehouse verification tool code unchanged. Tighten
the prompt asset guard so multiline `sql_execution({ sql: ... })` examples
fail tests, then update the stale `sl_capture` worked example to pass
`connectionName` explicitly.
**Tech Stack:** Markdown skill prompts, TypeScript, Vitest, pnpm workspace
commands.
---
## Audit summary
The warehouse verification tools, runner wiring, source-adapter target fan-out,
CLI query executor, and focused tests are present. Focused verification passed:
```bash
pnpm --filter @ktx/context exec vitest run src/connections/dialects.test.ts src/connections/read-only-sql.test.ts src/ingest/tools/warehouse-verification/warehouse-catalog.service.test.ts src/ingest/tools/warehouse-verification/entity-details.tool.test.ts src/ingest/tools/warehouse-verification/sql-execution.tool.test.ts src/ingest/tools/warehouse-verification/discover-data.tool.test.ts src/ingest/ingest-prompts.test.ts src/ingest/ingest-runtime-assets.test.ts src/memory/memory-runtime-assets.test.ts src/ingest/local-adapters.test.ts src/ingest/adapters/notion/notion.adapter.test.ts src/ingest/adapters/lookml/lookml.adapter.test.ts src/ingest/adapters/metricflow/metricflow.adapter.test.ts
pnpm --filter @ktx/cli exec vitest run src/ingest-query-executor.test.ts src/ingest.test.ts -t "supplies a scan-connector query executor"
```
Remaining v1-blocking gap:
- `packages/context/skills/sl_capture/SKILL.md` still contains a worked example
with a multiline `sql_execution({ sql: ... })` call. KTX's tool contract is
`sql_execution({connectionName, sql, rowLimit?})`, so this example can teach
agents to call the shipped tool with invalid input.
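To make the gap concrete, the sketch below checks a candidate input against the `sql_execution({connectionName, sql, rowLimit?})` shape. It is an assumption-laden illustration: `checkSqlExecutionInput` is a hypothetical helper, the shipped schema in `sql-execution.tool.ts` may validate differently, and the slug pattern is the one quoted in the warehouse verification design spec.

```typescript
// Illustrative only: the shipped schema lives in sql-execution.tool.ts and may differ.
interface SqlExecutionInput {
  connectionName: string;
  sql: string;
  rowLimit?: number;
}

// Connection slug pattern as quoted in the design spec.
const CONNECTION_NAME_PATTERN = /^[a-zA-Z0-9][a-zA-Z0-9_-]*$/;

// Returns a list of problems; an empty list means the shape looks valid.
function checkSqlExecutionInput(input: unknown): string[] {
  if (typeof input !== 'object' || input === null) {
    return ['input must be an object'];
  }
  const { connectionName, sql, rowLimit } = input as Partial<Record<keyof SqlExecutionInput, unknown>>;
  const problems: string[] = [];
  if (typeof connectionName !== 'string') {
    problems.push('connectionName is required');
  } else if (!CONNECTION_NAME_PATTERN.test(connectionName)) {
    problems.push('connectionName is not a valid slug');
  }
  if (typeof sql !== 'string' || sql.trim() === '') {
    problems.push('sql is required');
  }
  if (rowLimit !== undefined && typeof rowLimit !== 'number') {
    problems.push('rowLimit must be a number when present');
  }
  return problems;
}

// The stale worked example omits connectionName, so it is reported as invalid.
console.log(checkSqlExecutionInput({ sql: 'SELECT 1' }));
```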
Non-blocking gaps remain out of scope for this v1 plan:
- Full DDL-style `entity_details` formatting with FK and profile summaries.
- AST-backed SQL validation for data-modifying CTE bodies.
- Search over generated `enrichment/descriptions.json`.
- Per-WorkUnit reuse of a single `WarehouseCatalogService` instance for cache
hits across separate tool calls.
- A deterministic fake-LLM end-to-end Notion hallucination regression.
- Tokenized or embedding-backed raw schema search ranking in `discover_data`.
## File structure
Modify these files:
- `packages/context/src/memory/memory-runtime-assets.test.ts`: add a prompt
guard that catches multiline `sql_execution` calls without `connectionName`.
- `packages/context/skills/sl_capture/SKILL.md`: update the stale worked
example to include the target warehouse `connectionName`.
### Task 1: Add a multiline SQL prompt guard
**Files:**
- Modify: `packages/context/src/memory/memory-runtime-assets.test.ts`
- [ ] **Step 1: Add a helper that extracts `sql_execution` call examples**
In `packages/context/src/memory/memory-runtime-assets.test.ts`, add this helper
after `forbiddenProductPattern()`:
```ts
function sqlExecutionCallBlocks(body: string): string[] {
  const blocks: string[] = [];
  const marker = 'sql_execution({';
  let offset = 0;
  while (offset < body.length) {
    const start = body.indexOf(marker, offset);
    if (start === -1) {
      break;
    }
    const end = body.indexOf('})', start + marker.length);
    blocks.push(body.slice(start, end === -1 ? start + marker.length : end + 2));
    offset = start + marker.length;
  }
  return blocks;
}
```
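To see what the helper extracts, the sketch below repeats it so the example runs on its own and applies it to a made-up prompt body. `sampleBody` is invented for illustration and is not taken from any skill file.

```typescript
// Copy of the plan's helper, repeated so this example is self-contained.
function sqlExecutionCallBlocks(body: string): string[] {
  const blocks: string[] = [];
  const marker = 'sql_execution({';
  let offset = 0;
  while (offset < body.length) {
    const start = body.indexOf(marker, offset);
    if (start === -1) {
      break;
    }
    // A call block ends at the first "})" after the marker.
    const end = body.indexOf('})', start + marker.length);
    blocks.push(body.slice(start, end === -1 ? start + marker.length : end + 2));
    offset = start + marker.length;
  }
  return blocks;
}

// Made-up prompt text: one valid inline call and one stale multiline call.
const sampleBody = [
  'Probe with `sql_execution({connectionName, sql: "SELECT 1 FROM t LIMIT 0"})`.',
  'sql_execution({',
  '  sql: "SELECT COUNT(*) FROM a"',
  '})',
].join('\n');

const extracted = sqlExecutionCallBlocks(sampleBody);
const connectionless = extracted.filter((call) => call.indexOf('connectionName') === -1);
console.log(extracted.length, connectionless.length);
```

Because extraction stops at the first `})`, SQL text that itself contains `})` would be cut short. That is acceptable for a prompt guard but worth keeping in mind when reading failures.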
- [ ] **Step 2: Strengthen the existing SQL-shape test**
Replace the body of
`ships only the KTX connectionName sql_execution call shape in writer guidance`
with:
```ts
const shared = await readFile(join(skillsDir, '_shared', 'identifier-verification.md'), 'utf-8');
const bodies = [{ name: '_shared/identifier-verification.md', body: shared }];
expect(shared).toContain('sql_execution({connectionName, sql: "SELECT DISTINCT');
expect(shared).toContain('sql_execution({connectionName, sql: "SELECT 1 FROM');
for (const skillName of verificationWriterSkills) {
  const body = await readFile(join(skillsDir, skillName, 'SKILL.md'), 'utf-8');
  bodies.push({ name: `${skillName}/SKILL.md`, body });
  expect(body).toContain('sql_execution({connectionName');
  expect(body).not.toContain('sql_execution({ sql');
  expect(body).not.toContain('session shape');
  expect(body).not.toContain('connection is already pinned by the ingest session');
}
for (const { name, body } of bodies) {
  const calls = sqlExecutionCallBlocks(body);
  expect(calls.length, `${name} should contain sql_execution guidance`).toBeGreaterThan(0);
  expect(
    calls.filter((call) => !call.includes('connectionName')),
    `${name} has sql_execution calls without connectionName`,
  ).toEqual([]);
  expect(body, `${name} has a connectionless multiline sql_execution call`).not.toMatch(
    /sql_execution\(\{\s*sql\s*:/,
  );
}
```
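The regex assertion and the per-block `connectionName` check overlap but are not identical. The sketch below runs both on invented sample strings to show which shapes each one flags. The sample calls are hypothetical, and the block check is reduced to a plain substring test.

```typescript
// Same pattern as the guard: "sql" is the first key, with any whitespace before it.
const connectionlessCall = /sql_execution\(\{\s*sql\s*:/;

// Invented sample calls; none are copied from a real skill file.
const samples: Record<string, string> = {
  multiline: 'sql_execution({\n  sql: "SELECT 1"\n})',
  spaced: 'sql_execution({ sql: "SELECT 1" })',
  valid: 'sql_execution({connectionName, sql: "SELECT 1"})',
  otherKeyFirst: 'sql_execution({rowLimit: 5, sql: "SELECT 1"})',
};

for (const name of Object.keys(samples)) {
  const call = samples[name];
  const flaggedByRegex = connectionlessCall.test(call);
  const flaggedBySubstring = call.indexOf('connectionName') === -1;
  console.log(name, flaggedByRegex, flaggedBySubstring);
}
```

The `otherKeyFirst` case is the reason to keep both checks: the regex misses it because `sql` is not the first key, while the substring check still reports the missing `connectionName`.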
- [ ] **Step 3: Run the failing prompt guard**
Run:
```bash
pnpm --filter @ktx/context exec vitest run src/memory/memory-runtime-assets.test.ts -t "connectionName sql_execution"
```
Expected: FAIL. The failure must identify
`sl_capture/SKILL.md` as having a `sql_execution` call without
`connectionName` or a connectionless multiline `sql_execution` call.
- [ ] **Step 4: Commit the failing guard**
Run:
```bash
git add packages/context/src/memory/memory-runtime-assets.test.ts
git commit -m "test(context): catch connectionless sql execution prompt examples"
```
### Task 2: Fix the stale `sl_capture` SQL example
**Files:**
- Modify: `packages/context/skills/sl_capture/SKILL.md`
- Test: `packages/context/src/memory/memory-runtime-assets.test.ts`
- Test: `packages/context/src/ingest/ingest-runtime-assets.test.ts`
- [ ] **Step 1: Update the worked example**
In `packages/context/skills/sl_capture/SKILL.md`, replace the `sql_execution`
block in "Worked example - new join" with:
```md
sql_execution({
connectionName: "warehouse",
sql: "SELECT COUNT(*), COUNT(DISTINCT a.admin_user_id) FROM public.fct_orders a JOIN public.fct_mau_multiprotocol b ON a.admin_user_id = b.admin_user_id LIMIT 1"
})
```
- [ ] **Step 2: Run the prompt guards**
Run:
```bash
pnpm --filter @ktx/context exec vitest run src/memory/memory-runtime-assets.test.ts src/ingest/ingest-runtime-assets.test.ts
```
Expected: PASS.
- [ ] **Step 3: Run a direct stale-shape scan**
Run:
```bash
rg -n -U "sql_execution\\(\\{\\s*\\n\\s*sql:" packages/context/skills packages/context/prompts
```
Expected: no matches and exit code 1.
- [ ] **Step 4: Run the context type-check**
Run:
```bash
pnpm --filter @ktx/context run type-check
```
Expected: PASS.
- [ ] **Step 5: Commit the prompt fix**
Run:
```bash
git add packages/context/skills/sl_capture/SKILL.md
git commit -m "fix(context): include connection name in sl capture sql example"
```
## Self-review
Spec coverage:
- The only remaining v1-blocking prompt-shape gap has a failing test and a
direct prompt edit.
- Tool implementation, runner wiring, adapter scoping, and CLI execution
remain covered by the focused suites listed in the audit summary.
Placeholder scan:
- This plan contains no deferred implementation placeholders.
Type consistency:
- The plan uses the shipped KTX tool shape:
`sql_execution({connectionName, sql, rowLimit?})`.


@ -0,0 +1,236 @@
# Warehouse Verification Structured Target Miss Closure Implementation Plan
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
**Goal:** Make `entity_details` return model-visible not-found evidence for every documented target shape, including structured `{catalog, db, name, column?}` targets.
**Architecture:** Keep the existing warehouse verification module. Add focused tests for missing structured table and column targets, then route structured target labels through the same candidate lookup used by display targets while preserving exact structured resolution.
**Tech Stack:** TypeScript, Node 22, Vitest, AI SDK v6 tools, Zod, KTX ingest tools.
---
## Audit Summary
The implemented plans have landed the warehouse verification tools, ingest
runner wiring, adapter warehouse target fan-out, CLI read-only query executor,
and prompt-shape closures. Focused verification passed on May 13, 2026:
```bash
pnpm --filter @ktx/context exec vitest run src/connections/dialects.test.ts src/connections/read-only-sql.test.ts src/ingest/tools/warehouse-verification/warehouse-catalog.service.test.ts src/ingest/tools/warehouse-verification/entity-details.tool.test.ts src/ingest/tools/warehouse-verification/sql-execution.tool.test.ts src/ingest/tools/warehouse-verification/discover-data.tool.test.ts src/ingest/ingest-prompts.test.ts src/ingest/ingest-runtime-assets.test.ts src/memory/memory-runtime-assets.test.ts src/ingest/local-adapters.test.ts src/ingest/adapters/notion/notion.adapter.test.ts src/ingest/adapters/lookml/lookml.adapter.test.ts src/ingest/adapters/metricflow/metricflow.adapter.test.ts
pnpm --filter @ktx/cli exec vitest run src/ingest-query-executor.test.ts src/ingest.test.ts -t "supplies a scan-connector query executor"
rg -n -U "sql_execution\\(\\{\\s*\\n\\s*sql:" packages/context/skills packages/context/prompts
rg -n "wiki_sl_search|sl_describe_table|orbit_analytics\\.customer" packages/context/skills packages/context/prompts packages/context/src/ingest/tools/emit-unmapped-fallback.tool.ts packages/context/src/sl/tools/sl-warehouse-validation.ts
```
Remaining v1-blocking gap:
- `entity_details` accepts structured targets, but if a structured table target
does not exist, it records `structured.missing` and emits no markdown. Tool
outputs are sent to the model as markdown only, so the synthesis agent gets
an empty response instead of the required "Not found in scan" verification
signal.
Non-blocking gaps remain out of scope for this v1 plan:
- Full DDL-style `entity_details` formatting with FK and profile summaries.
- AST-backed SQL validation for data-modifying CTE bodies.
- Dialect-specific row-limit wrapping for SQL Server probes.
- Search over generated `enrichment/descriptions.json`.
- Per-WorkUnit reuse of a single `WarehouseCatalogService` instance for cache
hits across separate tool calls.
- A deterministic fake-LLM end-to-end Notion hallucination regression.
- Cleanup of legacy demo Orbit wiki fixtures that still mention
`orbit_analytics.customer`.
## File Structure
Modify these files:
- `packages/context/src/ingest/tools/warehouse-verification/entity-details.tool.test.ts`: add failing coverage for missing structured targets.
- `packages/context/src/ingest/tools/warehouse-verification/entity-details.tool.ts`: render missing structured targets into markdown and reuse candidate lookup.
### Task 1: Report Structured Target Misses In `entity_details`
**Files:**
- Modify: `packages/context/src/ingest/tools/warehouse-verification/entity-details.tool.test.ts`
- Modify: `packages/context/src/ingest/tools/warehouse-verification/entity-details.tool.ts`
- [ ] **Step 1: Add failing structured miss tests**
In `packages/context/src/ingest/tools/warehouse-verification/entity-details.tool.test.ts`, add these tests after `reports missing explicit columns instead of returning an empty column list`:
```ts
it('reports missing structured table targets in model-visible markdown', async () => {
  const result = await tool.call(
    {
      connectionName: 'warehouse',
      targets: [{ catalog: null, db: 'public', name: 'orderz' }],
    },
    context,
  );
  expect(result.markdown).toContain('Not found in scan: public.orderz');
  expect(result.markdown).toContain('Closest matches: orders');
  expect(result.structured.resolved).toHaveLength(0);
  expect(result.structured.missing).toHaveLength(1);
});

it('reports missing structured column targets in model-visible markdown', async () => {
  const result = await tool.call(
    {
      connectionName: 'warehouse',
      targets: [{ catalog: null, db: 'public', name: 'orders', column: 'plan_tier' }],
    },
    context,
  );
  expect(result.markdown).toContain('Column not found in scan: public.orders.plan_tier');
  expect(result.markdown).toContain('Available columns: id, status');
  expect(result.structured.resolved).toHaveLength(0);
  expect(result.structured.missing).toHaveLength(1);
});
```
- [ ] **Step 2: Run the failing focused test**
Run:
```bash
pnpm --filter @ktx/context exec vitest run src/ingest/tools/warehouse-verification/entity-details.tool.test.ts -t "structured"
```
Expected: FAIL. The first new test must fail because `result.markdown` does not contain `Not found in scan: public.orderz`.
- [ ] **Step 3: Add structured target labels and candidate lookup**
In `packages/context/src/ingest/tools/warehouse-verification/entity-details.tool.ts`, add this type alias after `type EntityDetailsInput = z.infer<typeof entityDetailsInputSchema>;`:
```ts
type EntityDetailsTarget = EntityDetailsInput['targets'][number];
```
Add these helpers after `function allowedConnectionNames(context: ToolContext): ReadonlySet<string> | null { ... }`:
```ts
function targetLabel(target: EntityDetailsTarget): string {
  if ('display' in target) {
    return target.display;
  }
  return [target.catalog, target.db, target.name, target.column]
    .filter((part): part is string => !!part)
    .join('.');
}

function appendMissingTargetMarkdown(
  parts: string[],
  target: EntityDetailsTarget,
  candidates: KtxTableRef[],
): void {
  parts.push(`Not found in scan: ${targetLabel(target)}`);
  if (candidates.length > 0) {
    parts.push(`Closest matches: ${candidates.map((candidate) => candidate.name).join(', ')}`);
  }
}

async function resolveTarget(
  catalog: WarehouseCatalogService,
  connectionName: string,
  target: EntityDetailsTarget,
): Promise<{ resolved: (KtxTableRef & { column?: string }) | null; candidates: KtxTableRef[] }> {
  if ('display' in target) {
    return catalog.resolveDisplayTarget(connectionName, target.display);
  }
  const candidateResolution = await catalog.resolveDisplayTarget(connectionName, targetLabel(target));
  return {
    resolved: {
      catalog: target.catalog,
      db: target.db,
      name: target.name,
      column: target.column,
    },
    candidates: candidateResolution.candidates,
  };
}
```
Then replace the `const resolution = ...` block inside the `for (const target of input.targets)` loop with:
```ts
const resolution = await resolveTarget(catalog, input.connectionName, target);
```
Replace the missing-resolution block with:
```ts
if (!resolution.resolved) {
  missing.push({ target, candidates: resolution.candidates });
  appendMissingTargetMarkdown(parts, target, resolution.candidates);
  continue;
}
```
Replace the missing-detail block with:
```ts
if (!detail) {
  missing.push({ target, candidates: resolution.candidates });
  appendMissingTargetMarkdown(parts, target, resolution.candidates);
  continue;
}
```
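The labels these helpers produce for structured targets can be checked in isolation. The sketch below uses local stand-in types (`DisplayTarget`, `StructuredTarget`) instead of the real Zod-inferred `EntityDetailsTarget`, and a hypothetical `missingTargetLines` that returns the lines instead of pushing them into `parts`. Treat it as an illustration of the string shapes only.

```typescript
// Local stand-ins for the KTX types; the real definitions may differ.
type DisplayTarget = { display: string };
type StructuredTarget = { catalog: string | null; db: string | null; name: string; column?: string };
type Target = DisplayTarget | StructuredTarget;

// Joins the non-empty parts of a structured target into a dotted label.
function targetLabel(target: Target): string {
  if ('display' in target) {
    return target.display;
  }
  return [target.catalog, target.db, target.name, target.column]
    .filter((part): part is string => !!part)
    .join('.');
}

// Hypothetical variant of appendMissingTargetMarkdown that returns its lines.
function missingTargetLines(target: Target, candidateNames: string[]): string[] {
  const lines = [`Not found in scan: ${targetLabel(target)}`];
  if (candidateNames.length > 0) {
    lines.push(`Closest matches: ${candidateNames.join(', ')}`);
  }
  return lines;
}

const tableMiss = missingTargetLines({ catalog: null, db: 'public', name: 'orderz' }, ['orders']);
console.log(tableMiss.join('\n'));
console.log(targetLabel({ catalog: null, db: 'public', name: 'orders', column: 'plan_tier' }));
```

A `null` catalog is dropped from the label, which is why the Postgres-style targets in the new tests render as `public.orderz` and `public.orders.plan_tier`.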
- [ ] **Step 4: Run the focused entity-details tests**
Run:
```bash
pnpm --filter @ktx/context exec vitest run src/ingest/tools/warehouse-verification/entity-details.tool.test.ts
```
Expected: PASS.
- [ ] **Step 5: Run warehouse verification regression tests**
Run:
```bash
pnpm --filter @ktx/context exec vitest run src/ingest/tools/warehouse-verification/warehouse-catalog.service.test.ts src/ingest/tools/warehouse-verification/entity-details.tool.test.ts src/ingest/tools/warehouse-verification/discover-data.tool.test.ts
```
Expected: PASS.
- [ ] **Step 6: Run context type-check**
Run:
```bash
pnpm --filter @ktx/context run type-check
```
Expected: PASS.
- [ ] **Step 7: Commit**
Run:
```bash
git add \
packages/context/src/ingest/tools/warehouse-verification/entity-details.tool.ts \
packages/context/src/ingest/tools/warehouse-verification/entity-details.tool.test.ts
git commit -m "fix(context): report structured entity detail misses"
```
## Self-review
Spec coverage:
- The original `entity_details` contract accepts a mix of structured and
display targets and requires unresolved targets to produce `Not found in scan`
with candidates. This plan adds that model-visible behavior for structured
table misses and preserves the existing column-miss behavior.
Placeholder scan:
- This plan contains no deferred implementation placeholders.
Type consistency:
- The plan uses the existing `WarehouseCatalogService`, `KtxTableRef`,
`EntityDetailsStructured`, and `ToolOutput` types without adding public API
compatibility wrappers.


@ -0,0 +1,331 @@
# Warehouse Verification Tools for Ingestion Synthesis
**Date:** 2026-05-12
**Author:** Andrey Avtomonov
**Status:** Design — pending implementation plan
## Background and motivation
KTX's ingest pipeline synthesises wiki pages and semantic-layer (SL) sources from third-party content (Notion, LookML, Looker, Metabase, dbt, MetricFlow, historic SQL, live-database scans, and chat). The synthesis stage is an LLM call that runs once per WorkUnit, governed by a skill prompt (e.g. `notion_synthesize`) and a set of allowed tools.
A real-world inspection (project `/tmp/ktx-proj-1`) surfaced two failure modes the synthesis stage produces:
1. **Fictional identifiers laundered into wiki output.** A Notion page mentioned `orbit_analytics.customer` as a legacy "customer source" table with a `plan_tier in {free, pro, enterprise}` column. Neither the table, the column, nor those values exist in the configured warehouse. The synthesis LLM faithfully copied them into `knowledge/global/orbit/customers-source.md` as a "Conflict Note", giving the fabricated names full wiki frontmatter, a `Source:` citation, and apparent authority.
2. **Column attribution drift.** The same wiki page documents columns under `orbit_raw.accounts` but states the `paying_account_count` measure filters on `normalized_plan_code` and `contract_status`. Those columns live on `orbit_analytics.mart_account_segments`, not on `accounts`. A reader (or a downstream agent) following the page will write `accounts.normalized_plan_code` and get a `column does not exist` error.
Root cause analysis (`packages/context/skills/notion_synthesize/SKILL.md`, `packages/context/src/ingest/tools/emit-unmapped-fallback.tool.ts`, `packages/context/src/wiki/tools/wiki-write.tool.ts`) showed three contributing factors:
- The synthesis LLM has no verification primitive that distinguishes a real warehouse identifier from a fabricated one. `sl_discover` only finds objects already promoted into the semantic layer; raw warehouse scans (which already exist on disk under `raw-sources/<conn>/live-database/<sync>/`) are not surfaced to the LLM at all.
- `wiki_write` performs no body-text validation — anything the LLM emits is written.
- The skill prompt itself uses `orbit_analytics.customer` as a canonical example string (`SKILL.md:70`), reinforcing the same fictional name the LLM ends up emitting.
Kaelio's server-side ingest WU agent (`/Users/andrey/conductor/workspaces/kaelio-main2/douala/server/src/tools/toolset-factory.service.ts`) had four verification tools that KTX dropped during the open-source extraction: `discover_data`, `entity_details`, `dictionary_search`, and `sql_execution`. The underlying connector infrastructure (`KtxScanConnector`, dialect classes, `assertReadOnlySql`, `SemanticLayerService.executeQuery`) is present in KTX, so the gap is at the tool layer, not the platform layer.
## Goal
Give every ingest adapter's synthesis-time LLM call the tools and skill-prompt instructions needed to verify warehouse identifiers (`schema.table`, `schema.table.column`) and sample values before emitting them into wiki pages, SL sources, `tables:` frontmatter, `sl_refs`, or `emit_unmapped_fallback` records.
## Non-goals
- Not changing `wiki_write` itself. A complementary spec covers hard write-time validation; this spec focuses on giving the LLM the tools to self-validate.
- Not modifying any Notion fetch/chunk/cluster behaviour.
- Not changing the `_schema/*.yaml` format.
- Not introducing a UUID layer for tables or columns; KTX keeps `(connection, catalog, db, name)` as the canonical table identity.
- Not adding `semantic_query` to the synthesis toolset. `semantic_query` is a future tool for the research/chat-time agent; synthesis creates SL sources rather than querying them, so the tool is the wrong shape for this stage.
- Not adding `dictionary_search`. `entity_details` already returns per-column `sampleValues` from the relationship-profile, and `sql_execution` covers the rarer "where does this literal live?" case more accurately than a sampled-JSON full-text scan.
## What already exists in KTX
The dialect/driver/connection architecture is fully ported from Kaelio. The new tools sit on top of three already-shipping primitives:
| Primitive | Location |
|---|---|
| `KtxTableRef = { catalog: string\|null, db: string\|null, name: string }` | `packages/context/src/scan/types.ts:168` |
| `SemanticLayerService.executeQuery(connectionId, sql)` | `packages/context/src/sl/semantic-layer.service.ts:1004`, used today by `sl_validate` |
| `assertReadOnlySql` / `limitSqlForExecution` | `packages/context/src/connections/read-only-sql.ts` |
| 7 connectors with parallel layout (postgres, mysql, sqlserver, snowflake, bigquery, clickhouse, sqlite), each exporting a dialect class | `packages/connector-*` |
| Raw scan artefacts: `tables/<base64(catalog??'_')>.<base64(db)>.<base64(name)>.json` and `enrichment/relationship-profile.json` (with `nativeType`, `nullable`, `primaryKey`, `foreignKeys`, `rowCount`, `nullCount`, `distinctCount`, `sampleValues`, descriptions) | `raw-sources/<connectionId>/live-database/<latest-sync>/` |
| `wiki_search`, `sl_discover`, `sl_read_source`, `sl_validate`, `emit_unmapped_fallback` | already wired into synthesis stages |
The only meaningfully new code is `WarehouseCatalogService`, a small `getDialectForDriver` dispatch, the three tool files, and the wiring in `ingest-bundle.runner.ts`.
## Architecture
### Module layout
```
packages/context/src/ingest/tools/warehouse-verification/
  discover-data.tool.ts
  entity-details.tool.ts
  sql-execution.tool.ts
  warehouse-catalog.service.ts
  index.ts                      # exports createWarehouseVerificationTools()
packages/context/src/connections/
  dialects.ts                   # adds getDialectForDriver()
packages/context/skills/_shared/
  identifier-verification.md    # the protocol snippet referenced from every synthesis skill
```
### Canonical table identity
Every tool that names a warehouse object uses the tuple `(connectionName, catalog, db, name[, column])`. `connectionName` is the slug from `ktx.yaml` (e.g., `"warehouse"`), validated against `^[a-zA-Z0-9][a-zA-Z0-9_-]*$`. There is no UUID layer.
`display` strings the LLM picks up from source pages (e.g., `"orbit_raw.accounts"` for Postgres or `"project.dataset.table"` for BigQuery) are parsed by `WarehouseCatalogService.resolveDisplay`, which knows the connection's driver via `getDialectForDriver`. Ambiguous parses (e.g., a 2-part display on BigQuery) return a candidates list instead of guessing.
Dialect mapping:
| Driver | catalog | db | name | Display |
|---|---|---|---|---|
| postgres | `null` | schema | table | `schema.table` |
| mysql | `null` | schema | table | `schema.table` |
| sqlserver | catalog | schema | table | `catalog.schema.table` |
| snowflake | database | schema | table | `db.schema.table` |
| bigquery | project | dataset | table | `project.dataset.table` |
| clickhouse | `null` | database | table | `database.table` |
| sqlite | `null` | `null` | table | `table` |
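The mapping above can be read as "how many dot-separated parts name a table for this driver". The sketch below applies that reading to split a display string into a table or column reference. `parseDisplay`, `TABLE_PARTS`, and the `TableRef` shape are illustrative assumptions, and quoted identifiers that contain dots are deliberately not handled.

```typescript
type Driver = 'postgres' | 'mysql' | 'sqlserver' | 'snowflake' | 'bigquery' | 'clickhouse' | 'sqlite';

// Assumed shape; mirrors the (catalog, db, name) columns of the mapping table.
interface TableRef {
  catalog: string | null;
  db: string | null;
  name: string;
}

// Number of dot-separated parts in a fully qualified table display, per driver.
const TABLE_PARTS: Record<Driver, number> = {
  postgres: 2, mysql: 2, clickhouse: 2,
  sqlserver: 3, snowflake: 3, bigquery: 3,
  sqlite: 1,
};

type ParsedDisplay =
  | { kind: 'table'; ref: TableRef }
  | { kind: 'column'; ref: TableRef; column: string }
  | { kind: 'ambiguous'; parts: string[] };

function toRef(parts: string[]): TableRef {
  if (parts.length === 3) return { catalog: parts[0], db: parts[1], name: parts[2] };
  if (parts.length === 2) return { catalog: null, db: parts[0], name: parts[1] };
  return { catalog: null, db: null, name: parts[0] };
}

function parseDisplay(driver: Driver, display: string): ParsedDisplay {
  const parts = display.split('.').map((p) => p.trim()).filter((p) => p.length > 0);
  const n = TABLE_PARTS[driver];
  if (parts.length === n) return { kind: 'table', ref: toRef(parts) };
  if (parts.length === n + 1) {
    return { kind: 'column', ref: toRef(parts.slice(0, n)), column: parts[n] };
  }
  // Too few or too many parts: let the caller fall back to a candidates list.
  return { kind: 'ambiguous', parts };
}
```

Under these assumptions a 2-part display on BigQuery comes back as `ambiguous` rather than being guessed into a project or dataset.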
### `WarehouseCatalogService`
Stateless except for a per-WorkUnit cache. Reads raw scan files under `raw-sources/<connectionName>/live-database/<latest-sync>/`.
```ts
class WarehouseCatalogService {
  getTable(ref: { connectionName: string } & KtxTableRef): Promise<TableDetail | null>;
  listTables(connectionName: string): Promise<KtxTableRef[]>;
  resolveDisplay(connectionName: string, display: string): Promise<{
    resolved: KtxTableRef | null;
    candidates: KtxTableRef[]; // ranked by edit distance when resolved is null
    dialect: string;
  }>;
  searchByName(connectionName: string, query: string, limit: number): Promise<Array<
    | { kind: 'table'; ref: KtxTableRef; matchedOn: 'name'|'db'|'comment'|'description' }
    | { kind: 'column'; ref: KtxTableRef & { column: string }; matchedOn: 'name'|'comment'|'description' }
  >>;
  getLatestSyncId(connectionName: string): Promise<string | null>;
}
```
`getTable` merges the raw schema file (native types, PK, FK, nullable) with the enrichment profile (row counts, null rates, distinct counts, sample values, AI-generated descriptions). When no scan exists for the connection, every read returns `null`; tools surface this as a distinct "no scan available" state rather than as "identifier not found", so the LLM doesn't conclude a real table is fictional just because a scan hasn't run yet.
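The merge step might look roughly like the sketch below. Every interface here is an assumption made for illustration; the real scan file schemas and the `TableDetail` type are not defined in this section.

```typescript
// Assumed shape of one column in the raw schema file.
interface RawColumn {
  name: string;
  nativeType: string;
  nullable: boolean;
  primaryKey: boolean;
}
interface RawTableFile {
  columns: RawColumn[];
}

// Assumed shape of the enrichment profile for one table.
interface ColumnProfile {
  nullCount?: number;
  distinctCount?: number;
  sampleValues?: string[];
  description?: string;
}
interface TableProfile {
  rowCount?: number;
  columns: Record<string, ColumnProfile>;
}

interface MergedColumn extends RawColumn, ColumnProfile {}
interface MergedTable {
  rowCount: number | null;
  columns: MergedColumn[];
}

// Returns null when there is no schema file, so callers can report
// "no scan" or "not found" instead of fabricating an empty table.
function mergeTableDetail(raw: RawTableFile | null, profile: TableProfile | null): MergedTable | null {
  if (raw === null) return null;
  return {
    rowCount: profile?.rowCount ?? null,
    // Schema facts are spread last so they win over anything in the profile.
    columns: raw.columns.map((col) => ({ ...(profile?.columns[col.name] ?? {}), ...col })),
  };
}
```

A table with a schema file but no enrichment profile still resolves in this sketch; only the profile-derived fields are absent.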
### `getDialectForDriver`
```ts
// packages/context/src/connections/dialects.ts
export type SupportedDriver = 'postgres'|'postgresql'|'mysql'|'sqlserver'|'snowflake'|'bigquery'|'clickhouse'|'sqlite'|'sqlite3';
export function getDialectForDriver(driver: SupportedDriver): KtxDialect;
```
Sync dispatch. The connectors' existing dialect classes already expose the same shape — `formatTableName(KtxTableRef)`, `quoteIdentifier(string)`, `mapToDimensionType(nativeType)`. The implementation plan introduces a minimal `KtxDialect` interface that these classes already satisfy structurally; no connector-internal changes required. Used by tools only for display-string parsing and error-message formatting; tools never construct executable SQL.
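A minimal sketch of the dispatch follows, assuming that `postgresql` and `sqlite3` are aliases of `postgres` and `sqlite`. The method names come from the paragraph above; the return types, the `KtxTableRef` shape, and the `StubDialect` class are assumptions standing in for the connectors' real dialect classes.

```typescript
// Assumed shape of KtxTableRef.
interface KtxTableRef {
  catalog: string | null;
  db: string | null;
  name: string;
}

// Method names from the spec; return types are assumptions.
interface KtxDialect {
  formatTableName(ref: KtxTableRef): string;
  quoteIdentifier(identifier: string): string;
  mapToDimensionType(nativeType: string): string;
}

// Hypothetical stand-in. Real connectors have their own quoting and type rules.
class StubDialect implements KtxDialect {
  readonly driver: string;
  constructor(driver: string) {
    this.driver = driver;
  }
  quoteIdentifier(identifier: string): string {
    return `"${identifier.replace(/"/g, '""')}"`;
  }
  formatTableName(ref: KtxTableRef): string {
    return [ref.catalog, ref.db, ref.name]
      .filter((part): part is string => part !== null)
      .map((part) => this.quoteIdentifier(part))
      .join('.');
  }
  mapToDimensionType(nativeType: string): string {
    return /int|numeric|decimal|float|double/i.test(nativeType) ? 'number' : 'string';
  }
}

const DRIVER_ALIASES: Record<string, string> = { postgresql: 'postgres', sqlite3: 'sqlite' };
const CANONICAL_DRIVERS = ['postgres', 'mysql', 'sqlserver', 'snowflake', 'bigquery', 'clickhouse', 'sqlite'];

// Accepts a plain string so the unknown-driver path is reachable at runtime.
function getDialectForDriverSketch(driver: string): KtxDialect {
  const canonical = DRIVER_ALIASES[driver] ?? driver;
  if (!CANONICAL_DRIVERS.includes(canonical)) {
    throw new Error(`Unsupported driver "${driver}". Supported drivers: ${CANONICAL_DRIVERS.join(', ')}`);
  }
  return new StubDialect(canonical);
}
```

Because the interface is satisfied structurally, a connector class with these three methods would type-check as a `KtxDialect` without declaring `implements`.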
## Tool contracts
### `entity_details`
```ts
input = {
  connectionName: string,
  targets: Array<              // 1..50, mixed shapes allowed
    | { display: string }      // "orbit_raw.accounts" or "orbit_raw.accounts.account_id"
    | { catalog: string|null, db: string, name: string, column?: string }
  >,
}
```
Output (markdown, per target):
```
### orbit_raw.accounts
Type: table | Native columns: 11 | PK: account_id | FKs: parent_account_id → orbit_raw.accounts.account_id
Description: One row per customer account…
Columns:
- account_id (text, nullable=false, PK) — sample: ["acct_001","acct_002",…]
- parent_account_id (text, nullable=true, FK → orbit_raw.accounts.account_id)
- account_name (text, nullable=false)
- …
Profile: rowCount=4321 distinctCount(account_id)=4321 nullRate(parent_account_id)=0.62
```
When `column` is provided in a target, output is scoped to that one column. When a target doesn't resolve, output is `Not found in scan. Closest matches: …` with up to 5 candidates from `searchByName`. When the connection has no `live-database` scan, output is ``No live-database scan available for connection "<name>"; run `ktx scan` first.`` This is distinct from the "not found" state.
Structured output: `{ resolved: TableDetail[], missing: Array<{target, candidates}>, scanAvailable: boolean }`.
Refuses `connectionName` values not in the WU-stage's `allowedConnectionNames` set.
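The three failure paths (out-of-scope connection, no scan, unresolved target) could be kept apart as in the sketch below. `renderEntityDetailMisses` is a hypothetical helper, and throwing on an out-of-scope connection is an assumption; the tool may instead return an error result.

```typescript
interface MissingTarget {
  target: string;
  candidates: string[];
}

function renderEntityDetailMisses(
  connectionName: string,
  allowedConnectionNames: ReadonlySet<string>,
  scanAvailable: boolean,
  missing: MissingTarget[],
): string {
  // Scope check comes first so nothing is read for a disallowed connection.
  if (!allowedConnectionNames.has(connectionName)) {
    throw new Error(`Connection "${connectionName}" is not allowed in this stage`);
  }
  // "No scan" is reported once for the connection, never per target.
  if (!scanAvailable) {
    return `No live-database scan available for connection "${connectionName}"; run \`ktx scan\` first.`;
  }
  return missing
    .map((m) => {
      const closest = m.candidates.slice(0, 5).join(', ');
      return `### ${m.target}\nNot found in scan. Closest matches: ${closest || 'none'}`;
    })
    .join('\n\n');
}
```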
### `sql_execution`
```ts
input = {
  connectionName: string,
  sql: string,        // single SELECT or WITH only
  rowLimit?: number,  // default 100, hard cap 1000
}
```
Pipeline:
1. `assertReadOnlySql(sql)` — regex rejects anything starting with `insert|update|delete|merge|alter|drop|create|truncate|grant|revoke|copy|call|do|vacuum|analyze|refresh`.
2. `limitSqlForExecution(sql, rowLimit)` — wraps as `select * from (<llm_sql>) as ktx_query_result limit N`.
3. `SemanticLayerService.executeQuery(connectionName, wrappedSql)`.
4. Format as markdown table; first ~20 rows inline; if truncated, append `… +N more rows`.
Structured output: `{ headers, rows, rowCount, truncated, sql, wrappedSql }`.
Connector errors surface verbatim (e.g., Postgres `relation "orbit_analytics.customer" does not exist`). That error message is the most valuable verification signal — it tells the LLM the identifier is fictional.
Refuses `connectionName` not in `allowedConnectionNames`. Each connector's driver-level read-only enforcement (Postgres read-only transaction, BigQuery query-only jobs) is a second defence under the regex gate.
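Steps 1, 2, and 4 of the pipeline can be sketched as pure functions. These are illustrative stand-ins with a `Sketch` suffix, not the code in `read-only-sql.ts`; stripping a trailing semicolon and clamping the limit to at least 1 are assumptions.

```typescript
const MUTATING_PREFIX =
  /^\s*(insert|update|delete|merge|alter|drop|create|truncate|grant|revoke|copy|call|do|vacuum|analyze|refresh)\b/i;

// Step 1: reject statements that start with a mutating verb.
function assertReadOnlySqlSketch(sql: string): void {
  if (MUTATING_PREFIX.test(sql)) {
    throw new Error('Only read-only SELECT or WITH statements are allowed');
  }
}

// Step 2: wrap the query so the row limit applies regardless of its own LIMIT.
function limitSqlSketch(sql: string, rowLimit = 100, hardCap = 1000): string {
  const n = Math.min(Math.max(1, Math.floor(rowLimit)), hardCap);
  const inner = sql.trim().replace(/;\s*$/, '');
  return `select * from (${inner}) as ktx_query_result limit ${n}`;
}

// Step 4: markdown table with the first rows inline and a truncation note.
function formatMarkdownTable(headers: string[], rows: unknown[][], inlineRows = 20): string {
  const shown = rows.slice(0, inlineRows);
  const lines = [
    `| ${headers.join(' | ')} |`,
    `| ${headers.map(() => '---').join(' | ')} |`,
    ...shown.map((row) => `| ${row.map((value) => String(value)).join(' | ')} |`),
  ];
  const hidden = rows.length - shown.length;
  if (hidden > 0) lines.push(`… +${hidden} more rows`);
  return lines.join('\n');
}
```

A prefix check cannot see a mutating statement nested inside a `WITH`, which is one reason the driver-level read-only enforcement matters as a second layer.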
### `discover_data`
```ts
input = {
  query: string,
  connectionName?: string,  // omit to search all configured warehouse connections
  limit?: number,           // default 10 per section
  sourceName?: string,      // SL source detail mode (delegates to sl_discover)
}
```
Composes three searches and groups output into three sections, omitting empty sections:
1. **Wiki Pages**: `wiki_search({query, limit})`. Routing hint: *use `wiki_read(blockKey)` for full content*.
2. **Semantic Layer Sources**: `sl_discover({query, connectionName})`. Routing hint: *use `sl_read_source(sourceName)` for the YAML, or `entity_details` for warehouse-shape details*.
3. **Raw Warehouse Schema**: `WarehouseCatalogService.searchByName(connectionName, query, limit)`. Routing hint: *use `entity_details({connectionName, targets: [{display}]})` for full DDL + sample values*.
When `sourceName` is set, delegates entirely to `sl_discover` inspect mode and skips other sections. When all three sections are empty, output is `No matches for "<query>" across wiki, semantic layer, or raw warehouse schema. Try broader terms; this concept may not exist yet.`
Structured output: `{ wiki: WikiSearchStructured|null, sl: SlDiscoverStructured|null, raw: RawSchemaHits|null }`.
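The grouping rule (omit empty sections, fall back to a single message when everything is empty) could be sketched as follows. The `Section` shape and the exact markdown layout are assumptions; only the section titles and the no-match message come from this spec.

```typescript
interface Section {
  title: string; // e.g. "Wiki Pages"
  hint: string;  // routing hint shown under the hits
  hits: string[];
}

function renderDiscoverOutput(query: string, sections: Section[]): string {
  const nonEmpty = sections.filter((section) => section.hits.length > 0);
  if (nonEmpty.length === 0) {
    return `No matches for "${query}" across wiki, semantic layer, or raw warehouse schema. Try broader terms; this concept may not exist yet.`;
  }
  return nonEmpty
    .map((section) =>
      [`## ${section.title}`, ...section.hits.map((hit) => `- ${hit}`), `Hint: ${section.hint}`].join('\n'),
    )
    .join('\n\n');
}
```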
## Wiring
`packages/context/src/ingest/ingest-bundle.runner.ts` already plumbs `emit_unmapped_fallback` into both the WorkUnit stage (`createEmitUnmappedFallbackTool` around line 726) and the reconcile stage (around line 962), with merging done via `packages/context/src/ingest/stages/build-wu-context.ts` and `build-reconcile-context.ts`.
Add a parallel factory next to those existing calls:
```ts
const warehouseTools = createWarehouseVerificationTools({
  semanticLayerService: scopedSemanticLayerService,
  warehouseCatalog: new WarehouseCatalogService({ fileStore, projectDir }),
  dialects: getDialectForDriver,
  allowedConnectionNames: slConnectionIds, // reuse existing scoping
  sqlExecutionRowLimit: 100,
});
// Merge `entity_details`, `sql_execution`, `discover_data` into both stage tool maps
// alongside emit_unmapped_fallback.
```
`createWarehouseVerificationTools` returns `Record<string, Tool>` with three keys. The set is wired into every adapter's synthesis stage — no per-adapter opt-in.
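One way to merge the returned record into a stage tool map is sketched below. The `Tool` shape and the `mergeToolMaps` helper are hypothetical, and failing on a name collision is a design assumption rather than something the runner is known to do.

```typescript
// Hypothetical minimal tool shape; the real Tool type lives in the runner.
interface Tool {
  description: string;
}

function mergeToolMaps(base: Record<string, Tool>, extra: Record<string, Tool>): Record<string, Tool> {
  for (const key of Object.keys(extra)) {
    // Fail loudly instead of silently replacing an existing tool.
    if (Object.prototype.hasOwnProperty.call(base, key)) {
      throw new Error(`Tool "${key}" is already registered in this stage`);
    }
  }
  return { ...base, ...extra };
}
```

Calling this once per stage keeps the WorkUnit and reconcile tool maps aligned on the same three keys.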
## Skill-prompt updates
### Shared protocol
`packages/context/skills/_shared/identifier-verification.md`:
```md
## Identifier Verification Protocol
Before writing a wiki page or SL source on any topic:
1. `discover_data({query: "<topic>"})` — see what wikis, SL sources, and raw tables
already exist. Prefer updating existing pages over creating new ones.
Before emitting any `schema.table` or `schema.table.column` into a wiki body,
SL source, `tables:` frontmatter, `sl_refs`, or `emit_unmapped_fallback`:
2. `entity_details({connectionName, targets: [{display: "<identifier>"}]})`:
confirm the identifier resolves; inspect native types, FK/PK, and sampleValues.
3. For literal values from the source (status codes, plan tiers): check whether
they appear in `entity_details`' `sampleValues` for the relevant column.
If `sampleValues` is short or you suspect the sample missed real values, run
a `sql_execution` probe: `SELECT DISTINCT <col> FROM <ref> LIMIT 50`.
4. If the candidate identifier still doesn't resolve, do one of:
(a) Use `sql_execution` with `SELECT 1 FROM <ref> LIMIT 0`. If it errors,
the identifier is fictional.
(b) Wrap the identifier in `[unverified — from <rawPath>]` in the wiki body,
citing the exact raw path that mentioned it.
(c) When recording `emit_unmapped_fallback` with `no_physical_table`,
include the failing probe error in `clarification`.
5. Never copy `<schema>.<table>` placeholder strings from these instructions
into output.
```
Each affected skill inlines this block verbatim (skill files are independent prompts; KTX has no cross-skill include mechanism today).
### Per-skill diffs
Two skills are deliberately excluded from updates: `ingest_triage` (read-only triage; produces no wiki or SL output) and `sl` (umbrella reference doc; cross-links to the protocol but doesn't need its own copy).
| Skill | Changes |
|---|---|
| `notion_synthesize` | Inline protocol; append `discover_data`, `entity_details`, `sql_execution` to `Allowed:` (line 74); replace `orbit_analytics.customer` example on line 70 with `<schema>.<table>` |
| `dbt_ingest` | Inline protocol; line 24: replace `wiki_sl_search` → `discover_data` and `sl_describe_table` → `entity_details`; strengthen the "not permission to invent physical columns" paragraph by naming `entity_details` as the verification call |
| `lookml_ingest` | Inline protocol; add: "Verify each `sql_table_name` from the LookML view with `entity_details` before mapping to an SL source" |
| `looker_ingest` | Inline protocol; add: "For every Looker field reference, call `entity_details` on the underlying `(schema, table, column)` before promoting to `sl_refs` or quoting in wiki body" |
| `metabase_ingest` | Inline protocol; add: "Before writing a wiki page derived from a Metabase question's SQL, verify each `schema.table.column` mentioned with `entity_details`" |
| `metricflow_ingest` | Inline protocol; add: "Verify each MetricFlow model's source table with `entity_details` before producing the corresponding `sl_write_source`" |
| `live_database_ingest` | Inline protocol; add: "Sample values come from the scan record; do not invent values not present in `relationship-profile.json`" |
| `historic_sql_table_digest` | Shortened protocol focused on column attribution: "Only mention columns visible in the table's scan record. Use `entity_details({display})` if uncertain" |
| `historic_sql_patterns` | Inline protocol; add: "Every join column mentioned in pattern descriptions must be verified via `entity_details` for both sides of the join" |
| `knowledge_capture` | Inline protocol; update line 44: "First call `discover_data` to find existing wiki pages, SL sources, and raw tables on the topic" |
| `sl_capture` | Inline protocol; add: "Before `sl_write_source`, call `entity_details` on the target table to confirm column names and types match the YAML being written" |
### Cleanups beyond the three-tool addition
- `notion_synthesize/SKILL.md:70` — remove `orbit_analytics.customer` (placeholder).
- `packages/context/src/ingest/tools/emit-unmapped-fallback.tool.ts:67` — same example string in the Zod `.describe()` — replace with `<schema>.<table>`.
- `dbt_ingest/SKILL.md:24` — fix `wiki_sl_search` and `sl_describe_table` (neither tool exists in KTX).
- `packages/context/src/sl/tools/sl-warehouse-validation.ts:93` — inline error message references the non-existent `sl_describe_table`. Replace with `sl_read_source`.
## Testing strategy
### Unit tests
| Component | Tests |
|---|---|
| `getDialectForDriver` | Every supported driver returns a dialect; unknown driver throws with a clear list of supported drivers |
| `WarehouseCatalogService.getTable` | Reads and merges `tables/<b64>.json` and `relationship-profile.json`; returns `null` when no sync exists; returns `null` for unknown `(catalog, db, name)` |
| `WarehouseCatalogService.resolveDisplay` | Postgres 2-part display → `{catalog: null, db, name}`; BigQuery 3-part display → `{catalog, db, name}`; ambiguous 2-part on BigQuery returns candidates list; unknown displays produce closest-match candidates ordered by edit distance |
| `WarehouseCatalogService.searchByName` | Substring and token match; tiers (exact-name → token-match) ordered correctly; cache hit on second call within same instance |
| `entity_details` | Resolves `{display}` and structured inputs; reports "Not found" with candidates for unknown ref; reports "no scan available" distinctly when scan dir missing; truncates above 50 targets |
| `discover_data` | Three sections present when all three have hits; sections omitted when empty; `sourceName` inspect mode delegates to `sl_discover` and skips other sections; `allowedConnectionNames` scope honoured |
| `sql_execution` | `assertReadOnlySql` rejects each mutating verb; row-limit wrap visible in `wrappedSql`; connector errors surface verbatim with the failing SQL; rejects `connectionName` not in `allowedConnectionNames` |
### Integration tests
- Extend `packages/context/src/ingest/ingest-bundle.runner.test.ts` to verify the three new tools are present in both WU-stage and reconcile-stage tool maps and refuse out-of-scope `connectionName` values.
- New fixture-based test: stage a small `raw-sources/<conn>/live-database/<sync>/` directory with 2 tables + 1 enrichment profile, then call each tool through the runner's tool map and assert the markdown contains the expected fields. Uses the same fake-LLM harness as `notion.adapter.test.ts`.
- One end-to-end regression test reproducing the `orbit_analytics.customer` hallucination: a fake Notion page mentioning the fictional table is fed to the synthesis stage; the run produces a wiki page where the fictional name is wrapped in `[unverified — …]` or omitted, not promoted to `tables:` frontmatter.
### Prompt-bundling tests
Extend `packages/context/src/memory/memory-runtime-assets.test.ts`:
- Every skill in the synthesis-writers list embeds the verification-protocol block (assert by stable header text).
- Every such skill lists the three new tools when it has a `## Tools / Allowed` section, or mentions them inline in a workflow step otherwise.
- No skill file contains any of the banned strings: `orbit_analytics.customer`, `wiki_sl_search`, `sl_describe_table`.
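The three assertions above reduce to string checks over each skill body. The sketch below is a hypothetical pure helper, not the contents of `memory-runtime-assets.test.ts`; a real test would load the skill files from disk and would likely special-case skills that carry a shortened protocol.

```typescript
const PROTOCOL_HEADER = '## Identifier Verification Protocol';
const REQUIRED_TOOLS = ['discover_data', 'entity_details', 'sql_execution'];
const BANNED_STRINGS = ['orbit_analytics.customer', 'wiki_sl_search', 'sl_describe_table'];

// Returns one human-readable message per violation; an empty array means the skill passes.
function findSkillViolations(skillName: string, body: string): string[] {
  const violations: string[] = [];
  if (!body.includes(PROTOCOL_HEADER)) {
    violations.push(`${skillName}: missing verification protocol block`);
  }
  for (const tool of REQUIRED_TOOLS) {
    if (!body.includes(tool)) violations.push(`${skillName}: does not mention ${tool}`);
  }
  for (const banned of BANNED_STRINGS) {
    if (body.includes(banned)) violations.push(`${skillName}: contains banned string ${banned}`);
  }
  return violations;
}
```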
### Performance guards
`WarehouseCatalogService` caches the per-connection table list per stage (one WorkUnit's lifetime). Tests assert that the second call is a cache hit. There is no DB index for `searchByName` in this iteration; a linear scan over scan artefacts is acceptable up to ~50K columns. If volume warrants it later, a follow-up PR adds a SQLite FTS index.
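A per-instance cache with this behaviour might look like the sketch below. `PerConnectionCache` is a hypothetical name; caching the promise rather than the resolved value is a design assumption that also deduplicates concurrent loads.

```typescript
class PerConnectionCache<T> {
  private readonly entries = new Map<string, Promise<T>>();
  private readonly load: (connectionName: string) => Promise<T>;

  constructor(load: (connectionName: string) => Promise<T>) {
    this.load = load;
  }

  get(connectionName: string): Promise<T> {
    const cached = this.entries.get(connectionName);
    if (cached !== undefined) return cached;
    // Drop failed loads so a later call can retry instead of replaying the error.
    const pending = this.load(connectionName).catch((err: unknown) => {
      this.entries.delete(connectionName);
      throw err;
    });
    this.entries.set(connectionName, pending);
    return pending;
  }
}
```

Creating one instance per WorkUnit gives the stated lifetime without any explicit invalidation.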
## Rollout
Four mergeable PRs:
| PR | Lands |
|---|---|
| 1 | `getDialectForDriver` + `WarehouseCatalogService` + `entity_details` tool + wiring in `ingest-bundle.runner.ts` + unit/integration tests |
| 2 | `sql_execution` tool + tests + the `orbit_analytics.customer` regression test (which exercises protocol steps 4a/4c) |
| 3 | `discover_data` tool + tests |
| 4 | All 11 skill prompts updated with the verification protocol + the four cleanups + extended `memory-runtime-assets.test.ts` |
Skill prompts land last so they can reference the tools that already exist.
## Out of scope
- **Hard write-time validation in `wiki_write` / `emit_unmapped_fallback`.** A complementary spec covers regex-based identifier validation at the write boundary. Defence-in-depth — separate concern.
- **SQLite FTS index for `searchByName`.** Deferred until the linear scan benchmark fails.
- **`raw_schema_search` as a standalone tool.** `discover_data`'s raw section covers the concept-search case.
- **`semantic_query` in the synthesis toolset.** `semantic_query` will exist in KTX for the research/chat-time agent; it is deliberately excluded from synthesis because synthesis creates SL sources rather than queries them.
- **`dictionary_search`.** `entity_details` already returns per-column `sampleValues`; for the rarer "where does this literal live?" case, `sql_execution` is more accurate than a sampled-JSON scan.
- **UUID layer for tables/columns.** KTX deliberately stays string-keyed on `(connection, catalog, db, name)`.