ktx/packages/context/skills/knowledge_capture/SKILL.md
Andrey Avtomonov c22248dabf
feat(context): add warehouse verification tools (#46)
* feat(context): add warehouse dialect dispatch

* feat(context): read warehouse scan catalog

* feat(context): add entity details verification tool

* feat(context): add ingest SQL verification tool

* feat(context): add raw warehouse discovery tool

* feat(context): expose warehouse verification tools to ingest

* docs(context): add ingest identifier verification protocol

* test(context): guard ingest identifier verification prompts

* chore(context): verify warehouse verification tools

* docs: add warehouse verification tools plan and spec

* fix(context): expose target warehouses to Notion ingest

* fix(context): update ingest prompts for warehouse verification tools

* fix(context): scope raw schema discovery to allowed connections

* fix(context): verify warehouse column display targets

* docs: add notion warehouse verification gap closure plan

* fix(context): include raw discovery connection names

* fix(context): expose warehouse targets for LookML and MetricFlow

* fix(context): pass connection config to ingest query executors

* fix(cli): enable read-only SQL probes for local ingest

* docs: add warehouse verification final v1 closure plan

* fix(context): align warehouse sql probe prompt shape

* docs: add warehouse verification prompt shape closure plan

* test(context): catch connectionless sql execution prompt examples

* fix(context): include connection name in sl capture sql example

* docs: add warehouse verification sql example closure plan

* fix(context): report structured entity detail misses

* docs: add warehouse verification structured target miss closure plan

* fix: report untracked squash merge conflicts

* feat: require ingest verification ledger

* fix: stabilize ingest wiki references
2026-05-13 13:43:23 +02:00

9.4 KiB
Raw Permalink Blame History

name description callers
knowledge_capture KTX's knowledge base — wiki pages for durable, reusable business knowledge. Covers capture workflow for user preferences, metric definitions, organizational conventions, and cross-references between knowledge pages and semantic-layer sources. Loaded by the post-turn memory-agent only. The research agent reads wiki via `wiki_read`/`wiki_search` but does not write it.
memory_agent

Knowledge Capture

Role

The knowledge base stores durable, reusable business knowledge for an analytics assistant. Each page is a self-contained rule, definition, or convention that answers "how should this concept be handled in this organization?" — written once and reused across chats.

Scope selection is handled by the runtime:

  • When user-scoped knowledge is enabled AND the caller is a chat turn, writes go to the user's personal scope.
  • When the caller is an admin-driven ingest (sourceType: 'external_ingest'), writes go to the global scope.
  • When user-scoped knowledge is disabled, all writes go to the global scope.

The wiki_write tool picks the right scope based on the session. Capture logic does not need to choose — focus on whether the content is worth capturing at all.

What to capture

Capture when the user or the ingested document expresses:

  • A metric definition ("revenue means booked revenue after refunds").
  • A filter or convention that should always apply ("exclude test accounts when reporting ARR").
  • A mapping or alias ("mood_stress_sleep = Oxytocin protocol").
  • A domain rule that is not visible from column names alone ("status = 'T' means terminated, not 'terminated'").
  • A link or external system convention ("medplum_patient_id is the primary key in the EMR at https://emr.example/patients/{id}").

Do NOT capture:

  • One-off requests ("answer under 100 words").
  • Temporary instructions scoped to the current chat.
  • Ad-hoc formatting preferences.
  • Information already present in the semantic layer (column names, join paths, measure formulas — those belong in SL).
  • Query results, snapshots, or time-bounded benchmark tables. Numbers go stale; pasting "Oct 2025: 25%, Nov 2025: 19.9%, …" creates misinformation as soon as new data lands. Reference the SL source by name (sl_refs) and let future queries pull live data — the wiki captures the rule (definition, exclusion, segmentation), the SL source captures the measure, and semantic_query captures the current values.
  • Interpretive narrative tied to a specific snapshot ("M1 retention degraded sharply from Dec 2025"). The observation is anchored to data that will move; the actionable convention (e.g., "always exclude in-progress cohorts") may be worth capturing on its own, but the snapshot-specific commentary is not.

If nothing is worth capturing, respond without calling any tool.

Workflow

  1. Read the wiki index (provided in the prompt) and decide whether the turn introduces durable knowledge.
  2. Before writing, search for related content so cross-references are accurate:
    • discover_data first when a page relates to data or SL concepts — find existing wiki pages, SL sources, and raw warehouse schema together.
    • wiki_search with the topic — find related wiki pages to populate refs.
    • sl_discover with the concept — if the page defines a metric (revenue, churn, retention, LTV, ARR, MRR, CAC, attribution, etc.), find matching SL sources or measures to populate sl_refs. If no matches, pass sl_refs: [] so future readers know you checked.
  3. If updating an existing page, wiki_read it first. Use the returned structured.content or markdown body as the exact stored text for targeted replacements; current tags, refs, and sl_refs are returned in structured metadata.
  4. wiki_write to create or update. Prefer merging into an existing page over creating a new one.
  5. wiki_remove only when a page is truly obsolete — not to replace stale content (update it instead).

For bundle/external ingest, include rawPaths on every wiki_write/wiki_remove call with only the raw files that directly support that wiki action. This keeps ingest provenance tied to the actual source file, not every file in the WorkUnit.

Identifier Verification Protocol

Before writing a wiki page or SL source on any topic:

  1. discover_data({query: "<topic>"}) - see what wikis, SL sources, and raw tables already exist. Prefer updating existing pages over creating new ones.

Before emitting any schema.table or schema.table.column into a wiki body, SL source, tables: frontmatter, sl_refs, or emit_unmapped_fallback:

  1. entity_details({connectionName, targets: [{display: "<identifier>"}]}) - confirm the identifier resolves; inspect native types, FK/PK, and sampleValues.
  2. For literal values from the source, such as status codes or plan tiers, check whether they appear in entity_details sampleValues for the relevant column. If sampleValues is short or the sample may have missed real values, run a sql_execution probe with the same warehouse connection name: sql_execution({connectionName, sql: "SELECT DISTINCT <col> FROM <ref> LIMIT 50"}).
  3. If the candidate identifier still does not resolve, do one of:
    • Use sql_execution({connectionName, sql: "SELECT 1 FROM <ref> LIMIT 0"}). If it errors, the identifier is fictional.
    • Wrap the identifier in [unverified - from <rawPath>] in the wiki body, citing the exact raw path that mentioned it.
    • When recording emit_unmapped_fallback with no_physical_table, include the failing probe error in clarification.
  4. Never copy <schema>.<table> placeholder strings from these instructions into output.

Keys, summaries, and content

  • Keys are short kebab-case topic identifiers: leads-source-filter, revenue-definition, churn-calculation. No namespacing, no prefixes.
  • Summary is a one-line hook (≤200 chars) shown in the index.
  • Content is concise markdown — actionable rules, not prose.
## [Topic Title]
- Rule or preference statement
- Another rule if applicable

Prefer fewer, richer pages over many thin ones. Each page covers one coherent topic thoroughly. If the new information relates to an existing page, update that page instead of fragmenting the knowledge.

Tags, refs, sl_refs

The wiki_write tool accepts three array fields that go into the page frontmatter:

  • tags: 13 short lowercase topic tags (["finance"], ["data-quality"]). Call wiki_list_tags first to reuse existing tags for consistency.
  • refs: keys of related wiki pages. Add when the new page materially depends on concepts from another (e.g., a churn definition that uses the paid-orders filter from a revenue definition). Don't add refs just because pages share a topic area.
  • sl_refs: names of SL sources or measures the page relates to. Format: "source_name" or "source_name.measure_name". Discover via sl_discover → inspect with sl_read_source → include the confirmed matches.

Wiki page keys must be flat slugs. Use large-contract-requesters, not historic-sql/large-contract-requesters. Use tags, source, and content headings for grouping.

Replace semantics

All three fields use REPLACE semantics on update:

  • Omit the field → existing value is kept.
  • Pass [] → field is cleared.
  • Pass [values] → replaces existing with exactly those values (no merging).

Editing existing pages

Two modes:

  • Full content — pass content to rewrite the whole page. Use when the page structure needs to change.
  • Targeted edits — pass replacements: [{ oldText, newText }] to apply exact-string replacements. Use for small updates; preserves the rest of the page.

When editing, read the page first so the edit matches exact whitespace and indentation.

Overriding an organization rule

Organization (GLOBAL) pages are read-only from a user's personal-scope session. To override a global rule for a single user, write a personal page with the same key. At read time the USER page wins.

Worked example — capturing a metric with cross-references

User says: "Going forward, the official refund rate is total refunded amount divided by total gross transaction amount."

wiki_list_tags()
  → existing tags include "finance"
wiki_search({ query: "refund revenue paid orders" })
  → returns `revenue-definition` (related — defines paid-orders filter)
sl_discover({ query: "refund rate" })
  → returns fct_orders (score 0.08), fct_gaap_revenue (0.06)
sl_read_source({ sourceName: "fct_orders" })
  → confirms amount_refunded_dollars and transaction_amount_dollars exist
wiki_write({
  key: "refund-rate-definition",
  summary: "Refund rate = refunded amount / gross transaction amount",
  content: "## Refund Rate\n- Definition: sum(amount_refunded_dollars) / sum(transaction_amount_dollars)\n- Source of truth: fct_orders\n- Related: see revenue-definition for paid-orders filter.",
  tags: ["finance"],
  refs: ["revenue-definition"],
  sl_refs: ["fct_orders.refund_rate_pct", "fct_orders"]
})

Search-then-write order matters. Cross-references are part of the page's identity, not an afterthought.

Rules

  • Read existing pages before updating them.
  • Prefer merging into an existing page over creating a new one.
  • Prefer fewer, richer pages over many thin ones.
  • Write content as clear, actionable rules — not narrative prose.
  • Discover cross-references via search before writing, not after.
  • If nothing is worth capturing, respond without calling any tool.