mirror of
https://github.com/Kaelio/ktx.git
synced 2026-06-07 07:55:13 +02:00
* docs: add CLI component reuse guidance * docs: add unified ingest ux design * Refine unified ingest UX design after adversarial review iteration 1 * Refine unified ingest UX design after adversarial review iteration 2 * Refine unified ingest UX design after adversarial review iteration 3 * feat(cli): route public connection ingest command * feat(cli): hide standalone scan from public help * feat(cli): plan public ingest depth and query history * feat(cli): execute public database ingest facets * feat(ingest): read connection query history config * fix(cli): use public ingest wording * fix(config): stop generating ingest adapter allow lists * docs: document public ingest command * test: align ingest surface expectations * docs: add unified ingest public CLI surface plan * feat(cli): preflight deep public ingest readiness * feat(setup): store query history in connection context * feat(setup): store database context depth * feat(setup): verify context readiness by database depth * fix(setup): keep context build foreground only * fix(config): reject reserved ingest connection ids * test: close unified ingest v1 expectations * docs: add unified ingest v1 closure plan * fix(ingest): bypass adapter allow-list for public source ingest * fix(ingest): honor query history window intent * fix(ingest): hide scan internals from public database ingest * feat(ingest): use foreground view for interactive public ingest * fix(setup): use schema context and query history wording * test(cli): verify unified ingest public output * docs: add unified ingest v1 public output closure plan * fix(setup): forward query history flags * fix(setup): prompt for postgres query history * fix(status): report query history readiness * fix(ingest): remove legacy public guidance * fix(ingest): polish foreground retry copy * docs(examples): use unified query history wording * chore(ingest): finish public query history cleanup * docs: add unified ingest v1 query history status cleanup plan * test(docs): cover unified ingest public docs * docs: align ingest CLI reference with unified UX * docs: update context build guides for unified ingest * docs: update setup and primary source ingest wording * docs: stop advertising adapter-backed example ingest * docs: close unified ingest public docs gaps * docs: add unified ingest v1 docs site closure plan * fix: render unified ingest foreground warnings * fix: explain query history schema order * fix: add public ingest retry guidance * fix: align setup next steps with unified ingest * fix: remove scan wording from demo progress * test: verify unified ingest ux closure * docs: add unified ingest v1 foreground and retry closure plan * fix(cli): preserve query-history pull config in public ingest * fix(cli): omit hidden commands from docs command tree * test(cli): close unified ingest final public surface checks * docs: add unified ingest v1 final public surface closure plan * fix(cli): use public source labels in ingest reports * fix(cli): suppress low-level public ingest output * test(cli): verify unified ingest public plain output * docs: add unified ingest v1 public plain output closure plan * fix(cli): add public ingest copy sanitizers * fix(cli): sanitize public ingest progress copy * fix(cli): rename setup schema scope prompt * docs(plan): add progress copy closure; test: align setup back-nav fixture Adds the iter9 plan and updates the setup back-navigation test fixture to pass disableQueryHistory plus listSchemas/listTables stubs that the unified ingest setup step now requires. * docs(plan): add final ux labels plan with narrowed label scans * fix(cli): aggregate unsupported query-history warnings * fix(cli): align setup database labels * test(cli): fix setup database test type-check * fix(cli): remove primary-source wording from setup output * test(cli): verify unified ingest setup closure * docs(plan): add unified ingest v1 verification copy closure plan * fix(cli): remove top-level scan command * fix(cli): remove legacy ingest and wiki commands * Merge scan into ingest flow * feat(cli): split ingest progress into per-phase rows, rename work units to tasks Each database target in the unified ingest dashboard now renders one row per real subprocess (Schema, then Query history when enabled) instead of a single combined bar. Each phase has its own monotonic 0-100% bar so the progress never snaps back to zero when historic-sql starts after scan completes. Completed phases keep their final bar, summary, and elapsed time visible as an inline audit trail; queued and skipped phases are shown explicitly. Also rename user-facing "work units" / "Failed work units" to "tasks" / "Failed tasks" in ingest output and parseIngestSummary. The parser still accepts the legacy "Work units:" wording in captured output for backward compat. Internal memory-flow event names and type fields are left alone. * Fix test harness failures * Fix CI smoke checks --------- Co-authored-by: Andrey Avtomonov <7889985+andreybavt@users.noreply.github.com>
295 lines
12 KiB
Text
295 lines
12 KiB
Text
---
|
|
title: Writing Context
|
|
description: Write and refine semantic sources and wiki pages.
|
|
---
|
|
|
|
After building context through scanning and ingestion, you'll want to refine it — edit semantic sources to match your business logic, add wiki pages that capture tribal knowledge, and query your data through the semantic layer to verify everything works.
|
|
|
|
## Agent workflow summary
|
|
|
|
Agents should refine context in this order:
|
|
|
|
1. `ktx sl list --json` — discover available sources and connection ids.
|
|
2. `ktx sl search <query> --json` — find source candidates for a concept.
|
|
3. Edit the source YAML directly in `semantic-layer/<connection-id>/`.
|
|
4. `ktx sl validate <source> --connection-id <id>` — verify columns, joins, and table references.
|
|
5. `ktx sl query ... --format sql` — compile a representative query without executing it.
|
|
6. `ktx wiki search ...` — check business context captured by ingest or memory.
|
|
|
|
## Semantic Sources
|
|
|
|
Semantic sources are YAML files that describe your tables, columns, measures, and joins. They're the core of the context layer — the structured definitions that agents use to generate correct SQL.
|
|
|
|
### Listing sources
|
|
|
|
```bash
|
|
# List all sources across connections
|
|
ktx sl list
|
|
|
|
# List sources for a specific connection
|
|
ktx sl list --connection-id my-postgres
|
|
|
|
# Output as JSON
|
|
ktx sl list --json
|
|
```
|
|
|
|
### Searching sources
|
|
|
|
```bash
|
|
ktx sl search "revenue" --connection-id my-postgres --json
|
|
```
|
|
|
|
Search returns ranked source summaries. To inspect or edit a source, open the
|
|
YAML file under `semantic-layer/<connection-id>/`.
|
|
|
|
### The source schema
|
|
|
|
A semantic source defines a single queryable entity — usually a table or a SQL expression. Here's a fully annotated example:
|
|
|
|
```yaml
|
|
name: orders
|
|
description: Customer orders with line-item totals
|
|
table: public.orders # or use `sql:` for a custom SQL expression
|
|
grain:
|
|
- order_id # columns that uniquely identify a row
|
|
|
|
columns:
|
|
- name: order_id
|
|
type: string # string | number | time | boolean
|
|
description: Unique order identifier
|
|
|
|
- name: order_date
|
|
type: time
|
|
role: time # marks this as the default time dimension
|
|
description: Date the order was placed
|
|
|
|
- name: status
|
|
type: string
|
|
visibility: public # public (default) | internal | hidden
|
|
description: Current order status
|
|
|
|
- name: _etl_loaded_at
|
|
type: time
|
|
visibility: hidden # hidden columns are excluded from agent queries
|
|
description: Internal ETL timestamp
|
|
|
|
- name: total_amount
|
|
type: number
|
|
description: Order total in USD
|
|
|
|
measures:
|
|
- name: total_revenue
|
|
expr: SUM(total_amount)
|
|
description: Sum of all order values
|
|
- name: order_count
|
|
expr: COUNT(DISTINCT order_id)
|
|
description: Number of distinct orders
|
|
- name: avg_order_value
|
|
expr: AVG(total_amount)
|
|
description: Average order value
|
|
- name: high_value_revenue
|
|
expr: SUM(total_amount)
|
|
filter: total_amount > 100
|
|
description: Revenue from orders over $100
|
|
|
|
segments:
|
|
- name: us_orders
|
|
expr: country = 'US'
|
|
description: Orders from US customers
|
|
|
|
joins:
|
|
- to: customers
|
|
on: orders.customer_id = customers.customer_id
|
|
relationship: many_to_one # many_to_one | one_to_many | one_to_one
|
|
- to: order_items
|
|
on: orders.order_id = order_items.order_id
|
|
relationship: one_to_many
|
|
alias: items # optional alias for the joined source
|
|
```
|
|
|
|
Key fields:
|
|
|
|
| Field | Required | Description |
|
|
|-------|----------|-------------|
|
|
| `name` | Yes | Source identifier (lowercase, underscores) |
|
|
| `table` or `sql` | Yes | Database table or custom SQL expression (exactly one) |
|
|
| `grain` | Yes | Columns that define row uniqueness |
|
|
| `columns` | No | Column definitions with type, role, visibility |
|
|
| `measures` | No | Aggregation expressions (SUM, COUNT, AVG, etc.) |
|
|
| `joins` | No | Relationships to other sources |
|
|
| `segments` | No | Named filter conditions |
|
|
| `inherits_columns_from` | No | Inherit column metadata from a manifest entry |
|
|
|
|
Source component fields:
|
|
|
|
| Component | Field | Required | Description |
|
|
|-----------|-------|----------|-------------|
|
|
| Column | `name` | Yes | Column identifier as used in SQL expressions |
|
|
| Column | `type` | Yes | Agent-facing type: `string`, `number`, `time`, or `boolean` |
|
|
| Column | `role` | No | Special role such as `time` for default time dimensions |
|
|
| Column | `visibility` | No | `public`, `internal`, or `hidden` |
|
|
| Column | `description` | Strongly recommended | Human-readable business meaning |
|
|
| Measure | `name` | Yes | Queryable metric name |
|
|
| Measure | `expr` | Yes | SQL aggregation expression at the source grain |
|
|
| Measure | `filter` | No | SQL predicate applied only to this measure |
|
|
| Measure | `description` | Strongly recommended | Definition agents can cite and compare |
|
|
| Segment | `name` | Yes | Reusable filter name |
|
|
| Segment | `expr` | Yes | SQL predicate for the segment |
|
|
| Join | `to` | Yes | Target semantic source name |
|
|
| Join | `on` | Yes | SQL join condition using source names or aliases |
|
|
| Join | `relationship` | Yes | `many_to_one`, `one_to_many`, or `one_to_one` |
|
|
| Join | `alias` | No | Query alias for repeated or clearer joins |
|
|
|
|
Column visibility controls what agents see:
|
|
|
|
| Visibility | Behavior |
|
|
|------------|----------|
|
|
| `public` | Included in agent queries and listings (default) |
|
|
| `internal` | Available for joins and measures but not shown to agents |
|
|
| `hidden` | Excluded entirely — useful for ETL columns |
|
|
|
|
### Editing a source
|
|
|
|
Edit source files directly. They live at
|
|
`semantic-layer/<connection-id>/<source-name>.yaml` in your project directory.
|
|
|
|
### Validating sources
|
|
|
|
Validation checks a source definition against the actual database schema:
|
|
|
|
```bash
|
|
ktx sl validate orders --connection-id my-postgres
|
|
```
|
|
|
|
This catches mismatches — columns that don't exist in the table, type mismatches, invalid join targets — before an agent tries to use the source.
|
|
|
|
### Querying
|
|
|
|
The semantic layer compiles your measures and dimensions into SQL, optionally executing it against the database:
|
|
|
|
```bash
|
|
# Compile a query to SQL
|
|
ktx sl query \
|
|
--connection-id my-postgres \
|
|
--measure total_revenue \
|
|
--measure order_count \
|
|
--dimension "order_date" \
|
|
--filter "status = 'completed'" \
|
|
--order-by order_date:desc \
|
|
--limit 10 \
|
|
--format sql
|
|
```
|
|
|
|
This outputs the compiled SQL without executing it. To run the query:
|
|
|
|
```bash
|
|
# Execute and return results
|
|
ktx sl query \
|
|
--connection-id my-postgres \
|
|
--measure total_revenue \
|
|
--dimension "order_date" \
|
|
--execute \
|
|
--max-rows 100
|
|
```
|
|
|
|
Query flags:
|
|
|
|
| Flag | Description |
|
|
|------|-------------|
|
|
| `--measure <name>` | Measure to query (repeatable, at least one required) |
|
|
| `--dimension <name>` | Dimension to group by (repeatable) |
|
|
| `--filter <expr>` | Filter expression (repeatable) |
|
|
| `--segment <name>` | Named segment to apply (repeatable) |
|
|
| `--order-by <field[:dir]>` | Sort field, optionally with `:asc` or `:desc` (repeatable) |
|
|
| `--limit <n>` | Maximum rows in the compiled query |
|
|
| `--format <mode>` | Output format: `json` (default) or `sql` |
|
|
| `--execute` | Execute the query against the database |
|
|
| `--max-rows <n>` | Maximum rows to return when executing |
|
|
| `--include-empty` | Include empty/null rows in results |
|
|
|
|
The query planner is grain-aware — it understands the cardinality of joins and avoids chasm traps (double-counting caused by many-to-many fan-outs). When you query measures that span multiple sources, KTX generates sub-queries at the correct grain before joining.
|
|
|
|
### Workflow: edit and validate a source
|
|
|
|
1. Open `semantic-layer/my-postgres/orders.yaml`.
|
|
2. Edit the file to add columns, measures, joins, or descriptions.
|
|
3. `ktx sl validate orders --connection-id my-postgres` — check the definition against the live schema.
|
|
4. `ktx sl query --connection-id my-postgres --measure total_revenue --dimension order_date --format sql` — compile a representative query.
|
|
|
|
If validation fails, fix the YAML before asking an agent to use the source. Common validation failures are missing columns, invalid join targets, and measure expressions that reference fields outside the source.
|
|
|
|
## Wiki Pages
|
|
|
|
Wiki pages are Markdown files that capture business context — definitions, rules, gotchas, and anything an agent needs to understand beyond what the schema tells it.
|
|
|
|
### What they are
|
|
|
|
When an agent asks "what counts as an active user?" or "why do revenue numbers differ between the dashboard and the SQL query?", the answer isn't in the schema. It's tribal knowledge that lives in Slack threads, Notion pages, or someone's head. Wiki pages make that context searchable and available to agents.
|
|
|
|
### Organization
|
|
|
|
Wiki pages are organized by scope:
|
|
|
|
```
|
|
wiki/
|
|
├── global/ # Cross-cutting definitions
|
|
│ ├── order-status-definitions.md
|
|
│ ├── revenue-recognition-rules.md
|
|
│ └── data-freshness-sla.md
|
|
└── user/
|
|
└── local/ # User-scoped context
|
|
├── schema-conventions.md
|
|
└── known-data-issues.md
|
|
```
|
|
|
|
- **Global pages** apply across all connections — business definitions, metric standards, company terminology.
|
|
- **User-scoped pages** are private to a user ID — personal notes, local gotchas, or context you do not want shared globally.
|
|
|
|
### Editing pages
|
|
|
|
Create and edit wiki pages directly as Markdown files in the `wiki/`
|
|
directory. Ingest and memory capture also create these pages automatically.
|
|
|
|
Wiki page fields:
|
|
|
|
| Field | Required | Description |
|
|
|-------|----------|-------------|
|
|
| Key | Yes | Stable page identifier used as the Markdown filename |
|
|
| Summary | Yes | Short text shown in search results |
|
|
| Content | Yes | Full Markdown business context |
|
|
| Scope | No | `global` for shared context or `user` for user-scoped notes |
|
|
| Tags | No | Search and organization labels |
|
|
| External refs | No | Links or identifiers for source-of-truth systems |
|
|
| Semantic-layer refs | No | Source names the page explains or constrains |
|
|
|
|
### Listing pages
|
|
|
|
```bash
|
|
ktx wiki list
|
|
```
|
|
|
|
### Searching
|
|
|
|
```bash
|
|
ktx wiki search "revenue recognition"
|
|
```
|
|
|
|
Search uses both full-text matching and semantic similarity — it finds relevant pages even when the exact terms don't match. Agents call this automatically when they need business context to answer a question.
|
|
|
|
### Workflow: add searchable business context
|
|
|
|
1. Search first: `ktx wiki search "order status definitions"`.
|
|
2. If no page already covers the rule, create or edit a Markdown file under `wiki/global/`.
|
|
3. Include concise frontmatter; agents see the summary before loading full content.
|
|
4. Add `tags` values for the business area and `sl_refs` values for related semantic sources.
|
|
5. Search again with the user's likely wording to confirm the page is discoverable.
|
|
|
|
## Common errors
|
|
|
|
| Error or symptom | Likely cause | Recovery |
|
|
|------------------|--------------|----------|
|
|
| `ktx sl validate` reports a missing column | YAML references a column that is absent from the scanned table | Run a fresh scan or update the YAML to match the warehouse schema |
|
|
| Query compilation double-counts a measure | Join relationship or grain is missing or wrong | Add `grain` and explicit `relationship` values, then validate and recompile |
|
|
| Agent cannot find a metric | Measure name or description does not match business terminology | Add a measure description and a wiki page with common synonyms |
|
|
| Wiki search misses a page | Summary and tags do not include likely user wording | Rewrite the summary and add relevant tags, then search again |
|
|
| Semantic-layer changes are hard to review | The YAML edit is too large or unfocused | Split the change into smaller source-file edits, then review the git diff |
|