mirror of
https://github.com/Kaelio/ktx.git
synced 2026-06-07 07:55:13 +02:00
* docs: add CLI component reuse guidance * docs: add unified ingest ux design * Refine unified ingest UX design after adversarial review iteration 1 * Refine unified ingest UX design after adversarial review iteration 2 * Refine unified ingest UX design after adversarial review iteration 3 * feat(cli): route public connection ingest command * feat(cli): hide standalone scan from public help * feat(cli): plan public ingest depth and query history * feat(cli): execute public database ingest facets * feat(ingest): read connection query history config * fix(cli): use public ingest wording * fix(config): stop generating ingest adapter allow lists * docs: document public ingest command * test: align ingest surface expectations * docs: add unified ingest public CLI surface plan * feat(cli): preflight deep public ingest readiness * feat(setup): store query history in connection context * feat(setup): store database context depth * feat(setup): verify context readiness by database depth * fix(setup): keep context build foreground only * fix(config): reject reserved ingest connection ids * test: close unified ingest v1 expectations * docs: add unified ingest v1 closure plan * fix(ingest): bypass adapter allow-list for public source ingest * fix(ingest): honor query history window intent * fix(ingest): hide scan internals from public database ingest * feat(ingest): use foreground view for interactive public ingest * fix(setup): use schema context and query history wording * test(cli): verify unified ingest public output * docs: add unified ingest v1 public output closure plan * fix(setup): forward query history flags * fix(setup): prompt for postgres query history * fix(status): report query history readiness * fix(ingest): remove legacy public guidance * fix(ingest): polish foreground retry copy * docs(examples): use unified query history wording * chore(ingest): finish public query history cleanup * docs: add unified ingest v1 query history status cleanup plan * test(docs): cover unified ingest public docs * docs: align ingest CLI reference with unified UX * docs: update context build guides for unified ingest * docs: update setup and primary source ingest wording * docs: stop advertising adapter-backed example ingest * docs: close unified ingest public docs gaps * docs: add unified ingest v1 docs site closure plan * fix: render unified ingest foreground warnings * fix: explain query history schema order * fix: add public ingest retry guidance * fix: align setup next steps with unified ingest * fix: remove scan wording from demo progress * test: verify unified ingest ux closure * docs: add unified ingest v1 foreground and retry closure plan * fix(cli): preserve query-history pull config in public ingest * fix(cli): omit hidden commands from docs command tree * test(cli): close unified ingest final public surface checks * docs: add unified ingest v1 final public surface closure plan * fix(cli): use public source labels in ingest reports * fix(cli): suppress low-level public ingest output * test(cli): verify unified ingest public plain output * docs: add unified ingest v1 public plain output closure plan * fix(cli): add public ingest copy sanitizers * fix(cli): sanitize public ingest progress copy * fix(cli): rename setup schema scope prompt * docs(plan): add progress copy closure; test: align setup back-nav fixture Adds the iter9 plan and updates the setup back-navigation test fixture to pass disableQueryHistory plus listSchemas/listTables stubs that the unified ingest setup step now requires. * docs(plan): add final ux labels plan with narrowed label scans * fix(cli): aggregate unsupported query-history warnings * fix(cli): align setup database labels * test(cli): fix setup database test type-check * fix(cli): remove primary-source wording from setup output * test(cli): verify unified ingest setup closure * docs(plan): add unified ingest v1 verification copy closure plan * fix(cli): remove top-level scan command * fix(cli): remove legacy ingest and wiki commands * Merge scan into ingest flow * feat(cli): split ingest progress into per-phase rows, rename work units to tasks Each database target in the unified ingest dashboard now renders one row per real subprocess (Schema, then Query history when enabled) instead of a single combined bar. Each phase has its own monotonic 0-100% bar so the progress never snaps back to zero when historic-sql starts after scan completes. Completed phases keep their final bar, summary, and elapsed time visible as an inline audit trail; queued and skipped phases are shown explicitly. Also rename user-facing "work units" / "Failed work units" to "tasks" / "Failed tasks" in ingest output and parseIngestSummary. The parser still accepts the legacy "Work units:" wording in captured output for backward compat. Internal memory-flow event names and type fields are left alone. * Fix test harness failures * Fix CI smoke checks --------- Co-authored-by: Andrey Avtomonov <7889985+andreybavt@users.noreply.github.com>
156 lines
12 KiB
Text
156 lines
12 KiB
Text
---
|
|
title: The Context Layer
|
|
description: What a context layer is, why agents need one, and how KTX compares to other semantic layers.
|
|
---
|
|
|
|
## The problem
|
|
|
|
Give an agent access to your database and it will generate SQL. It might even produce a decent chart. But ask it a real analytics question — "what's our net revenue trend by segment?" — and things fall apart.
|
|
|
|
The agent doesn't know that `orders.amount` includes refunds and needs a status filter. It doesn't know that `customers` should join to `orders` on `customer_id`, not `id`. It doesn't know that your team stopped using `legacy_segments` six months ago, or that "enterprise" means contracts over $100k, not just big logos. It sees column names and types. It doesn't see your business.
|
|
|
|
This isn't a model capability problem. Claude Code, Codex, and your BI agents can write correct SQL when they know what correct means. The gap is context: which tables matter, which joins are valid, which metrics are canonical, what the business terms actually refer to. Without that, agents produce plausible-looking artifacts that are subtly, dangerously wrong. Wrong enough to pass a glance, wrong enough to drive a decision.
|
|
|
|
Analytics engineers already know this pain. It's the same reason you write dbt tests, maintain a data dictionary, and spend half of standup explaining why someone's dashboard number doesn't match the board deck. The difference is that agents make decisions at machine speed, so the wrong context propagates faster than a human can catch it.
|
|
|
|
## Three waves of AI analytics
|
|
|
|
The industry has moved through three distinct approaches to getting AI and data to work together.
|
|
|
|
**Wave one: database access.** Connect an LLM to a database, let it generate SQL. This works for simple lookups — "how many orders last week?" — but breaks on anything that requires business knowledge. The agent guesses at joins, invents metrics, and hallucinates table relationships. Every query is a coin flip.
|
|
|
|
**Wave two: semantic layers and text-to-SQL.** Add structure. Define metrics in MetricFlow or Cube, expose schemas, build text-to-SQL pipelines. This is better — the agent knows that `revenue` means `sum(amount) where status != 'refunded'` — but building and maintaining that structure by hand is manual, time-consuming, and still limited. Semantic layers define what to calculate, not why, when, or how to interpret the result. The agent can compute net revenue but doesn't know about the February refund anomaly, the segment reclassification, or the fact that `enterprise` changed definition last quarter.
|
|
|
|
**Wave three: agentic context.** AI is no longer just answering questions — it's generating dashboards, writing semantic definitions, proposing dbt models, creating tests and documentation. For that to work, agents need more than metric definitions. They need the full picture: business rules, known data quality issues, relationship maps, historical context, and the institutional knowledge that lives in your team's heads. They need a context layer.
|
|
|
|
## What a context layer is
|
|
|
|
A context layer is the infrastructure that gives agents the business knowledge they need to produce correct analytics artifacts. It includes a semantic layer — that's a critical component — but it's not the whole thing.
|
|
|
|
KTX organizes context into four pillars:
|
|
|
|
- Semantic sources
|
|
- Wiki pages
|
|
- Scan artifacts
|
|
- Provenance
|
|
|
|
Each pillar covers a different kind of context agents need before they can safely write SQL, update semantic definitions, or explain an analytics result.
|
|
|
|
**Semantic sources** are YAML definitions that describe your data in terms agents can reason about. Each source maps to a table or SQL query, declares its grain, defines typed columns, specifies valid joins, and exposes named measures with optional filters. This is where "revenue means `sum(amount)` excluding refunds" lives.
|
|
|
|
```yaml
|
|
name: orders
|
|
table: public.orders
|
|
grain: [id]
|
|
columns:
|
|
- name: id
|
|
type: number
|
|
- name: customer_id
|
|
type: number
|
|
- name: amount
|
|
type: number
|
|
- name: status
|
|
type: string
|
|
- name: created_at
|
|
type: time
|
|
role: time
|
|
joins:
|
|
- to: customers
|
|
"on": customer_id = customers.id
|
|
relationship: many_to_one
|
|
measures:
|
|
- name: revenue
|
|
expr: sum(amount)
|
|
filter: "status != 'refunded'"
|
|
description: Net revenue excluding refunds
|
|
- name: order_count
|
|
expr: count(id)
|
|
```
|
|
|
|
**Wiki pages** are Markdown documents that capture business definitions, rules, and operating context — the kind of context that doesn't fit in a schema definition. Pages have structured frontmatter (summary, tags, semantic layer references) and free-form content. Agents search them when they need to understand why a metric works a certain way, not just how to compute it.
|
|
|
|
```markdown
|
|
---
|
|
summary: Gross-to-net revenue reconciles paid invoices, credits, and refunds.
|
|
tags:
|
|
- finance
|
|
- revenue
|
|
refs:
|
|
- arr-contract-first
|
|
sl_refs:
|
|
- warehouse.invoices
|
|
usage_mode: auto
|
|
---
|
|
|
|
Gross revenue starts from paid invoice activity. Net revenue subtracts
|
|
credits and successful refunds in the month they are recorded.
|
|
|
|
Exclude unpaid, void, draft, and test-account invoice activity from
|
|
canonical revenue reporting.
|
|
```
|
|
|
|
**Scan artifacts** are the raw output of KTX's database scanner: table and column metadata, inferred foreign key relationships (even without declared constraints), column statistics, and enrichment reports. They form the foundation that semantic sources are built on.
|
|
|
|
**Provenance** is the record of how context was created and changed. Every ingestion session records a full transcript — which adapter ran, what the LLM decided, which sources were created or updated, and why. This is what makes the system auditable: you can trace any semantic source back to the ingestion decision that created it.
|
|
|
|
Together, these four pillars give agents enough context to produce analytics artifacts that match what your team would produce — not just syntactically valid SQL, but the right query for the question.
|
|
|
|
## How KTX compares
|
|
|
|
KTX is a context layer with an agent-native semantic layer at its core. MetricFlow, Cube, and Malloy model metrics, dimensions, joins, and generated SQL. KTX covers that semantic-layer work, then adds the context agents need to use and maintain it: wiki pages, schema scans, provenance, ingestion, validation, and agent-facing CLI commands.
|
|
|
|
The workflow is the difference. Traditional semantic layers are powerful, but they are usually built and maintained through manual modeling work, product-specific runtimes, or language-specific workflows. They are not agent-native by default, which makes them harder for agents to inspect, edit, validate, and review in a tight loop. KTX is designed for agents that need to read context, change semantic files, inspect generated SQL, and leave a reviewable git diff.
|
|
|
|
| | KTX semantic layer | MetricFlow | Cube | Malloy |
|
|
|---|---|---|---|---|
|
|
| **Model surface** | Plain YAML sources plus Markdown wiki pages | YAML semantic models and metrics in a dbt project | YAML or JavaScript cubes, views, access policies, and pre-aggregations | `.malloy` models, query pipelines, notebooks, and annotations |
|
|
| **What it models** | Sources, columns, measures, segments, joins, grain, filters, default time dimensions, and context references | Semantic models, entities, dimensions, measures, metrics, time grains, and metric types | Cubes, views, measures, dimensions, segments, joins, hierarchies, policies, and rollups | Sources, joins, dimensions, measures, calculations, nested results, and query pipelines |
|
|
| **Agent edit loop** | First-class. Agents can patch small files, save imperfect drafts, run validation, query through the CLI, inspect SQL, and refine in the same workflow | Possible, but the interface is a dbt/metric workflow rather than an agent context workflow | Possible through code-first models and platform APIs, but changes are tied to runtime deployment and governance concerns | Possible, but agents must operate in Malloy's language and compiler model |
|
|
| **Fan-out safety** | Explicit `grain` plus relationship metadata. KTX detects `one_to_many` fan-out, identifies chasm traps, pre-aggregates independent fact measures into CTEs, and rejects unsafe filters | Dataflow query planning for metric requests, multi-hop joins, metric time, and metric types | Runtime planner, modeled joins, primary keys, views, multi-fact views, and pre-aggregations | Symmetric aggregates and path-based aggregation in the language |
|
|
| **SQL generation** | Structured semantic query to canonical SQL, then dialect transpilation with sqlglot | Metric request to optimized query plan, then engine-specific SQL | REST, GraphQL, Postgres-compatible SQL, Semantic SQL, and cached/pre-aggregated execution | Malloy source/query to dialect-specific SQL and result metadata |
|
|
| **Context around semantics** | Built in: wiki pages, scan artifacts, relationship inference, ingest transcripts, replay, and agent-facing CLI commands | Primarily metric and dbt project context | Descriptions and `meta.ai_context` inside the semantic model, plus platform agent features | Annotations/tags can carry metadata; surrounding context depends on the application |
|
|
| **Best fit** | Agents maintaining analytics code, metrics, joins, SQL, docs, and semantic definitions | Teams standardizing metrics inside dbt workflows | Production semantic APIs, BI integrations, access control, caching, and concurrency | Expressive modeling and exploratory analysis above SQL |
|
|
|
|
If you do not have a semantic layer, KTX can build an agent-native one from your database schema and enrich it with generated descriptions and wiki pages. If you already use MetricFlow or LookML, KTX ingests from those tools and merges their context into KTX's files. You can keep your existing BI or metric-serving system while using KTX as the semantic and contextual surface agents work against.
|
|
|
|
## The plain-files philosophy
|
|
|
|
A KTX project is a directory of plain files. No server to run, no database to manage, no proprietary store to back up. Everything is YAML, Markdown, and SQLite — formats you can read, diff, and version-control with tools you already use.
|
|
|
|
```
|
|
my-project/
|
|
├── ktx.yaml # Project configuration
|
|
├── semantic-layer/
|
|
│ └── warehouse/
|
|
│ ├── orders.yaml # Semantic source definitions
|
|
│ ├── customers.yaml
|
|
│ └── order_items.yaml
|
|
├── wiki/
|
|
│ ├── global/
|
|
│ │ ├── revenue.md # Business definitions and rules
|
|
│ │ └── segment-classification.md
|
|
│ └── user/
|
|
│ └── local/
|
|
│ └── data-quality-notes.md
|
|
├── raw-sources/
|
|
│ └── warehouse/
|
|
│ └── database-ingest/ # Schema ingest artifacts and reports
|
|
└── .ktx/
|
|
├── db.sqlite # Local state (git-ignored)
|
|
└── cache/ # Runtime cache (git-ignored)
|
|
```
|
|
|
|
Semantic sources and wiki pages are committed to git. The SQLite database holds ephemeral state — schema ingest results, embedding indexes, session logs — and is git-ignored. If you delete it, KTX rebuilds it on the next run.
|
|
|
|
This means your analytics context travels with your code. You can fork it, branch it, review it in a PR, and merge it with the same tools you use for dbt models. There's no sync problem between a remote server and your local state. There's no migration to run. The files are the source of truth.
|
|
|
|
## Agent usage notes
|
|
|
|
Use this page when an agent needs to explain why KTX exists, why schema-only database access is not enough, or how KTX differs from MetricFlow, Cube, Malloy, and traditional semantic layers.
|
|
|
|
| Agent task | Relevant section | Next page |
|
|
|------------|------------------|-----------|
|
|
| Explain why a database agent made a plausible but wrong query | The problem | [Writing Context](/docs/guides/writing-context) |
|
|
| Decide whether a metric belongs in YAML or Markdown | What a context layer is | [Writing Context](/docs/guides/writing-context) |
|
|
| Compare KTX to another semantic layer | How KTX compares | [Primary Sources](/docs/integrations/primary-sources) |
|
|
| Explain reviewability and source of truth | The plain-files philosophy | [Context as Code](/docs/concepts/context-as-code) |
|