mirror of https://github.com/Kaelio/ktx.git
synced 2026-07-01 08:59:39 +02:00
Polish documentation copy (#98)
parent ce23aca4c4
commit 372c90b533
65 changed files with 478 additions and 478 deletions

@@ -1,23 +1,23 @@
 ---
 title: Context as Code
-description: Treat analytics context like code — version it, review it, merge it.
+description: Treat analytics context like code - version it, review it, merge it.
 ---

 ## The idea

-dbt proved that analytics transformations belong in version control. Before dbt, SQL lived in BI tools, scheduling systems, and spreadsheets — scattered, unreviewed, impossible to audit. "Analytics as code" changed that: put your models in git, review them in PRs, deploy them by merging.
+dbt proved that analytics transformations belong in version control. Before dbt, SQL lived in BI tools, scheduling systems, and spreadsheets - scattered, unreviewed, impossible to audit. "Analytics as code" changed that: put your models in git, review them in PRs, deploy them by merging.

-KTX applies the same principle to analytics context. Metric definitions, business rules, join relationships, wiki pages — these are artifacts that determine whether an agent produces correct results. They change over time. They need review. They need history. They need to be treated like code.
+KTX applies the same principle to analytics context. Metric definitions, business rules, join relationships, wiki pages - these are artifacts that determine whether an agent produces correct results. They change over time. They need review. They need history. They need to be treated like code.

 A KTX project is a git repository. Semantic sources are YAML files. Wiki pages are Markdown files. Changes are commits. Updates are pull requests. Deployment is a merge. The entire lifecycle of your analytics context follows the same workflow your team already uses for dbt models, application code, and infrastructure.

 ## Auto-ingestion

-Most analytics context already exists — it's in your dbt manifests, LookML models, Metabase questions, and team Notion pages. KTX pulls from these sources automatically through adapters.
+Most analytics context already exists - it's in your dbt manifests, LookML models, Metabase questions, and team Notion pages. KTX pulls from these sources automatically through adapters.

 An ingestion run works like this:

-1. **Adapters extract metadata.** Each configured source — dbt, LookML, Metabase, MetricFlow, Notion, or your live database — provides structured metadata about models, metrics, dimensions, questions, and documentation.
+1. **Adapters extract metadata.** Each configured source - dbt, LookML, Metabase, MetricFlow, Notion, or your live database - provides structured metadata about models, metrics, dimensions, questions, and documentation.

 2. **The LLM agent reconciles.** KTX doesn't blindly overwrite existing context. An LLM agent compares incoming metadata against your current semantic sources and wiki pages. It decides what to create, what to update, and what to leave alone. If your dbt project added a new model, the agent writes a new semantic source. If a Metabase question references a metric you've already defined, the agent skips the duplicate.

@@ -66,17 +66,17 @@ metadata, and documentation updates are ready for review each morning.

 Once merged, agents querying through the KTX CLI see the updated context immediately. No deployment step, no cache invalidation, no restart. The files are the source of truth, and agents read them on every request.

-This workflow gives you the same review guarantees you have for dbt models. No semantic source reaches production without a human approving it. But unlike maintaining context manually, the heavy lifting — discovering new tables, drafting source definitions, extracting business rules from documentation — is done by the ingestion agent. You review and approve. You don't write from scratch.
+This workflow gives you the same review guarantees you have for dbt models. No semantic source reaches production without a human approving it. But unlike maintaining context manually, the heavy lifting - discovering new tables, drafting source definitions, extracting business rules from documentation - is done by the ingestion agent. You review and approve. You don't write from scratch.

 ## Feedback loops

 Context improves over time through two feedback channels.

-**Analyst corrections.** When an analytics engineer spots something wrong — a measure formula that doesn't match the business definition, a join that should be `many_to_one` instead of `one_to_many`, a wiki page that's out of date — they edit the YAML or Markdown directly and commit. These corrections become part of the project's git history, and the next ingestion run respects them. If you manually fix a measure definition, KTX won't overwrite it on the next ingest.
+**Analyst corrections.** When an analytics engineer spots something wrong - a measure formula that doesn't match the business definition, a join that should be `many_to_one` instead of `one_to_many`, a wiki page that's out of date - they edit the YAML or Markdown directly and commit. These corrections become part of the project's git history, and the next ingestion run respects them. If you manually fix a measure definition, KTX won't overwrite it on the next ingest.

-**Agent feedback.** When an agent queries the semantic layer and gets unexpected results — a query that returns no rows because of a bad filter, a join path that produces duplicated results — it can flag the issue. These signals feed back into the context: wiki pages can note known data quality issues, and source definitions can be tightened with better filters, join paths, or grain declarations.
+**Agent feedback.** When an agent queries the semantic layer and gets unexpected results - a query that returns no rows because of a bad filter, a join path that produces duplicated results - it can flag the issue. These signals feed back into the context: wiki pages can note known data quality issues, and source definitions can be tightened with better filters, join paths, or grain declarations.

-Each of these channels makes the next ingestion cycle better. Analyst corrections teach the system what your team considers authoritative. Agent feedback surfaces gaps in coverage. Context is not a static artifact — it's a living system that converges toward accuracy with every iteration.
+Each of these channels makes the next ingestion cycle better. Analyst corrections teach the system what your team considers authoritative. Agent feedback surfaces gaps in coverage. Context is not a static artifact - it's a living system that converges toward accuracy with every iteration.

 ## Deterministic replay

@@ -84,9 +84,9 @@ Every ingestion session in KTX produces a full transcript: every tool call the L

 This matters for three reasons.

-**Debugging.** When a semantic source looks wrong — the grain is off, a join points to the wrong table, a measure formula doesn't match the business definition — you can trace it back to the ingestion session that created it. The transcript shows exactly which adapter provided the input, how the LLM interpreted it, and why it made the decision it did. You don't have to guess.
+**Debugging.** When a semantic source looks wrong - the grain is off, a join points to the wrong table, a measure formula doesn't match the business definition - you can trace it back to the ingestion session that created it. The transcript shows exactly which adapter provided the input, how the LLM interpreted it, and why it made the decision it did. You don't have to guess.

-**Trust.** Analytics teams need to trust the context that agents consume. Deterministic replay means you can verify any part of the context layer by re-examining the session that produced it. If a stakeholder asks "where did this revenue definition come from?", you have a complete audit trail — from the dbt manifest entry, through the LLM's reconciliation logic, to the YAML file that was written.
+**Trust.** Analytics teams need to trust the context that agents consume. Deterministic replay means you can verify any part of the context layer by re-examining the session that produced it. If a stakeholder asks "where did this revenue definition come from?", you have a complete audit trail - from the dbt manifest entry, through the LLM's reconciliation logic, to the YAML file that was written.

 **Reproducibility.** Because ingestion sessions are recorded as structured transcripts (tool calls and responses, not just logs), they can be replayed for testing and validation. If you change your ingestion configuration or upgrade the LLM, you can replay previous sessions to see how the output would differ. This gives you a safety net for changes that affect how context is generated.

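Note on the hunk above: the replay idea in the "Reproducibility" paragraph can be illustrated with a small Python sketch. It is not KTX's implementation. The transcript layout, the tool names (`read_dbt_manifest`, `read_existing_source`), and the two `reconcile_*` functions are all hypothetical, since the diff does not show the real transcript schema. The sketch only shows the general technique: serve recorded tool responses in order, so that two versions of the decision logic can be compared against identical inputs.

```python
# Hypothetical transcript: each step records one tool call and its response.
# KTX's actual transcript schema is not shown in the source; this is assumed.
TRANSCRIPT = [
    {"tool": "read_dbt_manifest", "args": {"model": "orders"},
     "response": {"columns": ["id", "amount", "status"]}},
    {"tool": "read_existing_source", "args": {"name": "orders"},
     "response": None},
]


class ReplayTools:
    """Serves recorded responses in order instead of calling live tools."""

    def __init__(self, transcript):
        self._steps = list(transcript)
        self._pos = 0

    def call(self, tool, **args):
        if self._pos >= len(self._steps):
            raise RuntimeError("replay ran past the end of the transcript")
        step = self._steps[self._pos]
        # A mismatch means the new logic asked a different question than the
        # recorded session did, so the recorded answer no longer applies.
        if step["tool"] != tool or step["args"] != args:
            raise RuntimeError(f"replay diverged at step {self._pos}: {tool} {args}")
        self._pos += 1
        return step["response"]


def reconcile_v1(tools):
    """Hypothetical decision logic: create a source if none exists yet."""
    manifest = tools.call("read_dbt_manifest", model="orders")
    existing = tools.call("read_existing_source", name="orders")
    action = "create" if existing is None else "skip"
    return {"action": action, "columns": manifest["columns"]}


def reconcile_v2(tools):
    """Same tool calls as v1, but the surrogate key column is dropped."""
    manifest = tools.call("read_dbt_manifest", model="orders")
    existing = tools.call("read_existing_source", name="orders")
    action = "create" if existing is None else "skip"
    columns = [c for c in manifest["columns"] if c != "id"]
    return {"action": action, "columns": columns}


old_output = reconcile_v1(ReplayTools(TRANSCRIPT))
new_output = reconcile_v2(ReplayTools(TRANSCRIPT))
changed_keys = sorted(k for k in old_output if old_output[k] != new_output[k])
print(changed_keys)
```

Because both runs read the same recorded responses, any difference in output comes from the changed logic, not from the upstream data.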
@@ -5,7 +5,7 @@ description: What a context layer is, why agents need one, and how KTX compares

 ## The problem

-Give an agent access to your database and it will generate SQL. It might even produce a decent chart. But ask it a real analytics question — "what's our net revenue trend by segment?" — and things fall apart.
+Give an agent access to your database and it will generate SQL. It might even produce a decent chart. But ask it a real analytics question - "what's our net revenue trend by segment?" - and things fall apart.

 The agent doesn't know that `orders.amount` includes refunds and needs a status filter. It doesn't know that `customers` should join to `orders` on `customer_id`, not `id`. It doesn't know that your team stopped using `legacy_segments` six months ago, or that "enterprise" means contracts over $100k, not just big logos. It sees column names and types. It doesn't see your business.

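Note on the hunk above: the refund problem in the last context paragraph is easy to reproduce. The Python sketch below uses an in-memory SQLite table. The table layout and sample rows are invented for illustration, and it assumes refunded orders carry `status = 'refunded'`, matching the filter quoted later in the same document. It compares the sum an agent would produce from column names alone with the sum that applies the status filter.

```python
import sqlite3

# Hypothetical schema and rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER,
        amount REAL,
        status TEXT
    );
    INSERT INTO orders VALUES
        (1, 10, 100.0, 'completed'),
        (2, 10,  40.0, 'refunded'),
        (3, 11,  60.0, 'completed');
    """
)

# What an agent writes when it only sees column names and types.
naive_revenue = conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]

# What the business definition requires: refunded orders are excluded.
net_revenue = conn.execute(
    "SELECT SUM(amount) FROM orders WHERE status != 'refunded'"
).fetchone()[0]

print(naive_revenue, net_revenue)  # 200.0 160.0
conn.close()
```

Both queries are valid SQL, but only the second matches the business definition of revenue.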
@@ -17,15 +17,15 @@ Analytics engineers already know this pain. It's the same reason you write dbt t

 The industry has moved through three distinct approaches to getting AI and data to work together.

-**Wave one: database access.** Connect an LLM to a database, let it generate SQL. This works for simple lookups — "how many orders last week?" — but breaks on anything that requires business knowledge. The agent guesses at joins, invents metrics, and hallucinates table relationships. Every query is a coin flip.
+**Wave one: database access.** Connect an LLM to a database, let it generate SQL. This works for simple lookups - "how many orders last week?" - but breaks on anything that requires business knowledge. The agent guesses at joins, invents metrics, and hallucinates table relationships. Every query is a coin flip.

-**Wave two: semantic layers and text-to-SQL.** Add structure. Define metrics in MetricFlow or Cube, expose schemas, build text-to-SQL pipelines. This is better — the agent knows that `revenue` means `sum(amount) where status != 'refunded'` — but building and maintaining that structure by hand is manual, time-consuming, and still limited. Semantic layers define what to calculate, not why, when, or how to interpret the result. The agent can compute net revenue but doesn't know about the February refund anomaly, the segment reclassification, or the fact that `enterprise` changed definition last quarter.
+**Wave two: semantic layers and text-to-SQL.** Add structure. Define metrics in MetricFlow or Cube, expose schemas, build text-to-SQL pipelines. This is better - the agent knows that `revenue` means `sum(amount) where status != 'refunded'` - but building and maintaining that structure by hand is manual, time-consuming, and still limited. Semantic layers define what to calculate, not why, when, or how to interpret the result. The agent can compute net revenue but doesn't know about the February refund anomaly, the segment reclassification, or the fact that `enterprise` changed definition last quarter.

-**Wave three: agentic context.** AI is no longer just answering questions — it's generating dashboards, writing semantic definitions, proposing dbt models, creating tests and documentation. For that to work, agents need more than metric definitions. They need the full picture: business rules, known data quality issues, relationship maps, historical context, and the institutional knowledge that lives in your team's heads. They need a context layer.
+**Wave three: agentic context.** AI is no longer just answering questions - it's generating dashboards, writing semantic definitions, proposing dbt models, creating tests and documentation. For that to work, agents need more than metric definitions. They need the full picture: business rules, known data quality issues, relationship maps, historical context, and the institutional knowledge that lives in your team's heads. They need a context layer.

 ## What a context layer is

-A context layer is the infrastructure that gives agents the business knowledge they need to produce correct analytics artifacts. It includes a semantic layer — that's a critical component — but it's not the whole thing.
+A context layer is the infrastructure that gives agents the business knowledge they need to produce correct analytics artifacts. It includes a semantic layer - that's a critical component - but it's not the whole thing.

 KTX organizes context into four pillars:

@@ -67,7 +67,7 @@ measures:
 expr: count(id)
 ```

-**Wiki pages** are Markdown documents that capture business definitions, rules, and operating context — the kind of context that doesn't fit in a schema definition. Pages have structured frontmatter (summary, tags, semantic layer references) and free-form content. Agents search them when they need to understand why a metric works a certain way, not just how to compute it.
+**Wiki pages** are Markdown documents that capture business definitions, rules, and operating context - the kind of context that doesn't fit in a schema definition. Pages have structured frontmatter (summary, tags, semantic layer references) and free-form content. Agents search them when they need to understand why a metric works a certain way, not just how to compute it.

 ```markdown
 ---
@@ -91,9 +91,9 @@ canonical revenue reporting.

 **Scan artifacts** are the raw output of KTX's database scanner: table and column metadata, inferred foreign key relationships (even without declared constraints), column statistics, and enrichment reports. They form the foundation that semantic sources are built on.

-**Provenance** is the record of how context was created and changed. Every ingestion session records a full transcript — which adapter ran, what the LLM decided, which sources were created or updated, and why. This is what makes the system auditable: you can trace any semantic source back to the ingestion decision that created it.
+**Provenance** is the record of how context was created and changed. Every ingestion session records a full transcript - which adapter ran, what the LLM decided, which sources were created or updated, and why. This is what makes the system auditable: you can trace any semantic source back to the ingestion decision that created it.

-Together, these four pillars give agents enough context to produce analytics artifacts that match what your team would produce — not just syntactically valid SQL, but the right query for the question.
+Together, these four pillars give agents enough context to produce analytics artifacts that match what your team would produce - not just syntactically valid SQL, but the right query for the question.

 ## How KTX compares

@@ -115,7 +115,7 @@ If you do not have a semantic layer, KTX can build an agent-native one from your

 ## The plain-files philosophy

-A KTX project is a directory of plain files. No server to run, no database to manage, no proprietary store to back up. Everything is YAML, Markdown, and SQLite — formats you can read, diff, and version-control with tools you already use.
+A KTX project is a directory of plain files. No server to run, no database to manage, no proprietary store to back up. Everything is YAML, Markdown, and SQLite - formats you can read, diff, and version-control with tools you already use.

 ```
 my-project/
@@ -140,7 +140,7 @@ my-project/
 └── cache/ # Runtime cache (git-ignored)
 ```

-Semantic sources and wiki pages are committed to git. The SQLite database holds ephemeral state — schema ingest results, embedding indexes, session logs — and is git-ignored. If you delete it, KTX rebuilds it on the next run.
+Semantic sources and wiki pages are committed to git. The SQLite database holds ephemeral state - schema ingest results, embedding indexes, session logs - and is git-ignored. If you delete it, KTX rebuilds it on the next run.

 This means your analytics context travels with your code. You can fork it, branch it, review it in a PR, and merge it with the same tools you use for dbt models. There's no sync problem between a remote server and your local state. There's no migration to run. The files are the source of truth.
