mirror of
https://github.com/Kaelio/ktx.git
synced 2026-06-07 07:55:13 +02:00
Add ktx ingestion blog post draft
Standalone long-form technical post explaining ktx ingestion end to end: the parallel-LLM-agent / isolated-worktree model, deterministic serial integration, conflict resolution, reconciliation, fail-closed gates, and the atomic squash commit. Drafted from the 2026-06-03 design spec.
This commit is contained in:
parent
45aa95d2cc
commit
20d8c627b6
1 changed files with 428 additions and 0 deletions
428
docs/ktx-ingestion-blog-post.md
Normal file
428
docs/ktx-ingestion-blog-post.md
Normal file
|
|
@ -0,0 +1,428 @@
|
|||
# How ktx ingests your data stack
|
||||
|
||||
## The hard problem
|
||||
|
||||
**ktx** is a context layer for data agents. It gives an agent the reviewed
|
||||
surface it needs to query a warehouse accurately: executable semantic-layer
|
||||
YAML for tables, metrics, and joins, plus wiki Markdown for the business
|
||||
meaning around those definitions. Ingestion is how the context engine builds
|
||||
and refreshes that layer from primary sources and context sources: databases,
|
||||
modeling projects, BI tools, query history, and docs.
|
||||
|
||||
That sounds like an extraction problem. It is really a concurrency problem.
|
||||
Good ingestion has to let many fallible LLM work units inspect different pieces
|
||||
of evidence and write to the same file tree. Those work units may discover the
|
||||
same metric, touch the same semantic source, rewrite the same wiki page, or
|
||||
reach contradictory conclusions from evidence that looked clean in isolation.
|
||||
|
||||
The requirements are unforgiving. The result must be deterministic where
|
||||
**ktx** controls the process. It must be atomic, so a run either lands as one
|
||||
reviewable change or leaves the committed context layer alone. And it must
|
||||
never be silently wrong: when the system cannot prove a conflict is safe, it
|
||||
has to create a decision point instead of blending facts together.
|
||||
|
||||
This post follows one ingest run end to end. The interesting part is not just
|
||||
that **ktx** can read PostgreSQL, Snowflake, BigQuery, ClickHouse, MySQL, SQL
|
||||
Server, SQLite, dbt, MetricFlow, LookML, Looker, Metabase, and Notion. The
|
||||
interesting part is how it lets LLM agents write files safely.
|
||||
|
||||
[[image: Hero pipeline showing primary sources and context sources flowing into ktx ingest, then raw snapshot and diff, a parallel work-unit band, serial patch integration, reconciliation, finalization and gates, one atomic commit, and the two outputs: semantic-layer YAML and wiki Markdown. Add three badges: deterministic, atomic, never silently wrong.]]
|
||||
|
||||
## What ingestion produces
|
||||
|
||||
Ingestion produces two durable surfaces.
|
||||
|
||||
Semantic-layer YAML holds the structured, executable part: semantic sources,
|
||||
tables, columns, measures, joins, filters, segments, and enough metadata for the
|
||||
compiler to turn a semantic query into SQL. Wiki Markdown holds the explanatory
|
||||
part: definitions, caveats, ownership, reporting policies, anomalies, and
|
||||
decisions that help an agent know when a metric is appropriate.
|
||||
|
||||
The split is deliberate. Structured data lives where it can be compiled. Prose
|
||||
lives where it can be searched. Wiki pages can point back to semantic sources
|
||||
through `sl_refs`, so the statement "net revenue excludes refunds" can remain a
|
||||
human-readable note while still anchoring itself to the executable revenue
|
||||
measure.
|
||||
|
||||
That is the difference between the context layer and the semantic layer. The
|
||||
context layer is the whole reviewed surface agents use. The context engine is
|
||||
the active machinery that builds, reconciles, validates, indexes, and serves
|
||||
that surface. The semantic layer is the compiler pillar inside it.
|
||||
|
||||
[[image: Two pillars of the ktx context layer: semantic-layer YAML on the left labeled compiled to SQL, wiki Markdown on the right labeled searched for meaning, with sl_refs linking wiki pages back to semantic sources.]]
|
||||
|
||||
## Sources, connectors, and bundles
|
||||
|
||||
The public command is one command: `ktx ingest`. When more than one configured
|
||||
connection is selected, the planner runs database ingest targets first and then
|
||||
context-source connection targets. Database ingest builds enriched warehouse
|
||||
context. Context-source connection ingest imports evidence from tools and
|
||||
documents that already carry business or modeling knowledge.
|
||||
|
||||
In public prose, these plugins are connectors. Internally, the extension point
|
||||
is the `SourceAdapter` interface: every connector must be able to detect whether
|
||||
it recognizes staged files and split those files into chunks. Other hooks are
|
||||
optional. A connector can fetch its own snapshot, project deterministic output,
|
||||
finalize after reconciliation, describe the scope of a partial snapshot, or
|
||||
cluster work units before agents run.
|
||||
|
||||
A bundle is one snapshot's payload. It may come from an upload, a scheduled
|
||||
pull, or an override that replays a prior snapshot through the current
|
||||
reconciliation path. Once the bundle exists on disk, the rest of the pipeline
|
||||
is intentionally uniform.
|
||||
|
||||
| Input kind | What it contributes |
|
||||
|---|---|
|
||||
| Live database | Tables, columns, types, keys, comments, samples, and relationship evidence |
|
||||
| Query history | SQL usage patterns and common table relationships |
|
||||
| dbt | Model and source definitions, descriptions, tests, and lineage |
|
||||
| MetricFlow | Semantic models, measures, metrics, entities, and relationships |
|
||||
| LookML | Views, explores, fields, joins, and warehouse mappings |
|
||||
| Looker | Explores, looks, dashboards, field metadata, and folder context |
|
||||
| Metabase | Dashboards, questions, SQL, and database-to-warehouse mappings |
|
||||
| Notion | Knowledge pages, hierarchy, and document evidence for wiki capture |
|
||||
|
||||
Primary sources: PostgreSQL, Snowflake, BigQuery, ClickHouse, MySQL, SQL Server, SQLite.
|
||||
Context sources: dbt, MetricFlow, LookML, Looker, Metabase, Notion. Different
|
||||
connectors group raw files differently, but they all hand the orchestrator a
|
||||
bundle with the same shape: evidence in, work units out, provenance recorded.
|
||||
|
||||
[[image: Sources to connector to bundle: live database, historic SQL, dbt, MetricFlow, LookML, Looker, Metabase, and Notion entering a uniform SourceAdapter-shaped connector boundary, then producing a bundle with provenance kinds upload, scheduled pull, and override.]]
|
||||
|
||||
## From a warehouse: scan and enrichment
|
||||
|
||||
Database ingest delegates to an enriched scan. The scan connector reads a
|
||||
structural catalog from the primary source: tables or views, columns, native
|
||||
types, normalized types, nullability, primary keys, foreign keys, comments, and
|
||||
row-count estimates when the database exposes them. If constraint metadata
|
||||
cannot be read because of permissions, the snapshot keeps going and records a
|
||||
warning instead of pretending the metadata was complete.
|
||||
|
||||
Then enrichment starts. For descriptions, **ktx** asks for a table description
|
||||
and all column descriptions in one batched table call, using existing comments
|
||||
and sampled data as evidence. For embeddings, it builds column-level embedding
|
||||
text from names, types, table context, descriptions, sample values, and foreign
|
||||
key hints. Embedding calls are batched, and when the provider does not expose a
|
||||
positive maximum batch size, the default batch size is `100`.
|
||||
|
||||
Relationship discovery is the densest part of the scan. It starts with formal
|
||||
metadata, then generates deterministic candidates from name similarity,
|
||||
table-name normalization, suffixes such as `_id`, `_key`, `_code`, and `_uuid`,
|
||||
profile overlap, embedding locality, and optional LLM proposals. Candidates are
|
||||
scored with seven signals: name similarity, type compatibility, value overlap,
|
||||
embedding similarity, target uniqueness, null-rate evidence, and a structural
|
||||
prior.
|
||||
|
||||
Scoring is not the end. When the connector can run read-only SQL, **ktx**
|
||||
profiles table and column values and validates candidate joins against sampled
|
||||
real data. The validation budget defaults to `min(2 * tableCount, 1000)`, so
|
||||
large schemas do not turn relationship discovery into an unbounded database
|
||||
load. Candidates outside the validation budget remain review candidates instead
|
||||
of being overclaimed.
|
||||
|
||||
After validation, graph resolution chooses accepted, review, and rejected
|
||||
relationships. It reasons about primary-key-like targets, resolves conflicts
|
||||
where one candidate source column points to multiple parents, and can detect
|
||||
multi-column relationships. Accepted formal, inferred, and composite
|
||||
relationships are written into `_schema` manifest shards.
|
||||
|
||||
Those manifest shards are not a blind overwrite. When a warehouse is scanned
|
||||
again, scan-managed descriptions and usage fields are refreshed, while
|
||||
non-scan descriptions, existing usage fields, and joins that still point to
|
||||
present tables are preserved. That gives later reconciliation a stable on-disk
|
||||
surface instead of forcing every run to rediscover the user's accepted edits.
|
||||
|
||||
[[image: Relationship discovery funnel: candidate generation from name similarity, embeddings, suffix stripping, profiles, and LLM proposals; seven-signal scoring; validation on sampled real data with a min(2 x tables, 1000) budget badge; graph resolution; accepted joins.]]
|
||||
|
||||
## The core problem, stated precisely
|
||||
|
||||
At this point the context engine has raw evidence. The hard part is turning
|
||||
that evidence into reviewed artifacts without letting nondeterminism leak into
|
||||
the committed context layer.
|
||||
|
||||
Three requirements drive the rest of the design.
|
||||
|
||||
First, ordering must be deterministic wherever **ktx** controls ordering:
|
||||
target selection, file sorting, work-unit collection, patch names, integration
|
||||
order, and tiebreaks. Second, landing must be atomic: the context layer should
|
||||
not contain half of a bundle because one late gate failed. Third, uncertainty
|
||||
must be visible. The system can accept identical content, deterministic
|
||||
replacements, and clearly elected canonical artifacts. It cannot hide
|
||||
unresolved textual or semantic conflict in a polished-looking file.
|
||||
|
||||
The mechanism that makes this possible is a combination of raw snapshots,
|
||||
stable identity, isolated git worktrees, serial patch integration,
|
||||
reconciliation, final gates, and provenance.
|
||||
|
||||
## Raw snapshot and diff: stable identity
|
||||
|
||||
Stage 1 copies every raw bundle file into the session worktree under
|
||||
`raw-sources/<connection>/<source>/<syncId>/`. Each raw file is hashed with
|
||||
SHA-256 before it is written. The raw path is the identity; the hash is the
|
||||
change detector.
|
||||
|
||||
Diffing compares the current hashes with the latest completed sync for the same
|
||||
connection and context source. The buckets are simple and sorted: added,
|
||||
modified, deleted, and unchanged. That matters because connectors can use those
|
||||
buckets to skip unchanged work and because deletions become first-class
|
||||
evidence for later evictions.
|
||||
|
||||
Artifact identity is stable too. A write is identified as
|
||||
`target:connectionId:key`, so two domains can both have `revenue` without
|
||||
colliding. `finance.revenue` and `marketing.revenue` can coexist because their
|
||||
target connection and key are part of the identity.
|
||||
|
||||
Every durable outcome becomes provenance. Action types include
|
||||
`source_created`, `measure_added`, `join_added`, `merged`, `subsumed`,
|
||||
`wiki_written`, and `skipped`. If a current raw file did not produce a semantic
|
||||
source or wiki action, the fallback provenance row records it as skipped. That
|
||||
keeps future syncs honest: **ktx** can detect changes, know what artifacts came
|
||||
from deleted raw files, and avoid treating "no output" as "never processed."
|
||||
|
||||
[[image: Raw snapshot diff and provenance ledger: raw-sources path tree with added, modified, deleted, and unchanged buckets, plus a ledger table showing raw path, hash, artifact identity, action, and sync id.]]
|
||||
|
||||
## Work units: the unit of parallel work
|
||||
|
||||
A work unit is a connector-chosen grouping of raw files. It is not necessarily
|
||||
one table, one dashboard, or one LLM call. For one connector, a work unit might
|
||||
be a dashboard and its cards. For another, it might be a connected component of
|
||||
MetricFlow metrics and semantic models. For Notion, it might be a page, or a
|
||||
span of an oversized page.
|
||||
|
||||
Each work unit has three file sets. `rawFiles` are the files the work unit owns
|
||||
and can use as direct evidence. `dependencyPaths` are read-only context files
|
||||
that help interpret the owned evidence. `peerFileIndex` is a sorted index of
|
||||
nearby files so the agent can understand the neighborhood without receiving
|
||||
the whole bundle in its prompt.
|
||||
|
||||
On re-sync, connector chunkers can run only work units that contain added or
|
||||
modified files. Unchanged peers can be demoted to dependencies. That is how a
|
||||
large context-source connection ingest avoids rerunning all agent work when
|
||||
only one dashboard or page changed.
|
||||
|
||||
Some connectors add more routing before the model runs. Notion and Looker can
|
||||
use triage lanes to skip, lightly process, or fully process content. Notion can
|
||||
also cluster work units using embedding similarity so related knowledge is
|
||||
handled together. Bounds are explicit: a prompt larger than `240000`
|
||||
characters fails before the model runs, and the default work-unit step budget
|
||||
is `40`.
|
||||
|
||||
## Isolated git worktrees
|
||||
|
||||
The central trick is isolation. Every work unit runs in its own throwaway git
|
||||
worktree created from the same ingestion base SHA. That means two agents can
|
||||
write files with the same initial view of the context layer without seeing each
|
||||
other's partial edits.
|
||||
|
||||
Inside that worktree, the agent gets a constrained toolset: read raw files or
|
||||
spans, read and write semantic-layer YAML and wiki Markdown, validate touched
|
||||
semantic sources, inspect context evidence when available, and load the ingest
|
||||
skills. The run is bounded by the work-unit step budget. If the agent loop
|
||||
errors, calls a tool that fails, creates dangling wiki references, or writes a
|
||||
semantic source that fails validation, the work unit fails and its worktree is
|
||||
reset before any patch is collected.
|
||||
|
||||
Successful work units emit numbered, index-prefixed git patches from the
|
||||
shared base to the child worktree head. The patches are binary-safe and
|
||||
generated with rename detection disabled. Failed work units keep their failure
|
||||
record, but their file changes vanish with the throwaway worktree.
|
||||
|
||||
This gives **ktx** the pattern it needs: parallel produce, serial integrate.
|
||||
Work units can run through a concurrency limiter, but
|
||||
`ingest.workUnits.maxConcurrency` defaults to `1`. The machinery is there so a
|
||||
project can raise concurrency without changing correctness rules. Results are
|
||||
collected by work-unit index, patch filenames start with that index, and
|
||||
successful patches are integrated one at a time in index order.
|
||||
|
||||
[[image: Parallel produce, serial integrate: several work-unit lanes each in a throwaway git worktree branched from the same base SHA; one lane fails and is discarded; successful lanes emit numbered patches that are applied one at a time to the shared integration worktree.]]
|
||||
|
||||
## Per-work-unit gates
|
||||
|
||||
Bad work-unit output should not enter the shared integration tree. **ktx**
|
||||
checks that before a patch exists.
|
||||
|
||||
The first gate is prompt size. If the system prompt plus user prompt is over
|
||||
`240000` characters, the work unit fails before the model runs. The second gate
|
||||
is the agent loop itself: model-loop errors and fatal tool-call failures fail
|
||||
the work unit. The third gate checks wiki references created by the work unit
|
||||
and rejects dangling links. The fourth gate validates the semantic sources the
|
||||
work unit touched.
|
||||
|
||||
Local semantic-layer validation is intentionally shape-only during local ingest:
|
||||
the local runtime uses `probeRowCount: 0`, so validation checks YAML structure,
|
||||
composition, and semantic-layer shape rather than probing live warehouse rows
|
||||
for every write. Later gates repeat validation over the integrated tree and
|
||||
include direct join neighbors, but local ingest still remains a local file and
|
||||
compiler check rather than a full warehouse proof.
|
||||
|
||||
## Integration and conflict resolution
|
||||
|
||||
Patch integration is the first merge layer. For each successful work unit,
|
||||
**ktx** checks that the patch touches only allowed artifact paths, obeys target
|
||||
connection policy, and does not write semantic-layer files when that work unit
|
||||
was restricted from doing so. Then it applies the patch with 3-way git
|
||||
application against the shared integration worktree.
|
||||
|
||||
If the patch applies cleanly, **ktx** runs semantic gates on the touched paths.
|
||||
If git cannot apply the patch because another accepted patch changed the same
|
||||
text, **ktx** makes one constrained repair attempt. The repair agent can read
|
||||
the failed patch and only the touched integration files. It can edit only those
|
||||
allowed paths. Its instructions are conservative: preserve unrelated accepted
|
||||
content, incorporate patch evidence only when compatible, and do not create
|
||||
facts absent from the current file or failed patch.
|
||||
|
||||
Semantic gate failures get the same bounded treatment: one constrained repair
|
||||
attempt against only the affected files. A repair that completes without
|
||||
changing an allowed file fails. A repair that changes files but still fails the
|
||||
gate fails. Unresolved textual or semantic conflict marks the run failed and
|
||||
the session worktree is discarded.
|
||||
|
||||
The second merge layer is semantic reconciliation, described in the next
|
||||
section. It is where **ktx** handles the question readers usually ask: what
|
||||
wins when multiple pieces of evidence point at the same concept?
|
||||
|
||||
| Case | Resolution |
|
||||
|---|---|
|
||||
| Identical content | Keep the existing artifact; do not rewrite just to churn the diff. |
|
||||
| Expression-only re-ingest change | Replace silently when the name, grain, columns, and structure stay the same. |
|
||||
| Structural re-ingest change with prior provenance | Replace and flag for human review because the new bundle is a signal, but the semantic break matters. |
|
||||
| Same-bundle contradiction | Capture all variants and flag; there is no prior user signal that one variant wins. |
|
||||
| Canonical pin | Keep the pinned artifact name when it is valid; disambiguate competitors and do not re-flag a decision the user already made. |
|
||||
|
||||
Definitional contradictions get special handling. If two variants use the same
|
||||
bare concept name but compute substantively different things, **ktx** renames
|
||||
all variants with domain suffixes, keeps the contested bare name out of the
|
||||
semantic layer, and writes a unified `<concept>-definitions.md` wiki page that
|
||||
records the competing definitions and their evidence. Numeric suffixes are not
|
||||
enough; names need to say why the variant exists.
|
||||
|
||||
[[image: Conflict resolution decision tree showing identical content, expression-only re-ingest changes, structural re-ingest breaks, same-bundle contradictions, and canonical pins, with green accepted, amber flagged, and red fail-closed terminal nodes.]]
|
||||
|
||||
## Reconciliation: the cross-work-unit sweep
|
||||
|
||||
Reconciliation is Stage 4: a fresh sweep over the whole job after individual
|
||||
work units and serial patch integration finish. Its input is the Stage Index,
|
||||
the eviction set for deleted raw files, source-specific reconciliation notes,
|
||||
and, for document context sources, deduped context candidates.
|
||||
|
||||
The Stage Index lists each work unit, its status, owned raw files, write
|
||||
actions, touched semantic sources, conflict records, evictions, artifact
|
||||
resolutions, and unmapped fallback decisions. The reconciliation agent can
|
||||
inspect the index, diff specific work units, read raw spans, list artifacts
|
||||
connected to deleted raw files, and emit structured decisions through tools:
|
||||
conflict-resolution records, eviction decisions, artifact-resolution records,
|
||||
and unmapped-fallback records.
|
||||
|
||||
Reconciliation skips when there are no writes, no evictions, no document
|
||||
candidates, and no forced override. Otherwise it can run in a single pass or,
|
||||
for document evidence with pagination and create/update budgets, curator mode.
|
||||
Canonical pins are injected only when they are relevant to the Stage Index, and
|
||||
the reconciliation prompt tells the agent to apply those pins before flagging a
|
||||
same-name or near-duplicate conflict.
|
||||
|
||||
The deterministic election rules matter. For structural duplicates and
|
||||
near-duplicates, **ktx** elects a canonical artifact by inbound reference count,
|
||||
then by lexicographic unit key, then by lexicographic source name. The outcome
|
||||
does not depend on which model call happened to finish first. Same-content
|
||||
duplicates can be subsumed. Definitional contradictions are renamed, captured
|
||||
in a unified definitions page, and flagged. Evictions remove artifacts whose
|
||||
raw files disappeared, with a recorded raw-path reason.
|
||||
|
||||
[[image: Reconciliation sweep: Stage Index, eviction set, and context candidates flow into a reconciliation agent, which emits conflict-resolution, eviction-decision, artifact-resolution, and unmapped-fallback decision cards; include deterministic election by inbound refs 5/2/2 where 5 becomes canonical.]]
|
||||
|
||||
## Finalization, gates, and the atomic commit
|
||||
|
||||
After reconciliation, a connector may run deterministic finalization. Query-
|
||||
history ingest, for example, can project structured usage results after the
|
||||
agent-written artifacts have settled. Finalization has its own safety checks:
|
||||
it cannot modify a path that earlier projection, work units, or reconciliation
|
||||
already changed, and the paths it declares as touched must match the paths
|
||||
**ktx** derives from the actual file diff.
|
||||
|
||||
Then **ktx** repairs wiki `sl_refs` and builds the final gate scope. Changed
|
||||
wiki pages, touched semantic sources, reconciliation actions, finalization
|
||||
changes, and repaired references are folded into one validation set. Semantic
|
||||
sources are expanded to include direct join neighbors. Wiki `sl_refs`, wiki
|
||||
refs, body references, semantic-layer validation, direct join neighbors, and
|
||||
target path policy are checked before anything lands.
|
||||
|
||||
If a final artifact gate fails, **ktx** allows one conservative repair attempt
|
||||
over the affected semantic-layer and wiki files. The repair prompt is explicit:
|
||||
read the gate error first, inspect only allowed files, make the smallest edit,
|
||||
preserve accepted work, and stop without editing if the problem requires
|
||||
choosing between conflicting facts without evidence. A repair that changes
|
||||
nothing fails.
|
||||
|
||||
Provenance rows are validated before landing. Every row must reference a raw
|
||||
path from the current snapshot or from the deletion set. Only after those gates
|
||||
pass does the session worktree squash-merge into the main project as one
|
||||
commit. If the squash has no touched paths, there may be no commit. If it has
|
||||
changes, the whole run lands together.
|
||||
|
||||
After the squash commit, **ktx** updates search indexes. Wiki search syncs from
|
||||
the squashed diff, wiki-to-semantic references are synced, and touched semantic
|
||||
sources are re-indexed for semantic-layer search.
|
||||
|
||||
[[image: All-or-nothing landing: final gates pass and the session worktree squash-merges as one commit; any final gate or conflict failure discards the worktree; outputs are re-indexed semantic-layer search and wiki search after the commit.]]
|
||||
|
||||
## Caveats and honest boundaries
|
||||
|
||||
The determinism is bounded. **ktx** controls result collection, sorting,
|
||||
patch-file naming, integration order, and deterministic tiebreaks. Model
|
||||
content and transcript timing are not deterministic in the same way. The design
|
||||
contains nondeterminism by making outputs pass gates and by integrating in a
|
||||
stable order.
|
||||
|
||||
Work units are serial by default:
|
||||
`ingest.workUnits.maxConcurrency` defaults to `1`. Raising concurrency is a
|
||||
performance choice, not a different correctness model, because every work unit
|
||||
still starts from the same base SHA and integration remains serial.
|
||||
|
||||
Local validation is shape-only for semantic-layer writes during local ingest,
|
||||
because local validation uses `probeRowCount: 0`. That catches malformed YAML,
|
||||
composition failures, broken references, and compiler-shape problems. It does
|
||||
not prove every table or column exists in the live warehouse at write time.
|
||||
|
||||
Work units are not automatically retried. Repairs are bounded to one attempt
|
||||
for textual conflicts, one attempt for semantic gate failures during patch
|
||||
integration, and one attempt for final artifact gates. If the repair cannot
|
||||
make a defensible change, the conflict remains a failure.
|
||||
|
||||
Partial failure is first-class. Failed work units are recorded in the report,
|
||||
and a run can complete with failed work units if other work units succeeded and
|
||||
the final gates pass. That is separate from partial fetch status, which means a
|
||||
connector could not fetch the full snapshot and recorded that fetch boundary.
|
||||
|
||||
Query-history ingest has source limits. Postgres query-history ingest requires
|
||||
PostgreSQL 14 or newer, `pg_stat_statements`, and `pg_read_all_stats`. BigQuery
|
||||
and Snowflake read from their account or job history surfaces and are the
|
||||
query-history paths that honor the query-history window flag; Postgres reads
|
||||
the current aggregate in `pg_stat_statements`.
|
||||
|
||||
Context sources have limits too. MetricFlow conversion metrics are not yet
|
||||
supported, and cumulative metrics are carried with limited semantics. LookML
|
||||
derived tables are unsupported as semantic-layer sources. Notion is
|
||||
knowledge-only: it writes wiki knowledge, uses warehouse or dbt mappings before
|
||||
touching semantic-layer YAML, and enforces per-run create and update caps.
|
||||
|
||||
Observability is built into the local run. Traces live at
|
||||
`.ktx/ingest-traces/<jobId>/trace.jsonl`, work-unit and reconciliation
|
||||
transcripts live under `.ktx/ingest-transcripts/<jobId>/`, and ingest profiling
|
||||
is displayed when `KTX_PROFILE_INGEST` is enabled.
|
||||
|
||||
## Why it is built this way
|
||||
|
||||
The shape follows from the risk. A data agent needs a context layer it can act
|
||||
on, not a pile of plausible notes. That means evidence must be traceable,
|
||||
semantic definitions must compile, wiki references must stay live, and
|
||||
conflicts must either resolve by rule or stop at a decision point.
|
||||
|
||||
Worktree isolation lets fallible agents produce candidate edits without
|
||||
colliding. Deterministic tiebreaks make cross-work-unit decisions independent
|
||||
of timing. Fail-closed gates keep unresolved conflicts out of the committed
|
||||
context layer. Provenance makes future syncs compare against what actually
|
||||
happened, not what the system hoped happened.
|
||||
|
||||
That is why **ktx** ingestion looks less like a scraper and more like a compiler
|
||||
with a conscience: it translates messy data-stack evidence into files agents
|
||||
can use, while refusing to silently turn ambiguity into authority.
|
||||
Loading…
Add table
Add a link
Reference in a new issue