mirror of
https://github.com/Kaelio/ktx.git
synced 2026-06-28 08:49:38 +02:00
Refine isolated-diff ingestion design after adversarial review iteration 1
This commit is contained in:
parent
2a19e88806
commit
0849dcdcfa
1 changed files with 124 additions and 13 deletions
|
|
@ -99,7 +99,7 @@ converts authoritative source facts into KTX artifacts without an agent. Good
|
||||||
examples are live database schema introspection and straightforward MetricFlow
|
examples are live database schema introspection and straightforward MetricFlow
|
||||||
semantic-model import.
|
semantic-model import.
|
||||||
|
|
||||||
The adapter contract remains small:
|
The isolation-relevant adapter surface remains small:
|
||||||
|
|
||||||
```ts
|
```ts
|
||||||
interface SourceAdapter {
|
interface SourceAdapter {
|
||||||
|
|
@ -114,6 +114,15 @@ interface SourceAdapter {
|
||||||
}
|
}
|
||||||
```
|
```
|
||||||
|
|
||||||
|
This is the subset the isolated-diff runner needs to understand source-shaped
|
||||||
|
planning and deterministic projection. It is not a proposal to delete existing
|
||||||
|
`SourceAdapter` fields. Existing lifecycle and source-support fields such as
|
||||||
|
`detect`, `readFetchReport`, `listTargetConnectionIds`, `clusterWorkUnits`,
|
||||||
|
`describeScope`, `onPullSucceeded`, `evidenceIndexing`, `triageSupported`,
|
||||||
|
`getTriageSignals`, and `reconcileSkillNames` stay part of the adapter contract
|
||||||
|
until a separate cleanup intentionally removes them with migration impact
|
||||||
|
called out.
|
||||||
|
|
||||||
`chunk()` returns ordinary `WorkUnit[]`. The runner does not need a
|
`chunk()` returns ordinary `WorkUnit[]`. The runner does not need a
|
||||||
`planningStrategy` enum because the source adapter can plan by any domain shape
|
`planningStrategy` enum because the source adapter can plan by any domain shape
|
||||||
that makes sense.
|
that makes sense.
|
||||||
|
|
@ -155,6 +164,15 @@ Each per-work-unit worktree starts from the same ingestion base commit. A work
|
||||||
unit never observes another concurrent work unit's transient edits. This makes
|
unit never observes another concurrent work unit's transient edits. This makes
|
||||||
the work unit diff a clean proposal against a stable base.
|
the work unit diff a clean proposal against a stable base.
|
||||||
|
|
||||||
|
The runner creates and runs child worktrees under the existing
|
||||||
|
`workUnitMaxConcurrency` setting. A run may have many planned work units, but no
|
||||||
|
more than that bound may be active or left on disk at once. The default remains
|
||||||
|
serial execution. Child worktrees must be cleaned up after the diff, transcript,
|
||||||
|
and outcome metadata are persisted, including failure paths. Adapters with
|
||||||
|
large fan-out, such as Notion, may use `clusterWorkUnits` before execution to
|
||||||
|
keep work-unit count tractable, but clustering remains source-shaped planning
|
||||||
|
rather than a separate execution mode.
|
||||||
|
|
||||||
## Work-unit lifecycle
|
## Work-unit lifecycle
|
||||||
|
|
||||||
Each work unit follows a fixed lifecycle.
|
Each work unit follows a fixed lifecycle.
|
||||||
|
|
@ -172,6 +190,39 @@ records: unit key, status, actions, touched semantic-layer sources, failure
|
||||||
reason, raw files, and transcript path. It does not add a proposal manifest.
|
reason, raw files, and transcript path. It does not add a proposal manifest.
|
||||||
The diff is the proposal.
|
The diff is the proposal.
|
||||||
|
|
||||||
|
For `slDisallowed` work units, isolation is defense in depth. The scoped
|
||||||
|
work-unit tools must withhold semantic-layer write and edit tools, and the
|
||||||
|
integration layer must reject any otherwise accepted diff from that work unit
|
||||||
|
that touches `semantic-layer/**`. This catches buggy or bypassed tool behavior
|
||||||
|
before an invalid LookML connection-mismatch write can reach the integration
|
||||||
|
worktree.
|
||||||
|
|
||||||
|
### Diff proposal contract
|
||||||
|
|
||||||
|
The proposal artifact is a Git patch with binary-safe content, not the existing
|
||||||
|
hash-based raw-source `DiffSet`.
|
||||||
|
|
||||||
|
The first implementation must use one pinned patch contract:
|
||||||
|
|
||||||
|
- collect `git diff --binary --no-renames <base>..HEAD`;
|
||||||
|
- disable rename and copy detection so renames are represented as delete plus
|
||||||
|
create in version one;
|
||||||
|
- preserve mode changes from the patch metadata, but reject unexpected
|
||||||
|
executable-mode or binary changes under known text artifact roots such as
|
||||||
|
`wiki/**` and `semantic-layer/**`;
|
||||||
|
- apply each accepted patch to the integration worktree with
|
||||||
|
`git apply --3way --index`;
|
||||||
|
- do not use `git apply --reject`, because partial hunk application is not an
|
||||||
|
accepted integration state; and
|
||||||
|
- if patch application fails, leaves conflicts, or touches a path disallowed for
|
||||||
|
that work unit, roll back the integration worktree to its pre-apply HEAD and
|
||||||
|
classify the outcome as a textual conflict.
|
||||||
|
|
||||||
|
Delete-versus-edit, recreate-versus-edit, and delete-versus-create races are
|
||||||
|
therefore textual conflicts when Git cannot apply the patch cleanly. If Git
|
||||||
|
applies the patch but known artifact validators reject the resulting tree, the
|
||||||
|
outcome is a semantic conflict.
|
||||||
|
|
||||||
## Integration lifecycle
|
## Integration lifecycle
|
||||||
|
|
||||||
The integration worktree applies accepted work-unit diffs after local gates
|
The integration worktree applies accepted work-unit diffs after local gates
|
||||||
|
|
@ -191,20 +242,53 @@ types, the runner uses artifact-aware merge helpers. For unknown file types, the
|
||||||
runner can fall back to agent-assisted conflict resolution in the integration
|
runner can fall back to agent-assisted conflict resolution in the integration
|
||||||
worktree.
|
worktree.
|
||||||
|
|
||||||
|
### Reconciliation in the new flow
|
||||||
|
|
||||||
|
Reconciliation remains a shared runner stage, but it runs as a serial
|
||||||
|
integration-stage pass instead of a parallel work unit.
|
||||||
|
|
||||||
|
The runner applies all accepted work-unit diffs to the integration worktree,
|
||||||
|
resolves textual conflicts that can be resolved, and then runs reconciliation in
|
||||||
|
that integration worktree before final global gates and before squash.
|
||||||
|
Reconciliation must see the integrated state because its job is to resolve
|
||||||
|
cross-work-unit duplicates, evictions, fallbacks, and source-specific
|
||||||
|
reconcile guidance.
|
||||||
|
|
||||||
|
Reconciliation is not allowed to mutate project main directly and does not run
|
||||||
|
inside any child worktree. Its changes are captured as a reconciliation diff
|
||||||
|
against the pre-reconciliation integration HEAD, recorded in the existing
|
||||||
|
stage/report metadata, and validated with the same touched-artifact and scoped
|
||||||
|
connection gates as work-unit writes. The final global gates validate the
|
||||||
|
combined tree after reconciliation. If reconciliation introduces an invalid
|
||||||
|
wiki or semantic-layer reference, touches a disallowed target, or records an
|
||||||
|
unresolvable artifact conflict, the run fails or routes to a resolver before
|
||||||
|
squash.
|
||||||
|
|
||||||
## Artifact-aware integration
|
## Artifact-aware integration
|
||||||
|
|
||||||
KTX durable artifacts are structured enough that git-only merge is not a strong
|
KTX durable artifacts are structured enough that git-only merge is not a strong
|
||||||
correctness boundary. Artifact-aware integration must parse and validate known
|
correctness boundary. Artifact-aware integration must parse and validate known
|
||||||
file classes after diffs are applied.
|
file classes after diffs are applied.
|
||||||
|
|
||||||
The first implementation must cover these artifact classes:
|
The first implementation must cover these worktree file classes:
|
||||||
|
|
||||||
- semantic-layer source YAML;
|
- semantic-layer source YAML;
|
||||||
- wiki markdown frontmatter;
|
- wiki markdown frontmatter;
|
||||||
- wiki body references to semantic-layer sources, measures, dimensions, and raw
|
- wiki body references to semantic-layer sources, measures, dimensions, and raw
|
||||||
warehouse tables;
|
warehouse tables.
|
||||||
- fallback records; and
|
|
||||||
- provenance records that map raw paths to written artifacts.
|
Unmapped fallback records are not worktree files in version one. They remain
|
||||||
|
typed stage-index and report records emitted by `emit_unmapped_fallback`; the
|
||||||
|
integration layer validates their raw paths and structured reason codes as
|
||||||
|
report metadata, not as mergeable artifacts.
|
||||||
|
|
||||||
|
Provenance also stays out of the worktree in version one. The source of truth is
|
||||||
|
the ingest provenance store and report body. Before inserting provenance rows,
|
||||||
|
the global gate derives the planned rows from accepted work-unit actions,
|
||||||
|
reconciliation actions, artifact-resolution records, and skipped raw files, then
|
||||||
|
checks those rows against the integrated worktree and staged raw hashes. Moving
|
||||||
|
provenance to on-disk files would be a separate schema migration, not part of
|
||||||
|
this design.
|
||||||
|
|
||||||
Artifact-aware integration can start with validation-only behavior. It does not
|
Artifact-aware integration can start with validation-only behavior. It does not
|
||||||
need to auto-merge every semantic conflict in version one. If two diffs contest
|
need to auto-merge every semantic conflict in version one. If two diffs contest
|
||||||
|
|
@ -227,12 +311,32 @@ The final gates must include:
|
||||||
measure-level references;
|
measure-level references;
|
||||||
- wiki body validation for explicit semantic-layer source, measure, dimension,
|
- wiki body validation for explicit semantic-layer source, measure, dimension,
|
||||||
and table references; and
|
and table references; and
|
||||||
- provenance validation for raw paths referenced by new or changed artifacts.
|
- provenance validation for raw paths referenced by new or changed artifacts
|
||||||
|
before those rows are inserted into SQLite.
|
||||||
|
|
||||||
|
The wiki body gate needs a narrow grammar so ordinary prose does not become a
|
||||||
|
semantic-layer reference. In version one, an explicit body reference is one of
|
||||||
|
these Markdown forms outside fenced code blocks:
|
||||||
|
|
||||||
|
- an inline code token in the form `source.entity`, where `source` matches a
|
||||||
|
visible semantic-layer source and `entity` must match one of that source's
|
||||||
|
measures, dimensions, or segments;
|
||||||
|
- an inline code token in the form `connectionId/source.entity`, which validates
|
||||||
|
against that specific target connection;
|
||||||
|
- an inline code token in the form `source:source_name`, which validates a
|
||||||
|
source-level semantic-layer reference; or
|
||||||
|
- an inline code token in the form `table:qualified_table_name`, which validates
|
||||||
|
a raw warehouse table reference against the visible raw table/catalog sources.
|
||||||
|
|
||||||
|
The parser ignores unformatted prose, fenced SQL examples, and unprefixed
|
||||||
|
single-token inline code. Two-part inline code that does not name a visible
|
||||||
|
semantic-layer source is not treated as an SL entity reference; use the
|
||||||
|
`table:` prefix for raw warehouse table references.
|
||||||
|
|
||||||
The `total_contract_arr_cents` incident is the regression case for this gate:
|
The `total_contract_arr_cents` incident is the regression case for this gate:
|
||||||
the integrated tree must fail if a wiki page references
|
the integrated tree must fail if a wiki page references
|
||||||
`mart_account_segments.total_contract_arr_cents` while the final semantic-layer
|
`mart_account_segments.total_contract_arr_cents` as an inline-code body token
|
||||||
source defines only `total_contract_arr`.
|
while the final semantic-layer source defines only `total_contract_arr`.
|
||||||
|
|
||||||
## Deterministic projection
|
## Deterministic projection
|
||||||
|
|
||||||
|
|
@ -337,7 +441,8 @@ The first test creates a fake or Metabase-like adapter with two work units
|
||||||
starting from the same base:
|
starting from the same base:
|
||||||
|
|
||||||
1. Work unit A writes a wiki page that references
|
1. Work unit A writes a wiki page that references
|
||||||
`mart_account_segments.total_contract_arr_cents`.
|
`mart_account_segments.total_contract_arr_cents` as an inline-code body
|
||||||
|
token.
|
||||||
2. Work unit B writes or overwrites the final semantic-layer source with only
|
2. Work unit B writes or overwrites the final semantic-layer source with only
|
||||||
`total_contract_arr`.
|
`total_contract_arr`.
|
||||||
3. Both work units pass their local gates in isolation.
|
3. Both work units pass their local gates in isolation.
|
||||||
|
|
@ -351,17 +456,23 @@ Additional tests cover:
|
||||||
- a deterministic projector change plus a work-unit wiki reference that becomes
|
- a deterministic projector change plus a work-unit wiki reference that becomes
|
||||||
stale after projection;
|
stale after projection;
|
||||||
- Notion-style direct wiki writes with invalid `sl_refs`; and
|
- Notion-style direct wiki writes with invalid `sl_refs`; and
|
||||||
- LookML-style `slDisallowed` work units that cannot write semantic-layer
|
- LookML-style `slDisallowed` work units where write tools are unavailable and
|
||||||
files.
|
integration rejects any diff that still touches `semantic-layer/**`.
|
||||||
|
|
||||||
## Rollout
|
## Rollout
|
||||||
|
|
||||||
The rollout must be incremental because the current runner is shared by all
|
The rollout must be incremental because the current runner is shared by all
|
||||||
adapters.
|
adapters.
|
||||||
|
|
||||||
1. Add the per-work-unit worktree executor behind an internal setting.
|
The rollout switch is runner-owned. During migration it may be a private
|
||||||
|
per-source allowlist, or an internal `IngestSettingsPort` map keyed by
|
||||||
|
`sourceKey`, but it must not become a `SourceAdapter` field or public connector
|
||||||
|
configuration knob.
|
||||||
|
|
||||||
|
1. Add the per-work-unit worktree executor behind that internal runner setting.
|
||||||
2. Add diff collection and deterministic integration in the existing runner.
|
2. Add diff collection and deterministic integration in the existing runner.
|
||||||
3. Add final global wiki and semantic-layer reference gates.
|
3. Add final global wiki and semantic-layer reference gates, including the wiki
|
||||||
|
body reference parser defined above.
|
||||||
4. Migrate Metabase to the new execution path first.
|
4. Migrate Metabase to the new execution path first.
|
||||||
5. Migrate Notion, LookML, Looker, dbt, and MetricFlow.
|
5. Migrate Notion, LookML, Looker, dbt, and MetricFlow.
|
||||||
6. Promote the new path to the default after the Metabase regression test and
|
6. Promote the new path to the default after the Metabase regression test and
|
||||||
|
|
|
||||||
Loading…
Add table
Add a link
Reference in a new issue