mirror of https://github.com/Kaelio/ktx.git (synced 2026-06-13 08:15:14 +02:00)

Refine adapter-owned ingest finalization design after adversarial review iteration 1

parent e64da5a85d
commit fd8d6a1134
1 changed files with 352 additions and 0 deletions

@@ -0,0 +1,352 @@
# Adapter-owned ingest finalization design

**Date:** 2026-05-18
**Author:** Andrey Avtomonov
**Status:** Design - pending implementation plan

## Background

The isolated-diff ingestion migration made KTX's shared bundle runner
responsible for one durable execution model: stage raw source data, run
source-planned work units in isolated child worktrees, integrate their diffs,
reconcile, run final gates, and squash the accepted integration tree back into
the project worktree.

That direction is correct, but the current code still has a runner-level
post-processing extension point. `IngestBundleRunnerDeps.postProcessors` maps a
source key to an arbitrary `IngestBundlePostProcessorPort`, and local runtime
wires `historic-sql` to `HistoricSqlProjectionPostProcessor`. That path can
write durable semantic-layer and wiki artifacts after work-unit integration and
reconciliation, outside the source adapter contract.

Historic SQL exposed why the extra path exists. Its table and pattern work units
emit typed evidence, then a deterministic projection step merges the evidence
into `_schema` usage and historic-SQL wiki pages. Some of that work is local to
one work unit, but other behavior is whole-run maintenance: marking stale table
usage, reusing existing pattern pages, and archiving old pattern pages. Those
aggregate decisions do not fit cleanly inside independent per-work-unit writes.

The design goal is to preserve legitimate adapter-owned deterministic
maintenance without keeping a generic runner-level escape hatch.

## Goals

This design tightens the isolated-diff architecture around a stable boundary:
the generic runner owns execution mechanics, and adapters own source semantics.

The design has these goals:

- Remove runner-level `postProcessors` as an alternate durable-write pipeline.
- Add a first-class `SourceAdapter.finalize?()` hook for deterministic
  post-work-unit source maintenance.
- Keep `finalize?()` constrained, observable, and subject to the same final
  validation gates as work-unit and reconciliation changes.
- Preserve historic-SQL aggregate projection behavior without treating it as a
  hidden fallback ingestion path.
- Keep public execution knobs out of the adapter API.

## Non-goals

This design does not rework source-specific chunking, fetch formats, wiki page
frontmatter, semantic-layer YAML, or raw source layouts. It does not replace
agent-authored work units with deterministic projectors. It also does not add a
public `executionMode`, `planningStrategy`, `conflictPolicy`, or source-key
allowlist.

Override ingest remains a special correction operation that reuses a prior raw
snapshot and forces reconciliation. It should be documented and tested as
override replay, not as a fallback pipeline. This design does not require
override ingest to run source work units.

## Locked design direction

The shared ingestion runner keeps one ordered pipeline for sources that can
write durable project artifacts.

```text
fetch raw
  -> adapter plans WorkUnit[]
  -> optional adapter project
  -> isolated WU diffs
  -> artifact-aware integration
  -> reconciliation
  -> optional adapter finalize
  -> runner wiki-SL-ref repair
  -> final target policy and artifact gates
  -> squash
```

The exact implementation may continue to call `chunk()` before `project()` so a
projector can consume `parseArtifacts`. The architectural invariant is that
`project()` runs in the integration worktree before child worktrees start, while
`finalize()` runs in the integration worktree after accepted work-unit and
reconciliation changes are present.

Adapters decide what source-specific work belongs in `project()`, work units,
or `finalize()`. The runner decides when those phases run, captures their git
effects, enforces target scope, runs gates, writes traces and reports, and
squashes the final tree.
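
As a rough illustration of the ordering invariant above, the following sketch derives the phase list the runner would walk for a given adapter. The `Phase` names, `AdapterHooks`, and `plannedPhases` are hypothetical and exist only for this example; only the relative order is taken from the pipeline diagram.

```ts
// Hypothetical phase names; only their relative order comes from the design.
type Phase =
  | 'fetch_raw'
  | 'plan_work_units'
  | 'project'
  | 'work_unit_diffs'
  | 'integration'
  | 'reconciliation'
  | 'finalization'
  | 'wiki_sl_ref_repair'
  | 'final_gates'
  | 'squash';

// Only the two optional deterministic hooks influence which phases appear.
interface AdapterHooks {
  project?: () => void;
  finalize?: () => void;
}

// Returns the phases the runner would execute, in order, for one adapter.
function plannedPhases(adapter: AdapterHooks): Phase[] {
  const phases: Phase[] = ['fetch_raw', 'plan_work_units'];
  // project() runs before any child worktree starts.
  if (adapter.project) phases.push('project');
  phases.push('work_unit_diffs', 'integration', 'reconciliation');
  // finalize() runs only after accepted work-unit and reconciliation changes.
  if (adapter.finalize) phases.push('finalization');
  // Repair, gates, and squash are runner mechanics and always run last.
  phases.push('wiki_sl_ref_repair', 'final_gates', 'squash');
  return phases;
}
```

The adapter only contributes the presence or absence of a hook; it has no way to reorder phases, which is the point of keeping execution knobs out of the adapter API.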

## Adapter API

The source adapter contract should make deterministic source phases explicit.

```ts
interface SourceAdapter {
  readonly source: string;
  readonly skillNames: string[];
  readonly reconcileSkillNames?: string[];
  readonly evidenceIndexing?: 'documents';
  readonly triageSupported?: boolean;

  getTriageSignals?(stagedDir: string, externalId: string): Promise<TriageSignals>;
  detect(stagedDir: string): Promise<boolean>;
  fetch?(pullConfig: unknown, stagedDir: string, ctx: FetchContext): Promise<void>;
  readFetchReport?(stagedDir: string): Promise<SourceFetchReport | null>;
  listTargetConnectionIds?(stagedDir: string): Promise<string[]>;
  chunk(stagedDir: string, diffSet?: DiffSet): Promise<ChunkResult>;
  clusterWorkUnits?(ctx: ClusterWorkUnitsContext): Promise<WorkUnit[]>;
  project?(ctx: DeterministicProjectionContext): Promise<ProjectionResult>;
  finalize?(ctx: DeterministicFinalizationContext): Promise<FinalizationResult>;
  describeScope?(stagedDir: string): Promise<ScopeDescriptor>;
  onPullSucceeded?(ctx: PullSucceededContext): Promise<void>;
}
```

`finalize?()` is not a compatibility wrapper for old post-processors. It is a
source-adapter method with a fixed location in the runner lifecycle.

```ts
interface DeterministicFinalizationContext {
  connectionId: string;
  sourceKey: string;
  syncId: string;
  jobId: string;
  runId: string;
  stagedDir: string;
  workdir: string;
  parseArtifacts?: unknown;
  stageIndex: StageIndex;
  workUnitOutcomes: WorkUnitOutcome[];
  reconciliationActions: MemoryAction[];
  overrideReplay?: FinalizationOverrideReplay;
  semanticLayerService: SemanticLayerService;
}

interface FinalizationResult {
  warnings: string[];
  errors: string[];
  touchedSources: TouchedSlSource[];
  changedWikiPageKeys: string[];
  actions?: MemoryAction[];
  result?: unknown;
}

interface FinalizationOverrideReplay {
  priorJobId: string;
  priorRunId: string;
  priorSyncId: string;
}
```

The implementation plan can adjust exact type names to match the existing
module layout, but the contract must preserve these semantics:

- `finalize?()` is deterministic TypeScript code, not an agent loop.
- It runs only in the ingestion integration worktree.
- It may write ordinary durable project files.
- It must report touched semantic-layer sources and wiki page keys.
- `stageIndex` is the canonical runner index for accepted work-unit actions,
  touched sources, and reconciliation records visible to the current run. In an
  override replay it may be rebuilt from the prior report.
- `workUnitOutcomes` contains only work units executed in the current run. It
  is empty when override replay skips source work units.
- `reconciliationActions` contains only accepted reconciliation writes emitted
  through the reconciliation tool session in the current run. These actions have
  already mutated the integration worktree.
- `actions` in `FinalizationResult` are descriptive records for finalization
  writes that the adapter already performed. The runner must not re-apply them.
  When finalization actions are intended to create provenance rows, they must
  carry valid current-snapshot or eviction `rawPaths`.
- It cannot mutate the main project worktree directly.
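
To make the "descriptive records" rule concrete, here is a minimal sketch of how an adapter might assemble a `FinalizationResult` from writes it has already performed. The shapes of `TouchedSlSource` and `MemoryAction` are not given in this design, so the versions below are assumptions, as are `PerformedWrite` and `describeWrites`.

```ts
// Assumed minimal shapes; the real KTX types are not shown in this design.
interface TouchedSlSource {
  connectionId: string;
  sourceName: string;
}

interface MemoryAction {
  type: 'update' | 'archive';
  artifactKind: 'sl_source' | 'wiki_page';
  artifactKey: string;
  rawPaths: string[];
}

interface FinalizationResult {
  warnings: string[];
  errors: string[];
  touchedSources: TouchedSlSource[];
  changedWikiPageKeys: string[];
  actions?: MemoryAction[];
}

// One write the adapter has ALREADY made in the integration worktree.
type PerformedWrite =
  | { kind: 'sl_source'; connectionId: string; sourceName: string; rawPaths: string[] }
  | { kind: 'wiki_page'; pageKey: string; archived: boolean; rawPaths: string[] };

// Builds the report of finished writes; it performs no writes itself.
function describeWrites(writes: PerformedWrite[]): FinalizationResult {
  const touchedSources: TouchedSlSource[] = [];
  const changedWikiPageKeys: string[] = [];
  const actions: MemoryAction[] = [];
  for (const w of writes) {
    if (w.kind === 'sl_source') {
      touchedSources.push({ connectionId: w.connectionId, sourceName: w.sourceName });
      actions.push({
        type: 'update',
        artifactKind: 'sl_source',
        artifactKey: `${w.connectionId}/${w.sourceName}`,
        rawPaths: w.rawPaths,
      });
    } else {
      changedWikiPageKeys.push(w.pageKey);
      actions.push({
        type: w.archived ? 'archive' : 'update',
        artifactKind: 'wiki_page',
        artifactKey: w.pageKey,
        rawPaths: w.rawPaths,
      });
    }
  }
  return { warnings: [], errors: [], touchedSources, changedWikiPageKeys, actions };
}
```

Because the result is derived from completed writes, a runner that ignored `actions` entirely would still see the same worktree diff; the records exist for reporting and provenance only.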

The existing adapter API fields unrelated to deterministic projection and
finalization remain part of the contract. Adding `finalize?()` must not remove
triage or evidence-indexing support.

## Override replay

Override ingest remains a replay of a prior raw snapshot with forced
reconciliation. It does not execute source work units, so finalization must not
silently assume fresh work-unit evidence exists.

The runner should still enter the finalization phase for adapters that
implement `finalize?()`, but it must pass explicit override metadata. In that
mode, `workUnitOutcomes` is empty, `parseArtifacts` is absent unless the runner
created fresh parse artifacts in the current run, `stageIndex` comes from the
prior report, and `reconciliationActions` contains only new override
reconciliation actions.

Adapters must treat missing current-run deterministic inputs as a no-op, not as
negative evidence. For historic SQL, override replay must not mark tables stale,
mark pattern pages stale, or archive pattern pages from an empty current-run
evidence directory. Any override-safe finalization must be derived from the
materialized raw snapshot or explicit prior-report data, not from the absence of
fresh work-unit evidence.
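
One way to encode "missing input is a no-op, not negative evidence" is to keep absent evidence and empty evidence as distinct states in the type. The sketch below assumes a hypothetical `CleanupInputs` slice of the finalization context and a hypothetical `decideStaleTables` helper; neither name comes from the KTX codebase.

```ts
interface FinalizationOverrideReplay {
  priorJobId: string;
  priorRunId: string;
  priorSyncId: string;
}

// Hypothetical slice of what a historic-SQL-like finalize() would look at.
interface CleanupInputs {
  overrideReplay?: FinalizationOverrideReplay;
  // Tables observed by current-run work units. `undefined` means no
  // current-run evidence exists, which is NOT the same as an empty set.
  currentRunTables?: ReadonlySet<string>;
  // Tables that already carry usage metadata in the project.
  knownTables: readonly string[];
}

type StaleDecision =
  | { kind: 'skipped'; reason: string }
  | { kind: 'computed'; staleTables: string[] };

function decideStaleTables(inputs: CleanupInputs): StaleDecision {
  const seen = inputs.currentRunTables;
  if (seen === undefined) {
    // Absence of evidence: do nothing rather than treat every table as stale.
    const reason = inputs.overrideReplay
      ? `override replay of run ${inputs.overrideReplay.priorRunId}: no current-run evidence`
      : 'no current-run evidence';
    return { kind: 'skipped', reason };
  }
  // Evidence exists, so a table missing from it is stale in this snapshot.
  return { kind: 'computed', staleTables: inputs.knownTables.filter((t) => !seen.has(t)) };
}
```

With this shape, an override replay that passes no evidence yields `skipped`, while a normal run that observed only `orders` would mark every other known table stale.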

## Runner responsibilities

The runner owns all reusable mechanics around `finalize?()`.

After reconciliation completes, the runner calls `adapter.finalize?()` if it
exists. The runner then commits any reported or discovered finalization changes
in the integration worktree, records the commit SHA and touched paths in the
run trace/report, includes finalization actions in saved-memory counts, and
runs wiki-SL-ref repair before final target-policy and artifact gates.

`wiki_sl_ref_repair` remains a runner mechanic, not an adapter method. It runs
after finalization and before final gates, and it uses the normal target
connection set plus `FinalizationResult.touchedSources` to decide which
semantic-layer references are visible. Its writes are part of the same
integration worktree diff as finalization/reconciliation, so target-policy
checks, final artifact gates, reports, traces, and squash behavior cover those
writes before changes reach the main project worktree.

The runner must treat finalization like deterministic projection and
reconciliation, not like a free-form source-key plug-in. It must enforce the
same target-connection policy used for work-unit and reconciliation changes.
If finalization writes an unauthorized semantic-layer target, references a
missing semantic-layer entity, or returns errors, the run fails before changes
reach the main project worktree.
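
The target-policy part of that rule could look roughly like the sketch below. It assumes `TouchedSlSource` carries a `connectionId` and that the runner already holds the set of allowed target connections; `finalizationFailures` is a hypothetical helper, and the missing-entity check is left out because it depends on the semantic-layer service.

```ts
// Assumed shape; the real TouchedSlSource type is not shown in this design.
interface TouchedSlSource {
  connectionId: string;
  sourceName: string;
}

interface FinalizationOutput {
  errors: string[];
  touchedSources: TouchedSlSource[];
}

// Collects every reason this finalization result must fail the run.
// An empty array means the result may proceed to repair and final gates.
function finalizationFailures(
  output: FinalizationOutput,
  allowedConnectionIds: ReadonlySet<string>,
): string[] {
  // Errors returned by the adapter always fail the run.
  const failures = [...output.errors];
  for (const source of output.touchedSources) {
    // Same target-connection policy as work-unit and reconciliation changes.
    if (!allowedConnectionIds.has(source.connectionId)) {
      failures.push(
        `unauthorized semantic-layer target: ${source.connectionId}/${source.sourceName}`,
      );
    }
  }
  return failures;
}
```

A real runner would presumably also compare the reported sources against the paths actually changed in the worktree, since the design mentions "reported or discovered" changes.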

The runner should expose one trace phase named `finalization`. It should not
keep a `post_processor` stage, `IngestBundlePostProcessorPort`,
`deps.postProcessors`, or report fields that imply a parallel post-processor
pipeline.

## Adapter application

Each adapter continues to use the same generic runner mechanics, while keeping
source-specific choices inside the adapter.

- `metabase` fetches cards and dashboards, computes scope, plans
  card/dashboard work units, and usually does not need `project()` or
  `finalize()`.
- `notion` fetches pages, extracts triage signals, clusters page work units,
  and usually does not need deterministic finalization.
- `dbt` fetches the repository, parses dbt project metadata, plans model work
  units, and may later add `project()` if dbt YAML import becomes deterministic.
- `lookml` fetches LookML, produces validation artifacts, plans model and
  explore work units, and may later add `project()` for deterministic LookML to
  semantic-layer import.
- `looker` fetches runtime bundles, fetch reports, target connections, and
  triage signals. It continues to rely on work-unit diffs and shared gates.
- `metricflow` is the current strong `project()` example. It imports
  authoritative semantic models before child worktrees start, then lets any
  work units observe those projected files.
- `live-database` can remain work-unit based, but database schema introspection
  is a good future `project()` candidate because the schema is authoritative
  structured metadata.
- `historic-sql` should move current post-processor behavior into the adapter.
  Local table-usage and pattern-page writes may move into work-unit tools where
  they are genuinely per-unit. Whole-run maintenance such as stale table usage,
  pattern-page reuse, and stale/archive page decisions belongs in
  `HistoricSqlSourceAdapter.finalize()`.
- `fake` remains a test adapter and does not need deterministic phases.

## Historic-SQL migration

Historic SQL should stop using evidence-only tool output plus runner-level
post-processing as its durable projection path.

The preferred migration is:

1. Keep historic-SQL work units responsible for source-shaped analysis.
2. Use source-specific tools for per-unit durable writes when the output is
   local to that unit, such as a table's usage metadata or one pattern page.
3. Move whole-run deterministic cleanup into
   `HistoricSqlSourceAdapter.finalize()`.
4. Delete `HistoricSqlProjectionPostProcessor`, `IngestBundlePostProcessorPort`,
   `deps.postProcessors`, and `post_processor` memory-flow/report stages.

If the implementation keeps typed evidence as an internal handoff between
historic-SQL work units and `finalize()`, that evidence must be framed as
source-specific input to the adapter's deterministic finalization, not as a
generic runner post-processing mechanism. The evidence files must not become a
public compatibility surface.

Historic-SQL finalization must distinguish "no current-run evidence exists"
from "the current snapshot proves this artifact is stale." Whole-run cleanup
such as stale table usage, pattern-page staleness, and archive decisions can
run only when finalization has current-run historic-SQL evidence or an explicit
override-safe source of equivalent facts.

## Reports and observability

Reports should describe first-class pipeline phases, not historical extension
points. The isolated-diff summary should include finalization metadata when the
adapter implements `finalize?()`: whether it ran, finalization commit SHA,
touched paths, touched semantic-layer sources, changed wiki page keys,
warnings, descriptive finalization actions, and source-specific result payload.

Saved-memory counts should come from work-unit, reconciliation, and
finalization memory actions plus touched artifact reporting. Finalization
actions are reporting/provenance records for writes that already happened in
the integration worktree; they are not a second write channel. There should be
no special `postProcessorSavedMemoryCounts` or `postProcessor` report body.
Memory-flow phases should use `finalization` instead of `post_processor`.
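
A single counting path over all three phases might look like the sketch below. `RecordedAction`, `SavedMemoryCounts`, and `countSavedMemory` are hypothetical names, and the sketch counts memory actions only; the "touched artifact reporting" input mentioned above is not modeled.

```ts
// The three phases that may contribute memory actions in this design.
type MemoryPhase = 'work_unit' | 'reconciliation' | 'finalization';

// Assumed minimal record: which phase emitted the action, and for what.
interface RecordedAction {
  phase: MemoryPhase;
  artifactKey: string;
}

interface SavedMemoryCounts {
  total: number;
  byPhase: Record<MemoryPhase, number>;
}

// One counter for every phase, so no phase needs its own report field.
function countSavedMemory(actions: RecordedAction[]): SavedMemoryCounts {
  const byPhase: Record<MemoryPhase, number> = {
    work_unit: 0,
    reconciliation: 0,
    finalization: 0,
  };
  for (const action of actions) {
    byPhase[action.phase] += 1;
  }
  return { total: actions.length, byPhase };
}
```

Since finalization is just another `MemoryPhase` value, there is no structural slot for a separate post-processor count.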

The runner owns provenance for finalization. Adapters return touched artifacts
and optional descriptive actions, but they do not call the provenance port.
When finalization actions include valid `rawPaths`, the runner folds them into
the normal provenance plan using the current `sourceKey`, `syncId`, raw content
hashes, artifact kind, artifact key, target connection, and action type. The
finalization phase and commit SHA belong in trace/report metadata; they should
not be fabricated inside adapter-written files.
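
The folding step can be sketched as follows. This is an assumption-heavy illustration: `FinalizationAction`, `ProvenanceRow`, and `planFinalizationProvenance` are hypothetical, and "valid" is simplified to mean that `rawPaths` is non-empty and every path has a known content hash in the current snapshot. Eviction paths are not modeled.

```ts
// Assumed action shape carrying the fields named in the paragraph above.
interface FinalizationAction {
  type: string;
  artifactKind: string;
  artifactKey: string;
  targetConnectionId: string;
  rawPaths: string[];
}

interface ProvenanceRow {
  sourceKey: string;
  syncId: string;
  rawPath: string;
  rawContentHash: string;
  artifactKind: string;
  artifactKey: string;
  targetConnectionId: string;
  actionType: string;
}

// Runner-side planning: adapters never call the provenance port themselves.
function planFinalizationProvenance(
  actions: FinalizationAction[],
  run: { sourceKey: string; syncId: string },
  rawHashes: ReadonlyMap<string, string>,
): ProvenanceRow[] {
  const rows: ProvenanceRow[] = [];
  for (const action of actions) {
    const resolved: Array<{ rawPath: string; hash: string }> = [];
    let valid = action.rawPaths.length > 0;
    for (const rawPath of action.rawPaths) {
      const hash = rawHashes.get(rawPath);
      if (hash === undefined) {
        valid = false; // unknown raw path: skip the whole action
        break;
      }
      resolved.push({ rawPath, hash });
    }
    if (!valid) continue; // still reportable, but creates no provenance rows
    for (const { rawPath, hash } of resolved) {
      rows.push({
        sourceKey: run.sourceKey,
        syncId: run.syncId,
        rawPath,
        rawContentHash: hash,
        artifactKind: action.artifactKind,
        artifactKey: action.artifactKey,
        targetConnectionId: action.targetConnectionId,
        actionType: action.type,
      });
    }
  }
  return rows;
}
```

Skipping an action with invalid attribution only affects provenance rows; whether it should also raise a warning is left to the implementation plan.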

Traces must make finalization useful for postmortems. At minimum, record
`finalization_started`, `finalization_committed`, `finalization_skipped`, and
`finalization_failed` events with source key, touched paths, warnings, and
error summaries.
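
The four event names above map naturally onto a small typed record. In the sketch below the event names come from this design, while `FinalizationOutcome`, the field names, and `traceEventsFor` are assumptions made for illustration.

```ts
type FinalizationEventName =
  | 'finalization_started'
  | 'finalization_committed'
  | 'finalization_skipped'
  | 'finalization_failed';

interface FinalizationTraceEvent {
  event: FinalizationEventName;
  sourceKey: string;
  touchedPaths: string[];
  warnings: string[];
  errorSummaries: string[];
  commitSha?: string;
}

// Hypothetical summary of how the finalization phase ended.
type FinalizationOutcome =
  | { kind: 'no_hook' }
  | { kind: 'ok'; commitSha: string; touchedPaths: string[]; warnings: string[] }
  | { kind: 'error'; errors: string[]; warnings: string[] };

// Produces the trace events the runner would record for one outcome.
function traceEventsFor(sourceKey: string, outcome: FinalizationOutcome): FinalizationTraceEvent[] {
  const base = {
    sourceKey,
    touchedPaths: [] as string[],
    warnings: [] as string[],
    errorSummaries: [] as string[],
  };
  if (outcome.kind === 'no_hook') {
    // The adapter has no finalize(): record the skip so it is visible later.
    return [{ ...base, event: 'finalization_skipped' }];
  }
  const started: FinalizationTraceEvent = { ...base, event: 'finalization_started' };
  if (outcome.kind === 'ok') {
    return [
      started,
      {
        ...base,
        event: 'finalization_committed',
        commitSha: outcome.commitSha,
        touchedPaths: outcome.touchedPaths,
        warnings: outcome.warnings,
      },
    ];
  }
  return [
    started,
    { ...base, event: 'finalization_failed', warnings: outcome.warnings, errorSummaries: outcome.errors },
  ];
}
```

Every outcome produces at least one event, so a postmortem can always tell whether finalization was skipped, committed, or failed.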

## Failure handling

Finalization failures are ingestion failures. If `finalize?()` returns errors,
throws, writes unauthorized targets, or causes final gates to fail, the runner
marks the run failed and leaves the main project worktree unchanged.
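
The failure modes listed above can all collapse into one failed status before any squash happens. The following sketch assumes a hypothetical `runFinalizationPhase` wrapper and a `finalGates` callback that returns failure messages; unauthorized-target detection is treated as one of those gate failures here for brevity.

```ts
interface FinalizationResult {
  errors: string[];
  warnings: string[];
}

type PhaseStatus = { status: 'passed' } | { status: 'failed'; reasons: string[] };

// Runs the optional finalize hook, then the final gates. The caller squashes
// into the main project worktree only when the status is 'passed'.
async function runFinalizationPhase(
  finalize: (() => Promise<FinalizationResult>) | undefined,
  finalGates: () => Promise<string[]>,
): Promise<PhaseStatus> {
  if (finalize) {
    let result: FinalizationResult;
    try {
      result = await finalize();
    } catch (err) {
      // A thrown error is an ingestion failure, not something to swallow.
      const message = err instanceof Error ? err.message : String(err);
      return { status: 'failed', reasons: [`finalize threw: ${message}`] };
    }
    if (result.errors.length > 0) {
      return { status: 'failed', reasons: result.errors };
    }
  }
  // Gates run after finalization because finalization writes durable artifacts.
  const gateFailures = await finalGates();
  if (gateFailures.length > 0) {
    return { status: 'failed', reasons: gateFailures };
  }
  return { status: 'passed' };
}
```

Because the wrapper only returns a status, leaving the main worktree unchanged is simply the absence of a squash call on the failed path.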

Finalization should run after reconciliation because it may need to inspect the
accepted work-unit and reconciliation result. Final gates should run after
finalization because finalization writes durable project artifacts.

Finalization must not be used to repair arbitrary integration conflicts or
rerun agent work. Conflict repair remains part of artifact-aware integration and
reconciliation.

## Acceptance criteria

The implementation is complete when these conditions are true:

- No production runtime wiring references `deps.postProcessors`.
- `IngestBundlePostProcessorPort` and `HistoricSqlProjectionPostProcessor` are
  removed from source exports and package export tests.
- `SourceAdapter.finalize?()` exists with typed context and result objects.
- The runner invokes `finalize?()` after reconciliation and before final gates.
- Finalization changes are committed in the integration worktree and included
  in target-policy checks, final gates, reports, traces, and provenance inputs.
- Override replay passes explicit override metadata to finalization, leaves
  `workUnitOutcomes` empty when work units are skipped, and proves historic-SQL
  finalization does not stale or archive artifacts from missing current-run
  evidence.
- `wiki_sl_ref_repair` remains a runner-owned step after finalization and
  before final gates, consumes finalization touched sources, and has its writes
  covered by target-policy checks and final gates.
- Finalization `actions` are not re-applied by the runner; they are included
  only in reporting, saved-memory counts, and provenance planning when their
  raw-path attribution is valid.
- Historic SQL uses adapter-owned finalization for whole-run projection
  maintenance.
- Tests cover a successful finalization, a finalization failure, unauthorized
  finalization target rejection, override replay finalization behavior,
  wiki-SL-ref repair placement, and historic-SQL projection behavior without
  runner-level post-processors.