Refine adapter-owned ingest finalization design after adversarial review iteration 1

Andrey Avtomonov 2026-05-18 15:11:20 +02:00
parent e64da5a85d
commit fd8d6a1134


@ -0,0 +1,352 @@
# Adapter-owned ingest finalization design
**Date:** 2026-05-18
**Author:** Andrey Avtomonov
**Status:** Design - pending implementation plan
## Background
The isolated-diff ingestion migration made KTX's shared bundle runner
responsible for one durable execution model: stage raw source data, run
source-planned work units in isolated child worktrees, integrate their diffs,
reconcile, run final gates, and squash the accepted integration tree back into
the project worktree.
That direction is correct, but the current code still has a runner-level
post-processing extension point. `IngestBundleRunnerDeps.postProcessors` maps a
source key to an arbitrary `IngestBundlePostProcessorPort`, and local runtime
wires `historic-sql` to `HistoricSqlProjectionPostProcessor`. That path can
write durable semantic-layer and wiki artifacts after work-unit integration and
reconciliation, outside the source adapter contract.
Historic SQL exposed why the extra path exists. Its table and pattern work units
emit typed evidence, then a deterministic projection step merges the evidence
into `_schema` usage and historic-SQL wiki pages. Some of that work is local to
one work unit, but other behavior is whole-run maintenance: marking stale table
usage, reusing existing pattern pages, and archiving old pattern pages. Those
aggregate decisions do not fit cleanly inside independent per-work-unit writes.
The design goal is to preserve legitimate adapter-owned deterministic
maintenance without keeping a generic runner-level escape hatch.
## Goals
This design tightens the isolated-diff architecture around a stable boundary:
the generic runner owns execution mechanics, and adapters own source semantics.
The design has these goals:
- Remove runner-level `postProcessors` as an alternate durable-write pipeline.
- Add a first-class `SourceAdapter.finalize?()` hook for deterministic
post-work-unit source maintenance.
- Keep `finalize?()` constrained, observable, and subject to the same final
validation gates as work-unit and reconciliation changes.
- Preserve historic-SQL aggregate projection behavior without treating it as a
hidden fallback ingestion path.
- Keep public execution knobs out of the adapter API.
## Non-goals
This design does not rework source-specific chunking, fetch formats, wiki page
frontmatter, semantic-layer YAML, or raw source layouts. It does not replace
agent-authored work units with deterministic projectors. It also does not add a
public `executionMode`, `planningStrategy`, `conflictPolicy`, or source-key
allowlist.
Override ingest remains a special correction operation that reuses a prior raw
snapshot and forces reconciliation. It should be documented and tested as
override replay, not as a fallback pipeline. This design does not require
override ingest to run source work units.
## Locked design direction
The shared ingestion runner keeps one ordered pipeline for sources that can
write durable project artifacts.
```text
fetch raw
-> adapter plans WorkUnit[]
-> optional adapter project
-> isolated WU diffs
-> artifact-aware integration
-> reconciliation
-> optional adapter finalize
-> runner wiki-SL-ref repair
-> final target policy and artifact gates
-> squash
```
The exact implementation may continue to call `chunk()` before `project()` so a
projector can consume `parseArtifacts`. The architectural invariant is that
`project()` runs in the integration worktree before child worktrees start, while
`finalize()` runs in the integration worktree after accepted work-unit and
reconciliation changes are present.
Adapters decide what source-specific work belongs in `project()`, work units,
or `finalize()`. The runner decides when those phases run, captures their git
effects, enforces target scope, runs gates, writes traces and reports, and
squashes the final tree.
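
The ordering invariant above can be sketched as a small phase-selection helper. This is only an illustration: the phase identifiers other than `finalization` and `wiki_sl_ref_repair`, the `OptionalAdapterHooks` type, and `phasesFor` are hypothetical names, not part of the existing runner.

```ts
// Hypothetical phase identifiers mirroring the pipeline diagram. Only
// `finalization` and `wiki_sl_ref_repair` are names used by this design.
const PIPELINE_PHASES = [
  'fetch_raw',
  'plan_work_units',
  'project',
  'work_unit_diffs',
  'integration',
  'reconciliation',
  'finalization',
  'wiki_sl_ref_repair',
  'final_gates',
  'squash',
] as const;

type PipelinePhase = (typeof PIPELINE_PHASES)[number];

// Only the two optional adapter hooks matter for phase selection.
interface OptionalAdapterHooks {
  project?: () => void;
  finalize?: () => void;
}

// Returns the phases the runner would execute, in order, for one adapter.
function phasesFor(adapter: OptionalAdapterHooks): PipelinePhase[] {
  return PIPELINE_PHASES.filter((phase) => {
    if (phase === 'project') return adapter.project !== undefined;
    if (phase === 'finalization') return adapter.finalize !== undefined;
    return true; // runner-owned phases always run
  });
}
```

The adapter only controls whether the two optional phases exist. Their position relative to reconciliation, repair, gates, and squash is fixed by the runner.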
## Adapter API
The source adapter contract should make deterministic source phases explicit.
```ts
interface SourceAdapter {
  readonly source: string;
  readonly skillNames: string[];
  readonly reconcileSkillNames?: string[];
  readonly evidenceIndexing?: 'documents';
  readonly triageSupported?: boolean;
  getTriageSignals?(stagedDir: string, externalId: string): Promise<TriageSignals>;
  detect(stagedDir: string): Promise<boolean>;
  fetch?(pullConfig: unknown, stagedDir: string, ctx: FetchContext): Promise<void>;
  readFetchReport?(stagedDir: string): Promise<SourceFetchReport | null>;
  listTargetConnectionIds?(stagedDir: string): Promise<string[]>;
  chunk(stagedDir: string, diffSet?: DiffSet): Promise<ChunkResult>;
  clusterWorkUnits?(ctx: ClusterWorkUnitsContext): Promise<WorkUnit[]>;
  project?(ctx: DeterministicProjectionContext): Promise<ProjectionResult>;
  finalize?(ctx: DeterministicFinalizationContext): Promise<FinalizationResult>;
  describeScope?(stagedDir: string): Promise<ScopeDescriptor>;
  onPullSucceeded?(ctx: PullSucceededContext): Promise<void>;
}
```
`finalize?()` is not a compatibility wrapper for old post-processors. It is a
source-adapter method with a fixed location in the runner lifecycle.
```ts
interface DeterministicFinalizationContext {
  connectionId: string;
  sourceKey: string;
  syncId: string;
  jobId: string;
  runId: string;
  stagedDir: string;
  workdir: string;
  parseArtifacts?: unknown;
  stageIndex: StageIndex;
  workUnitOutcomes: WorkUnitOutcome[];
  reconciliationActions: MemoryAction[];
  overrideReplay?: FinalizationOverrideReplay;
  semanticLayerService: SemanticLayerService;
}

interface FinalizationResult {
  warnings: string[];
  errors: string[];
  touchedSources: TouchedSlSource[];
  changedWikiPageKeys: string[];
  actions?: MemoryAction[];
  result?: unknown;
}

interface FinalizationOverrideReplay {
  priorJobId: string;
  priorRunId: string;
  priorSyncId: string;
}
```
The implementation plan can adjust exact type names to match the existing
module layout, but the contract must preserve these semantics:
- `finalize?()` is deterministic TypeScript code, not an agent loop.
- It runs only in the ingestion integration worktree.
- It may write ordinary durable project files.
- It must report touched semantic-layer sources and wiki page keys.
- `stageIndex` is the canonical runner index for accepted work-unit actions,
touched sources, and reconciliation records visible to the current run. In an
override replay it may be rebuilt from the prior report.
- `workUnitOutcomes` contains only work units executed in the current run. It
is empty when override replay skips source work units.
- `reconciliationActions` contains only accepted reconciliation writes emitted
through the reconciliation tool session in the current run. These actions have
already mutated the integration worktree.
- `actions` in `FinalizationResult` are descriptive records for finalization
writes that the adapter already performed. The runner must not re-apply them.
When finalization actions are intended to create provenance rows, they must
carry valid current-snapshot or eviction `rawPaths`.
- `finalize?()` cannot mutate the main project worktree directly.
The existing adapter API fields unrelated to deterministic projection and
finalization remain part of the contract. Adding `finalize?()` must not remove
triage or evidence-indexing support.
## Override replay
Override ingest remains a replay of a prior raw snapshot with forced
reconciliation. It does not execute source work units, so finalization must not
silently assume fresh work-unit evidence exists.
The runner should still enter the finalization phase for adapters that
implement `finalize?()`, but it must pass explicit override metadata. In that
mode, `workUnitOutcomes` is empty, `parseArtifacts` is absent unless the runner
created fresh parse artifacts in the current run, `stageIndex` comes from the
prior report, and `reconciliationActions` contains only new override
reconciliation actions.
Adapters must treat missing current-run deterministic inputs as a no-op, not as
negative evidence. For historic SQL, override replay must not mark tables stale,
mark pattern pages stale, or archive pattern pages from an empty current-run
evidence directory. Any override-safe finalization must be derived from the
materialized raw snapshot or explicit prior-report data, not from the absence of
fresh work-unit evidence.
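
As a rough sketch of the override-replay inputs described above, the runner could assemble the finalization inputs as follows. `CurrentRunState`, `PriorReport`, `FinalizationInputs`, and `finalizationInputs` are hypothetical names, and plain string arrays stand in for `StageIndex`, `WorkUnitOutcome`, and `MemoryAction`, whose real shapes this document does not define.

```ts
interface FinalizationOverrideReplay {
  priorJobId: string;
  priorRunId: string;
  priorSyncId: string;
}

// Simplified stand-ins: string arrays replace StageIndex, WorkUnitOutcome[],
// and MemoryAction[] (assumption made for illustration only).
interface CurrentRunState {
  stageIndex: string[];
  workUnitOutcomes: string[];
  reconciliationActions: string[];
  parseArtifacts?: unknown;
}

// Hypothetical view of the data a prior run report would need to provide.
interface PriorReport {
  jobId: string;
  runId: string;
  syncId: string;
  stageIndex: string[];
}

interface FinalizationInputs extends CurrentRunState {
  overrideReplay?: FinalizationOverrideReplay;
}

function finalizationInputs(
  current: CurrentRunState,
  prior?: PriorReport,
): FinalizationInputs {
  if (prior === undefined) {
    return { ...current }; // normal run: pass current-run state through
  }
  return {
    stageIndex: prior.stageIndex, // rebuilt from the prior report
    workUnitOutcomes: [], // override replay skips source work units
    reconciliationActions: current.reconciliationActions, // new override actions only
    parseArtifacts: current.parseArtifacts, // absent unless created in this run
    overrideReplay: {
      priorJobId: prior.jobId,
      priorRunId: prior.runId,
      priorSyncId: prior.syncId,
    },
  };
}
```

The presence of `overrideReplay` is what tells the adapter that an empty `workUnitOutcomes` means "skipped", not "nothing found".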
## Runner responsibilities
The runner owns all reusable mechanics around `finalize?()`.
After reconciliation completes, the runner calls `adapter.finalize?()` if it
exists. The runner then commits any reported or discovered finalization changes
in the integration worktree, records the commit SHA and touched paths in the
run trace/report, includes finalization actions in saved-memory counts, and
runs wiki-SL-ref repair before final target-policy and artifact gates.
`wiki_sl_ref_repair` remains a runner mechanic, not an adapter method. It runs
after finalization and before final gates, and it uses the normal target
connection set plus `FinalizationResult.touchedSources` to decide which
semantic-layer references are visible. Its writes are part of the same
integration worktree diff as finalization/reconciliation, so target-policy
checks, final artifact gates, reports, traces, and squash behavior cover those
writes before changes reach the main project worktree.
The runner must treat finalization like deterministic projection and
reconciliation, not like a free-form source-key plug-in. It must enforce the
same target-connection policy used for work-unit and reconciliation changes.
If finalization writes an unauthorized semantic-layer target, references a
missing semantic-layer entity, or returns errors, the run fails before changes
reach the main project worktree.
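
A minimal sketch of the target-connection check, assuming a `TouchedSlSource` that carries a `connectionId` and a `sourceName`. The design names the type but not its fields, so both fields, `FinalizationOutcome`, and `finalizationViolations` are illustrative assumptions. The missing-entity check is left out here because it depends on the semantic-layer service.

```ts
// Hypothetical shape: the design names TouchedSlSource but not its fields.
interface TouchedSlSource {
  connectionId: string;
  sourceName: string;
}

// The subset of FinalizationResult this check needs.
interface FinalizationOutcome {
  errors: string[];
  touchedSources: TouchedSlSource[];
}

// Collects every reason the run must fail before squash. An empty array
// means finalization passed this particular check.
function finalizationViolations(
  outcome: FinalizationOutcome,
  allowedConnectionIds: ReadonlySet<string>,
): string[] {
  const violations = [...outcome.errors];
  for (const source of outcome.touchedSources) {
    if (!allowedConnectionIds.has(source.connectionId)) {
      violations.push(
        `unauthorized semantic-layer target: ${source.connectionId}/${source.sourceName}`,
      );
    }
  }
  return violations;
}
```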
The runner should expose one trace phase named `finalization`. It should not
keep a `post_processor` stage, `IngestBundlePostProcessorPort`,
`deps.postProcessors`, or report fields that imply a parallel post-processor
pipeline.
## Adapter application
Each adapter continues to use the same generic runner mechanics, while keeping
source-specific choices inside the adapter.
- `metabase` fetches cards and dashboards, computes scope, plans
card/dashboard work units, and usually does not need `project()` or
`finalize()`.
- `notion` fetches pages, extracts triage signals, clusters page work units,
and usually does not need deterministic finalization.
- `dbt` fetches the repository, parses dbt project metadata, plans model work
units, and may later add `project()` if dbt YAML import becomes deterministic.
- `lookml` fetches LookML, produces validation artifacts, plans model and
explore work units, and may later add `project()` for deterministic
LookML-to-semantic-layer import.
- `looker` fetches runtime bundles, fetch reports, target connections, and
triage signals. It continues to rely on work-unit diffs and shared gates.
- `metricflow` is the current strong `project()` example. It imports
authoritative semantic models before child worktrees start, then lets any
work units observe those projected files.
- `live-database` can remain work-unit based, but database schema introspection
is a good future `project()` candidate because the schema is authoritative
structured metadata.
- `historic-sql` should move current post-processor behavior into the adapter.
Local table-usage and pattern-page writes may move into work-unit tools where
they are genuinely per-unit. Whole-run maintenance such as stale table usage,
pattern-page reuse, and stale/archive page decisions belongs in
`HistoricSqlSourceAdapter.finalize()`.
- `fake` remains a test adapter and does not need deterministic phases.
## Historic-SQL migration
Historic SQL should stop using evidence-only tool output plus runner-level
post-processing as its durable projection path.
The preferred migration is:
1. Keep historic-SQL work units responsible for source-shaped analysis.
2. Use source-specific tools for per-unit durable writes when the output is
local to that unit, such as a table's usage metadata or one pattern page.
3. Move whole-run deterministic cleanup into
`HistoricSqlSourceAdapter.finalize()`.
4. Delete `HistoricSqlProjectionPostProcessor`, `IngestBundlePostProcessorPort`,
`deps.postProcessors`, and `post_processor` memory-flow/report stages.
If the implementation keeps typed evidence as an internal handoff between
historic-SQL work units and `finalize()`, that evidence must be framed as
source-specific input to the adapter's deterministic finalization, not as a
generic runner post-processing mechanism. The evidence files must not become a
public compatibility surface.
Historic-SQL finalization must distinguish "no current-run evidence exists"
from "the current snapshot proves this artifact is stale." Whole-run cleanup
such as stale table usage, pattern-page staleness, and archive decisions can
run only when finalization has current-run historic-SQL evidence or an explicit
override-safe source of equivalent facts.
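
One way to keep that distinction explicit is to model missing evidence as `undefined` and observed-but-empty evidence as an empty array. The sketch below is illustrative: `tablesToMarkStale` and its parameters are hypothetical and do not describe the current historic-SQL code.

```ts
// `undefined` means no current-run evidence exists (for example, override
// replay skipped the work units). An empty array means evidence exists and
// references no tables.
function tablesToMarkStale(
  tablesWithUsage: string[],
  currentRunEvidenceTables: string[] | undefined,
): string[] {
  if (currentRunEvidenceTables === undefined) {
    return []; // missing evidence is a no-op, not negative evidence
  }
  const seen = new Set(currentRunEvidenceTables);
  return tablesWithUsage.filter((table) => !seen.has(table));
}
```

The same guard would apply to pattern-page staleness and archive decisions.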
## Reports and observability
Reports should describe first-class pipeline phases, not historical extension
points. The isolated-diff summary should include finalization metadata when the
adapter implements `finalize?()`: whether it ran, finalization commit SHA,
touched paths, touched semantic-layer sources, changed wiki page keys,
warnings, descriptive finalization actions, and source-specific result payload.
Saved-memory counts should come from work-unit, reconciliation, and
finalization memory actions plus touched artifact reporting. Finalization
actions are reporting/provenance records for writes that already happened in
the integration worktree; they are not a second write channel. There should be
no special `postProcessorSavedMemoryCounts` or `postProcessor` report body.
Memory-flow phases should use `finalization` instead of `post_processor`.
The runner owns provenance for finalization. Adapters return touched artifacts
and optional descriptive actions, but they do not call the provenance port.
When finalization actions include valid `rawPaths`, the runner folds them into
the normal provenance plan using the current `sourceKey`, `syncId`, raw content
hashes, artifact kind, artifact key, target connection, and action type. The
finalization phase and commit SHA belong in trace/report metadata; they should
not be fabricated inside adapter-written files.
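
The raw-path rule could look roughly like this in the runner. `FinalizationAction`, `ProvenanceRow`, and `planFinalizationProvenance` are hypothetical, the action fields are assumed because `MemoryAction` is not defined in this document, and content hashes and target connection are omitted for brevity.

```ts
// Assumed subset of MemoryAction fields relevant to provenance.
interface FinalizationAction {
  artifactKind: string;
  artifactKey: string;
  actionType: string;
  rawPaths?: string[];
}

interface ProvenanceRow {
  sourceKey: string;
  syncId: string;
  artifactKind: string;
  artifactKey: string;
  actionType: string;
  rawPath: string;
}

// Produces provenance rows only for actions whose rawPaths are all valid
// (current-snapshot or eviction paths). Other actions remain report-only.
function planFinalizationProvenance(
  actions: FinalizationAction[],
  validRawPaths: ReadonlySet<string>,
  run: { sourceKey: string; syncId: string },
): ProvenanceRow[] {
  const rows: ProvenanceRow[] = [];
  for (const action of actions) {
    const rawPaths = action.rawPaths ?? [];
    const attributable =
      rawPaths.length > 0 && rawPaths.every((path) => validRawPaths.has(path));
    if (!attributable) continue;
    for (const rawPath of rawPaths) {
      rows.push({
        sourceKey: run.sourceKey,
        syncId: run.syncId,
        artifactKind: action.artifactKind,
        artifactKey: action.artifactKey,
        actionType: action.actionType,
        rawPath,
      });
    }
  }
  return rows;
}
```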
Traces must make finalization useful for postmortems. At minimum, record
`finalization_started`, `finalization_committed`, `finalization_skipped`, and
`finalization_failed` events with source key, touched paths, warnings, and
error summaries.
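
A sketch of how the four events might be derived from a phase outcome. The event names come from this design. `FinalizationTraceEvent`, `FinalizationPhaseOutcome`, and both helper functions are hypothetical.

```ts
type FinalizationEventName =
  | 'finalization_started'
  | 'finalization_committed'
  | 'finalization_skipped'
  | 'finalization_failed';

interface FinalizationTraceEvent {
  eventName: FinalizationEventName;
  sourceKey: string;
  touchedPaths: string[];
  warnings: string[];
  errorSummaries: string[];
}

// Hypothetical summary of how the finalization phase ended.
type FinalizationPhaseOutcome =
  | { kind: 'skipped' }
  | { kind: 'committed'; touchedPaths: string[]; warnings: string[] }
  | { kind: 'failed'; touchedPaths: string[]; warnings: string[]; errors: string[] };

// Builds one event, defaulting every detail list to empty.
function traceEvent(
  eventName: FinalizationEventName,
  sourceKey: string,
  details: { touchedPaths?: string[]; warnings?: string[]; errorSummaries?: string[] } = {},
): FinalizationTraceEvent {
  return {
    eventName,
    sourceKey,
    touchedPaths: details.touchedPaths ?? [],
    warnings: details.warnings ?? [],
    errorSummaries: details.errorSummaries ?? [],
  };
}

function finalizationTraceEvents(
  sourceKey: string,
  outcome: FinalizationPhaseOutcome,
): FinalizationTraceEvent[] {
  if (outcome.kind === 'skipped') {
    return [traceEvent('finalization_skipped', sourceKey)];
  }
  const started = traceEvent('finalization_started', sourceKey);
  if (outcome.kind === 'committed') {
    return [
      started,
      traceEvent('finalization_committed', sourceKey, {
        touchedPaths: outcome.touchedPaths,
        warnings: outcome.warnings,
      }),
    ];
  }
  return [
    started,
    traceEvent('finalization_failed', sourceKey, {
      touchedPaths: outcome.touchedPaths,
      warnings: outcome.warnings,
      errorSummaries: outcome.errors,
    }),
  ];
}
```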
## Failure handling
Finalization failures are ingestion failures. If `finalize?()` returns errors,
throws, writes unauthorized targets, or causes final gates to fail, the runner
marks the run failed and leaves the main project worktree unchanged.
Finalization should run after reconciliation because it may need to inspect the
accepted work-unit and reconciliation result. Final gates should run after
finalization because finalization writes durable project artifacts.
Finalization must not be used to repair arbitrary integration conflicts or
rerun agent work. Conflict repair remains part of artifact-aware integration and
reconciliation.
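
The failure rules above can be sketched as a single wrapper that turns returned errors, thrown exceptions, and gate failures into one failed-run result. The sketch is synchronous for brevity, whereas the real hook returns a `Promise`. `FinalizationPhaseResult` and `runFinalizationPhase` are hypothetical names, and target-policy checks are assumed to be part of `finalGates`.

```ts
type FinalizationPhaseResult =
  | { status: 'ok' }
  | { status: 'failed'; reason: string };

// `finalize` stands in for adapter.finalize?(); `finalGates` stands in for
// the runner's target-policy and artifact gates and returns failure messages.
function runFinalizationPhase(
  finalize: () => { errors: string[] },
  finalGates: () => string[],
): FinalizationPhaseResult {
  try {
    const result = finalize();
    if (result.errors.length > 0) {
      return {
        status: 'failed',
        reason: `finalize returned errors: ${result.errors.join('; ')}`,
      };
    }
  } catch (error) {
    const message = error instanceof Error ? error.message : String(error);
    return { status: 'failed', reason: `finalize threw: ${message}` };
  }
  // Gates run after finalization because finalization writes durable files.
  const gateFailures = finalGates();
  if (gateFailures.length > 0) {
    return {
      status: 'failed',
      reason: `final gates failed: ${gateFailures.join('; ')}`,
    };
  }
  return { status: 'ok' };
}
```

A `failed` result means the caller skips the squash, which leaves the main project worktree unchanged.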
## Acceptance criteria
The implementation is complete when these conditions are true:
- No production runtime wiring references `deps.postProcessors`.
- `IngestBundlePostProcessorPort` and `HistoricSqlProjectionPostProcessor` are
removed from source exports and package export tests.
- `SourceAdapter.finalize?()` exists with typed context and result objects.
- The runner invokes `finalize?()` after reconciliation and before final gates.
- Finalization changes are committed in the integration worktree and included
in target-policy checks, final gates, reports, traces, and provenance inputs.
- Override replay passes explicit override metadata to finalization and leaves
`workUnitOutcomes` empty when work units are skipped, and tests prove that
historic-SQL finalization does not mark artifacts stale or archive them merely
because current-run evidence is missing.
- `wiki_sl_ref_repair` remains a runner-owned step after finalization and
before final gates, consumes finalization touched sources, and has its writes
covered by target-policy checks and final gates.
- Finalization `actions` are not re-applied by the runner; they are included
only in reporting, saved-memory counts, and provenance planning when their
raw-path attribution is valid.
- Historic SQL uses adapter-owned finalization for whole-run projection
maintenance.
- Tests cover a successful finalization, a finalization failure, unauthorized
finalization target rejection, override replay finalization behavior,
wiki-SL-ref repair placement, and historic-SQL projection behavior without
runner-level post-processors.