diff --git a/docs/superpowers/specs/2026-05-18-adapter-owned-ingest-finalization-design.md b/docs/superpowers/specs/2026-05-18-adapter-owned-ingest-finalization-design.md
new file mode 100644
index 00000000..3848c3f9
--- /dev/null
+++ b/docs/superpowers/specs/2026-05-18-adapter-owned-ingest-finalization-design.md
@@ -0,0 +1,352 @@
+# Adapter-owned ingest finalization design
+
+**Date:** 2026-05-18
+**Author:** Andrey Avtomonov
+**Status:** Design - pending implementation plan
+
+## Background
+
+The isolated-diff ingestion migration made KTX's shared bundle runner
+responsible for one durable execution model: stage raw source data, run
+source-planned work units in isolated child worktrees, integrate their diffs,
+reconcile, run final gates, and squash the accepted integration tree back into
+the project worktree.
+
+That direction is correct, but the current code still has a runner-level
+post-processing extension point. `IngestBundleRunnerDeps.postProcessors` maps a
+source key to an arbitrary `IngestBundlePostProcessorPort`, and local runtime
+wires `historic-sql` to `HistoricSqlProjectionPostProcessor`. That path can
+write durable semantic-layer and wiki artifacts after work-unit integration and
+reconciliation, outside the source adapter contract.
+
+Historic SQL exposed why the extra path exists. Its table and pattern work units
+emit typed evidence, then a deterministic projection step merges the evidence
+into `_schema` usage and historic-SQL wiki pages. Some of that work is local to
+one work unit, but other behavior is whole-run maintenance: marking stale table
+usage, reusing existing pattern pages, and archiving old pattern pages. Those
+aggregate decisions do not fit cleanly inside independent per-work-unit writes.
+
+The design goal is to preserve legitimate adapter-owned deterministic
+maintenance without keeping a generic runner-level escape hatch.
+
+## Goals
+
+This design tightens the isolated-diff architecture around a stable boundary:
+the generic runner owns execution mechanics, and adapters own source semantics.
+
+The design has these goals:
+
+- Remove runner-level `postProcessors` as an alternate durable-write pipeline.
+- Add a first-class `SourceAdapter.finalize?()` hook for deterministic
+  post-work-unit source maintenance.
+- Keep `finalize?()` constrained, observable, and subject to the same final
+  validation gates as work-unit and reconciliation changes.
+- Preserve historic-SQL aggregate projection behavior without treating it as a
+  hidden fallback ingestion path.
+- Keep public execution knobs out of the adapter API.
+
+## Non-goals
+
+This design does not rework source-specific chunking, fetch formats, wiki page
+frontmatter, semantic-layer YAML, or raw source layouts. It does not replace
+agent-authored work units with deterministic projectors. It also does not add a
+public `executionMode`, `planningStrategy`, `conflictPolicy`, or source-key
+allowlist.
+
+Override ingest remains a special correction operation that reuses a prior raw
+snapshot and forces reconciliation. It should be documented and tested as
+override replay, not as a fallback pipeline. This design does not require
+override ingest to run source work units.
+
+## Locked design direction
+
+The shared ingestion runner keeps one ordered pipeline for sources that can
+write durable project artifacts.
+
+```text
+fetch raw
+  -> adapter plans WorkUnit[]
+  -> optional adapter project
+  -> isolated WU diffs
+  -> artifact-aware integration
+  -> reconciliation
+  -> optional adapter finalize
+  -> runner wiki-SL-ref repair
+  -> final target policy and artifact gates
+  -> squash
+```
+
+The exact implementation may continue to call `chunk()` before `project()` so a
+projector can consume `parseArtifacts`.
+The architectural invariant is that
+`project()` runs in the integration worktree before child worktrees start, while
+`finalize()` runs in the integration worktree after accepted work-unit and
+reconciliation changes are present.
+
+Adapters decide what source-specific work belongs in `project()`, work units,
+or `finalize()`. The runner decides when those phases run, captures their git
+effects, enforces target scope, runs gates, writes traces and reports, and
+squashes the final tree.
+
+## Adapter API
+
+The source adapter contract should make deterministic source phases explicit.
+
+```ts
+interface SourceAdapter {
+  readonly source: string;
+  readonly skillNames: string[];
+  readonly reconcileSkillNames?: string[];
+  readonly evidenceIndexing?: 'documents';
+  readonly triageSupported?: boolean;
+
+  getTriageSignals?(stagedDir: string, externalId: string): Promise;
+  detect(stagedDir: string): Promise;
+  fetch?(pullConfig: unknown, stagedDir: string, ctx: FetchContext): Promise;
+  readFetchReport?(stagedDir: string): Promise;
+  listTargetConnectionIds?(stagedDir: string): Promise;
+  chunk(stagedDir: string, diffSet?: DiffSet): Promise;
+  clusterWorkUnits?(ctx: ClusterWorkUnitsContext): Promise;
+  project?(ctx: DeterministicProjectionContext): Promise;
+  finalize?(ctx: DeterministicFinalizationContext): Promise;
+  describeScope?(stagedDir: string): Promise;
+  onPullSucceeded?(ctx: PullSucceededContext): Promise;
+}
+```
+
+`finalize?()` is not a compatibility wrapper for old post-processors. It is a
+source-adapter method with a fixed location in the runner lifecycle.
+
+```ts
+interface DeterministicFinalizationContext {
+  connectionId: string;
+  sourceKey: string;
+  syncId: string;
+  jobId: string;
+  runId: string;
+  stagedDir: string;
+  workdir: string;
+  parseArtifacts?: unknown;
+  stageIndex: StageIndex;
+  workUnitOutcomes: WorkUnitOutcome[];
+  reconciliationActions: MemoryAction[];
+  overrideReplay?: FinalizationOverrideReplay;
+  semanticLayerService: SemanticLayerService;
+}
+
+interface FinalizationResult {
+  warnings: string[];
+  errors: string[];
+  touchedSources: TouchedSlSource[];
+  changedWikiPageKeys: string[];
+  actions?: MemoryAction[];
+  result?: unknown;
+}
+
+interface FinalizationOverrideReplay {
+  priorJobId: string;
+  priorRunId: string;
+  priorSyncId: string;
+}
+```
+
+The implementation plan can adjust exact type names to match the existing
+module layout, but the contract must preserve these semantics:
+
+- `finalize?()` is deterministic TypeScript code, not an agent loop.
+- It runs only in the ingestion integration worktree.
+- It may write ordinary durable project files.
+- It must report touched semantic-layer sources and wiki page keys.
+- `stageIndex` is the canonical runner index for accepted work-unit actions,
+  touched sources, and reconciliation records visible to the current run. In an
+  override replay it may be rebuilt from the prior report.
+- `workUnitOutcomes` contains only work units executed in the current run. It
+  is empty when override replay skips source work units.
+- `reconciliationActions` contains only accepted reconciliation writes emitted
+  through the reconciliation tool session in the current run. These actions have
+  already mutated the integration worktree.
+- `actions` in `FinalizationResult` are descriptive records for finalization
+  writes that the adapter already performed. The runner must not re-apply them.
+  When finalization actions are intended to create provenance rows, they must
+  carry valid current-snapshot or eviction `rawPaths`.
+- It cannot mutate the main project worktree directly.
+
+The existing adapter API fields unrelated to deterministic projection and
+finalization remain part of the contract. Adding `finalize?()` must not remove
+triage or evidence-indexing support.
+
+## Override replay
+
+Override ingest remains a replay of a prior raw snapshot with forced
+reconciliation. It does not execute source work units, so finalization must not
+silently assume fresh work-unit evidence exists.
+
+The runner should still enter the finalization phase for adapters that
+implement `finalize?()`, but it must pass explicit override metadata. In that
+mode, `workUnitOutcomes` is empty, `parseArtifacts` is absent unless the runner
+created fresh parse artifacts in the current run, `stageIndex` comes from the
+prior report, and `reconciliationActions` contains only new override
+reconciliation actions.
+
+Adapters must treat missing current-run deterministic inputs as a no-op, not as
+negative evidence. For historic SQL, override replay must not mark tables stale,
+mark pattern pages stale, or archive pattern pages from an empty current-run
+evidence directory. Any override-safe finalization must be derived from the
+materialized raw snapshot or explicit prior-report data, not from the absence of
+fresh work-unit evidence.
+
+## Runner responsibilities
+
+The runner owns all reusable mechanics around `finalize?()`.
+
+After reconciliation completes, the runner calls `adapter.finalize?()` if it
+exists. The runner then commits any reported or discovered finalization changes
+in the integration worktree, records the commit SHA and touched paths in the
+run trace/report, includes finalization actions in saved-memory counts, and
+runs wiki-SL-ref repair before final target-policy and artifact gates.
+
+`wiki_sl_ref_repair` remains a runner mechanic, not an adapter method.
+It runs
+after finalization and before final gates, and it uses the normal target
+connection set plus `FinalizationResult.touchedSources` to decide which
+semantic-layer references are visible. Its writes are part of the same
+integration worktree diff as finalization/reconciliation, so target-policy
+checks, final artifact gates, reports, traces, and squash behavior cover those
+writes before changes reach the main project worktree.
+
+The runner must treat finalization like deterministic projection and
+reconciliation, not like a free-form source-key plug-in. It must enforce the
+same target-connection policy used for work-unit and reconciliation changes.
+If finalization writes an unauthorized semantic-layer target, references a
+missing semantic-layer entity, or returns errors, the run fails before changes
+reach the main project worktree.
+
+The runner should expose one trace phase named `finalization`. It should not
+keep a `post_processor` stage, `IngestBundlePostProcessorPort`,
+`deps.postProcessors`, or report fields that imply a parallel post-processor
+pipeline.
+
+## Adapter application
+
+Each adapter continues to use the same generic runner mechanics, while keeping
+source-specific choices inside the adapter.
+
+- `metabase` fetches cards and dashboards, computes scope, plans
+  card/dashboard work units, and usually does not need `project()` or
+  `finalize()`.
+- `notion` fetches pages, extracts triage signals, clusters page work units,
+  and usually does not need deterministic finalization.
+- `dbt` fetches the repository, parses dbt project metadata, plans model work
+  units, and may later add `project()` if dbt YAML import becomes deterministic.
+- `lookml` fetches LookML, produces validation artifacts, plans model and
+  explore work units, and may later add `project()` for deterministic LookML to
+  semantic-layer import.
+- `looker` fetches runtime bundles, fetch reports, target connections, and
+  triage signals.
+  It continues to rely on work-unit diffs and shared gates.
+- `metricflow` is the current strong `project()` example. It imports
+  authoritative semantic models before child worktrees start, then lets any
+  work units observe those projected files.
+- `live-database` can remain work-unit based, but database schema introspection
+  is a good future `project()` candidate because the schema is authoritative
+  structured metadata.
+- `historic-sql` should move current post-processor behavior into the adapter.
+  Local table-usage and pattern-page writes may move into work-unit tools where
+  they are genuinely per-unit. Whole-run maintenance such as stale table usage,
+  pattern-page reuse, and stale/archive page decisions belongs in
+  `HistoricSqlSourceAdapter.finalize()`.
+- `fake` remains a test adapter and does not need deterministic phases.
+
+## Historic-SQL migration
+
+Historic SQL should stop using evidence-only tool output plus runner-level
+post-processing as its durable projection path.
+
+The preferred migration is:
+
+1. Keep historic-SQL work units responsible for source-shaped analysis.
+2. Use source-specific tools for per-unit durable writes when the output is
+   local to that unit, such as a table's usage metadata or one pattern page.
+3. Move whole-run deterministic cleanup into
+   `HistoricSqlSourceAdapter.finalize()`.
+4. Delete `HistoricSqlProjectionPostProcessor`, `IngestBundlePostProcessorPort`,
+   `deps.postProcessors`, and `post_processor` memory-flow/report stages.
+
+If the implementation keeps typed evidence as an internal handoff between
+historic-SQL work units and `finalize()`, that evidence must be framed as
+source-specific input to the adapter's deterministic finalization, not as a
+generic runner post-processing mechanism. The evidence files must not become a
+public compatibility surface.
+
+Historic-SQL finalization must distinguish "no current-run evidence exists"
+from "the current snapshot proves this artifact is stale."
+Whole-run cleanup
+such as stale table usage, pattern-page staleness, and archive decisions can
+run only when finalization has current-run historic-SQL evidence or an explicit
+override-safe source of equivalent facts.
+
+## Reports and observability
+
+Reports should describe first-class pipeline phases, not historical extension
+points. The isolated-diff summary should include finalization metadata when the
+adapter implements `finalize?()`: whether it ran, finalization commit SHA,
+touched paths, touched semantic-layer sources, changed wiki page keys,
+warnings, descriptive finalization actions, and source-specific result payload.
+
+Saved-memory counts should come from work-unit, reconciliation, and
+finalization memory actions plus touched artifact reporting. Finalization
+actions are reporting/provenance records for writes that already happened in
+the integration worktree; they are not a second write channel. There should be
+no special `postProcessorSavedMemoryCounts` or `postProcessor` report body.
+Memory-flow phases should use `finalization` instead of `post_processor`.
+
+The runner owns provenance for finalization. Adapters return touched artifacts
+and optional descriptive actions, but they do not call the provenance port.
+When finalization actions include valid `rawPaths`, the runner folds them into
+the normal provenance plan using the current `sourceKey`, `syncId`, raw content
+hashes, artifact kind, artifact key, target connection, and action type. The
+finalization phase and commit SHA belong in trace/report metadata; they should
+not be fabricated inside adapter-written files.
+
+Traces must make finalization useful for postmortems. At minimum, record
+`finalization_started`, `finalization_committed`, `finalization_skipped`, and
+`finalization_failed` events with source key, touched paths, warnings, and
+error summaries.
+
+## Failure handling
+
+Finalization failures are ingestion failures.
+If `finalize?()` returns errors,
+throws, writes unauthorized targets, or causes final gates to fail, the runner
+marks the run failed and leaves the main project worktree unchanged.
+
+Finalization should run after reconciliation because it may need to inspect the
+accepted work-unit and reconciliation result. Final gates should run after
+finalization because finalization writes durable project artifacts.
+
+Finalization must not be used to repair arbitrary integration conflicts or
+rerun agent work. Conflict repair remains part of artifact-aware integration and
+reconciliation.
+
+## Acceptance criteria
+
+The implementation is complete when these conditions are true:
+
+- No production runtime wiring references `deps.postProcessors`.
+- `IngestBundlePostProcessorPort` and `HistoricSqlProjectionPostProcessor` are
+  removed from source exports and package export tests.
+- `SourceAdapter.finalize?()` exists with typed context and result objects.
+- The runner invokes `finalize?()` after reconciliation and before final gates.
+- Finalization changes are committed in the integration worktree and included
+  in target-policy checks, final gates, reports, traces, and provenance inputs.
+- Override replay passes explicit override metadata to finalization, leaves
+  `workUnitOutcomes` empty when work units are skipped, and proves historic-SQL
+  finalization does not stale or archive artifacts from missing current-run
+  evidence.
+- `wiki_sl_ref_repair` remains a runner-owned step after finalization and
+  before final gates, consumes finalization touched sources, and has its writes
+  covered by target-policy checks and final gates.
+- Finalization `actions` are not re-applied by the runner; they are included
+  only in reporting, saved-memory counts, and provenance planning when their
+  raw-path attribution is valid.
+- Historic SQL uses adapter-owned finalization for whole-run projection
+  maintenance.
+- Tests cover a successful finalization, a finalization failure, unauthorized
+  finalization target rejection, override replay finalization behavior,
+  wiki-SL-ref repair placement, and historic-SQL projection behavior without
+  runner-level post-processors.
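
The override-replay rule in the spec (missing current-run evidence is a no-op, never negative evidence) could be sketched roughly as follows. This is only an illustration under stated assumptions: `StaleTableInput`, `StaleTablePlan`, and `planStaleTables` are hypothetical names that do not appear in the spec, and the shapes are far smaller than the proposed `DeterministicFinalizationContext` and `FinalizationResult`.

```typescript
// Hypothetical, simplified shapes for illustration only. The spec's real
// finalization context and result types carry many more fields.
interface StaleTableInput {
  // Tables that already carry usage metadata in the project.
  knownTables: string[];
  // Tables observed by current-run evidence. `undefined` means that no
  // current-run evidence exists at all (for example, during override replay).
  currentRunTables?: string[];
  // Present only when the runner replays a prior raw snapshot.
  overrideReplay?: { priorJobId: string; priorRunId: string; priorSyncId: string };
}

interface StaleTablePlan {
  staleTables: string[];
  warnings: string[];
}

function planStaleTables(input: StaleTableInput): StaleTablePlan {
  // Absent evidence is treated as "unknown", so nothing is marked stale.
  if (input.currentRunTables === undefined) {
    const reason = input.overrideReplay
      ? "override replay without current-run evidence"
      : "no current-run evidence";
    return { staleTables: [], warnings: [`stale-table cleanup skipped: ${reason}`] };
  }
  const seen = new Set(input.currentRunTables);
  // Only tables that present evidence fails to mention are marked stale.
  const staleTables = input.knownTables.filter((table) => !seen.has(table)).sort();
  return { staleTables, warnings: [] };
}
```

In this sketch the distinction between "no evidence" and "evidence that omits a table" is carried by `undefined` versus an array, which is one possible way to keep an empty evidence directory from being read as proof of staleness.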