Refine adapter-owned ingest finalization design after adversarial review iteration 1

2026-06-13 08:15:14 +02:00 · 2026-05-18 15:11:20 +02:00 · 2026-05-18 15:11:20 +02:00 · fd8d6a1134
commit fd8d6a1134
parent e64da5a85d
1 changed files with 352 additions and 0 deletions
--- a/docs/superpowers/specs/2026-05-18-adapter-owned-ingest-finalization-design.md
+++ b/docs/superpowers/specs/2026-05-18-adapter-owned-ingest-finalization-design.md
@@ -0,0 +1,352 @@
+# Adapter-owned ingest finalization design
+
+**Date:** 2026-05-18
+**Author:** Andrey Avtomonov
+**Status:** Design - pending implementation plan
+
+## Background
+
+The isolated-diff ingestion migration made KTX's shared bundle runner
+responsible for one durable execution model: stage raw source data, run
+source-planned work units in isolated child worktrees, integrate their diffs,
+reconcile, run final gates, and squash the accepted integration tree back into
+the project worktree.
+
+That direction is correct, but the current code still has a runner-level
+post-processing extension point. `IngestBundleRunnerDeps.postProcessors` maps a
+source key to an arbitrary `IngestBundlePostProcessorPort`, and local runtime
+wires `historic-sql` to `HistoricSqlProjectionPostProcessor`. That path can
+write durable semantic-layer and wiki artifacts after work-unit integration and
+reconciliation, outside the source adapter contract.
+
+Historic SQL exposed why the extra path exists. Its table and pattern work units
+emit typed evidence, then a deterministic projection step merges the evidence
+into `_schema` usage and historic-SQL wiki pages. Some of that work is local to
+one work unit, but other behavior is whole-run maintenance: marking stale table
+usage, reusing existing pattern pages, and archiving old pattern pages. Those
+aggregate decisions do not fit cleanly inside independent per-work-unit writes.
+
+The design goal is to preserve legitimate adapter-owned deterministic
+maintenance without keeping a generic runner-level escape hatch.
+
+## Goals
+
+This design tightens the isolated-diff architecture around a stable boundary:
+the generic runner owns execution mechanics, and adapters own source semantics.
+
+The design has these goals:
+
+- Remove runner-level `postProcessors` as an alternate durable-write pipeline.
+- Add a first-class `SourceAdapter.finalize?()` hook for deterministic
+  post-work-unit source maintenance.
+- Keep `finalize?()` constrained, observable, and subject to the same final
+  validation gates as work-unit and reconciliation changes.
+- Preserve historic-SQL aggregate projection behavior without treating it as a
+  hidden fallback ingestion path.
+- Keep public execution knobs out of the adapter API.
+
+## Non-goals
+
+This design does not rework source-specific chunking, fetch formats, wiki page
+frontmatter, semantic-layer YAML, or raw source layouts. It does not replace
+agent-authored work units with deterministic projectors. It also does not add a
+public `executionMode`, `planningStrategy`, `conflictPolicy`, or source-key
+allowlist.
+
+Override ingest remains a special correction operation that reuses a prior raw
+snapshot and forces reconciliation. It should be documented and tested as
+override replay, not as a fallback pipeline. This design does not require
+override ingest to run source work units.
+
+## Locked design direction
+
+The shared ingestion runner keeps one ordered pipeline for sources that can
+write durable project artifacts.
+
+```text
+fetch raw
+  -> adapter plans WorkUnit[]
+  -> optional adapter project
+  -> isolated WU diffs
+  -> artifact-aware integration
+  -> reconciliation
+  -> optional adapter finalize
+  -> runner wiki-SL-ref repair
+  -> final target policy and artifact gates
+  -> squash
+```
+
+The exact implementation may continue to call `chunk()` before `project()` so a
+projector can consume `parseArtifacts`. The architectural invariant is that
+`project()` runs in the integration worktree before child worktrees start, while
+`finalize()` runs in the integration worktree after accepted work-unit and
+reconciliation changes are present.
+
+Adapters decide what source-specific work belongs in `project()`, work units,
+or `finalize()`. The runner decides when those phases run, captures their git
+effects, enforces target scope, runs gates, writes traces and reports, and
+squashes the final tree.
+
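As a rough illustration of the ordering described above, the following TypeScript sketch lists the phases a runner would execute for a given adapter. The `PipelinePhase` names, `PhaseHooks`, and `plannedPhases` are hypothetical names introduced only for this example; they are not part of the runner's actual API.

```typescript
// Hypothetical phase names mirroring the pipeline diagram in this section.
type PipelinePhase =
  | "fetch"
  | "plan"
  | "project"
  | "work_units"
  | "integration"
  | "reconciliation"
  | "finalization"
  | "wiki_sl_ref_repair"
  | "final_gates"
  | "squash";

// Assumed adapter shape: only the optional deterministic hooks affect ordering.
interface PhaseHooks {
  project?: () => void;
  finalize?: () => void;
}

// Returns the phases in execution order, skipping hooks the adapter lacks.
function plannedPhases(adapter: PhaseHooks): PipelinePhase[] {
  const phases: PipelinePhase[] = ["fetch", "plan"];
  if (adapter.project) phases.push("project");
  phases.push("work_units", "integration", "reconciliation");
  if (adapter.finalize) phases.push("finalization");
  phases.push("wiki_sl_ref_repair", "final_gates", "squash");
  return phases;
}
```

In this sketch, an adapter that implements only `finalize()` gets `finalization` immediately after `reconciliation` and immediately before `wiki_sl_ref_repair`, which matches the invariant stated above.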
+## Adapter API
+
+The source adapter contract should make deterministic source phases explicit.
+
+```ts
+interface SourceAdapter {
+  readonly source: string;
+  readonly skillNames: string[];
+  readonly reconcileSkillNames?: string[];
+  readonly evidenceIndexing?: 'documents';
+  readonly triageSupported?: boolean;
+
+  getTriageSignals?(stagedDir: string, externalId: string): Promise<TriageSignals>;
+  detect(stagedDir: string): Promise<boolean>;
+  fetch?(pullConfig: unknown, stagedDir: string, ctx: FetchContext): Promise<void>;
+  readFetchReport?(stagedDir: string): Promise<SourceFetchReport | null>;
+  listTargetConnectionIds?(stagedDir: string): Promise<string[]>;
+  chunk(stagedDir: string, diffSet?: DiffSet): Promise<ChunkResult>;
+  clusterWorkUnits?(ctx: ClusterWorkUnitsContext): Promise<WorkUnit[]>;
+  project?(ctx: DeterministicProjectionContext): Promise<ProjectionResult>;
+  finalize?(ctx: DeterministicFinalizationContext): Promise<FinalizationResult>;
+  describeScope?(stagedDir: string): Promise<ScopeDescriptor>;
+  onPullSucceeded?(ctx: PullSucceededContext): Promise<void>;
+}
+```
+
+`finalize?()` is not a compatibility wrapper for old post-processors. It is a
+source-adapter method with a fixed location in the runner lifecycle.
+
+```ts
+interface DeterministicFinalizationContext {
+  connectionId: string;
+  sourceKey: string;
+  syncId: string;
+  jobId: string;
+  runId: string;
+  stagedDir: string;
+  workdir: string;
+  parseArtifacts?: unknown;
+  stageIndex: StageIndex;
+  workUnitOutcomes: WorkUnitOutcome[];
+  reconciliationActions: MemoryAction[];
+  overrideReplay?: FinalizationOverrideReplay;
+  semanticLayerService: SemanticLayerService;
+}
+
+interface FinalizationResult {
+  warnings: string[];
+  errors: string[];
+  touchedSources: TouchedSlSource[];
+  changedWikiPageKeys: string[];
+  actions?: MemoryAction[];
+  result?: unknown;
+}
+
+interface FinalizationOverrideReplay {
+  priorJobId: string;
+  priorRunId: string;
+  priorSyncId: string;
+}
+```
+
+The implementation plan can adjust exact type names to match the existing
+module layout, but the contract must preserve these semantics:
+
+- `finalize?()` is deterministic TypeScript code, not an agent loop.
+- It runs only in the ingestion integration worktree.
+- It may write ordinary durable project files.
+- It must report touched semantic-layer sources and wiki page keys.
+- `stageIndex` is the canonical runner index for accepted work-unit actions,
+  touched sources, and reconciliation records visible to the current run. In an
+  override replay it may be rebuilt from the prior report.
+- `workUnitOutcomes` contains only work units executed in the current run. It
+  is empty when override replay skips source work units.
+- `reconciliationActions` contains only accepted reconciliation writes emitted
+  through the reconciliation tool session in the current run. These actions have
+  already mutated the integration worktree.
+- `actions` in `FinalizationResult` are descriptive records for finalization
+  writes that the adapter already performed. The runner must not re-apply them.
+  When finalization actions are intended to create provenance rows, they must
+  carry valid current-snapshot or eviction `rawPaths`.
+- It cannot mutate the main project worktree directly.
+
+The existing adapter API fields unrelated to deterministic projection and
+finalization remain part of the contract. Adding `finalize?()` must not remove
+triage or evidence-indexing support.
+
+## Override replay
+
+Override ingest remains a replay of a prior raw snapshot with forced
+reconciliation. It does not execute source work units, so finalization must not
+silently assume fresh work-unit evidence exists.
+
+The runner should still enter the finalization phase for adapters that
+implement `finalize?()`, but it must pass explicit override metadata. In that
+mode, `workUnitOutcomes` is empty, `parseArtifacts` is absent unless the runner
+created fresh parse artifacts in the current run, `stageIndex` comes from the
+prior report, and `reconciliationActions` contains only new override
+reconciliation actions.
+
+Adapters must treat missing current-run deterministic inputs as a no-op, not as
+negative evidence. For historic SQL, override replay must not mark tables stale,
+mark pattern pages stale, or archive pattern pages from an empty current-run
+evidence directory. Any override-safe finalization must be derived from the
+materialized raw snapshot or explicit prior-report data, not from the absence of
+fresh work-unit evidence.
+
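The no-op rule above can be sketched as a guard inside an adapter's finalization logic. This is a minimal sketch, not the historic-SQL implementation: `ReplayAwareContext` is a trimmed-down stand-in for `DeterministicFinalizationContext`, and the `tablesSeen` field and `staleTableCandidates` helper are hypothetical.

```typescript
// Assumed, trimmed-down view of DeterministicFinalizationContext.
// `tablesSeen` is a hypothetical per-work-unit evidence field.
interface ReplayAwareContext {
  workUnitOutcomes: { tablesSeen: string[] }[];
  overrideReplay?: { priorJobId: string; priorRunId: string; priorSyncId: string };
}

// Returns the known tables that may be marked stale in this run.
function staleTableCandidates(ctx: ReplayAwareContext, knownTables: string[]): string[] {
  // Missing current-run evidence is a no-op, not negative evidence.
  if (ctx.overrideReplay !== undefined || ctx.workUnitOutcomes.length === 0) {
    return [];
  }
  const seen: string[] = [];
  for (const outcome of ctx.workUnitOutcomes) {
    for (const table of outcome.tablesSeen) seen.push(table);
  }
  // Only tables absent from fresh evidence are candidates.
  return knownTables.filter((table) => seen.indexOf(table) === -1);
}
```

The sketch returns an empty list for any override replay. An adapter that derives equivalent facts from the materialized raw snapshot or prior-report data could relax that guard, which this example does not model.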
+## Runner responsibilities
+
+The runner owns all reusable mechanics around `finalize?()`.
+
+After reconciliation completes, the runner calls `adapter.finalize?()` if it
+exists. The runner then commits any reported or discovered finalization changes
+in the integration worktree, records the commit SHA and touched paths in the
+run trace/report, includes finalization actions in saved-memory counts, and
+runs wiki-SL-ref repair before final target-policy and artifact gates.
+
+`wiki_sl_ref_repair` remains a runner mechanic, not an adapter method. It runs
+after finalization and before final gates, and it uses the normal target
+connection set plus `FinalizationResult.touchedSources` to decide which
+semantic-layer references are visible. Its writes are part of the same
+integration worktree diff as finalization/reconciliation, so target-policy
+checks, final artifact gates, reports, traces, and squash behavior cover those
+writes before changes reach the main project worktree.
+
+The runner must treat finalization like deterministic projection and
+reconciliation, not like a free-form source-key plug-in. It must enforce the
+same target-connection policy used for work-unit and reconciliation changes.
+If finalization writes an unauthorized semantic-layer target, references a
+missing semantic-layer entity, or returns errors, the run fails before changes
+reach the main project worktree.
+
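One way to picture the target-policy check is a pure function that turns a finalization result into a list of failure reasons. The shapes below are assumptions for illustration: the real `TouchedSlSource` type is not shown in this design, so `TouchedSourceSketch` and both helper functions are hypothetical. The missing-entity check is not modeled.

```typescript
// Hypothetical minimal shape for a touched semantic-layer source.
interface TouchedSourceSketch {
  connectionId: string;
  sourceName: string;
}

// Touched sources whose connection is outside the authorized target set.
function findUnauthorizedTargets(
  touched: TouchedSourceSketch[],
  allowedConnectionIds: string[],
): TouchedSourceSketch[] {
  return touched.filter((s) => allowedConnectionIds.indexOf(s.connectionId) === -1);
}

// Collects every reason the run should fail before squash; empty means pass.
function finalizationFailureReasons(
  result: { errors: string[]; touchedSources: TouchedSourceSketch[] },
  allowedConnectionIds: string[],
): string[] {
  const reasons = result.errors.slice();
  for (const s of findUnauthorizedTargets(result.touchedSources, allowedConnectionIds)) {
    reasons.push(`unauthorized semantic-layer target: ${s.connectionId}/${s.sourceName}`);
  }
  return reasons;
}
```

A non-empty list would cause the runner to mark the run failed and leave the main project worktree untouched.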
+The runner should expose one trace phase named `finalization`. It should not
+keep a `post_processor` stage, `IngestBundlePostProcessorPort`,
+`deps.postProcessors`, or report fields that imply a parallel post-processor
+pipeline.
+
+## Adapter application
+
+Each adapter continues to use the same generic runner mechanics, while keeping
+source-specific choices inside the adapter.
+
+- `metabase` fetches cards and dashboards, computes scope, plans
+  card/dashboard work units, and usually does not need `project()` or
+  `finalize()`.
+- `notion` fetches pages, extracts triage signals, clusters page work units,
+  and usually does not need deterministic finalization.
+- `dbt` fetches the repository, parses dbt project metadata, plans model work
+  units, and may later add `project()` if dbt YAML import becomes deterministic.
+- `lookml` fetches LookML, produces validation artifacts, plans model and
+  explore work units, and may later add `project()` for deterministic
+  LookML-to-semantic-layer import.
+- `looker` fetches runtime bundles, fetch reports, target connections, and
+  triage signals. It continues to rely on work-unit diffs and shared gates.
+- `metricflow` is the current strong `project()` example. It imports
+  authoritative semantic models before child worktrees start, then lets any
+  work units observe those projected files.
+- `live-database` can remain work-unit based, but database schema introspection
+  is a good future `project()` candidate because the schema is authoritative
+  structured metadata.
+- `historic-sql` should move current post-processor behavior into the adapter.
+  Local table-usage and pattern-page writes may move into work-unit tools where
+  they are genuinely per-unit. Whole-run maintenance such as stale table usage,
+  pattern-page reuse, and stale/archive page decisions belongs in
+  `HistoricSqlSourceAdapter.finalize()`.
+- `fake` remains a test adapter and does not need deterministic phases.
+
+## Historic-SQL migration
+
+Historic SQL should stop using evidence-only tool output plus runner-level
+post-processing as its durable projection path.
+
+The preferred migration is:
+
+1. Keep historic-SQL work units responsible for source-shaped analysis.
+2. Use source-specific tools for per-unit durable writes when the output is
+   local to that unit, such as a table's usage metadata or one pattern page.
+3. Move whole-run deterministic cleanup into
+   `HistoricSqlSourceAdapter.finalize()`.
+4. Delete `HistoricSqlProjectionPostProcessor`, `IngestBundlePostProcessorPort`,
+   `deps.postProcessors`, and `post_processor` memory-flow/report stages.
+
+If the implementation keeps typed evidence as an internal handoff between
+historic-SQL work units and `finalize()`, that evidence must be framed as
+source-specific input to the adapter's deterministic finalization, not as a
+generic runner post-processing mechanism. The evidence files must not become a
+public compatibility surface.
+
+Historic-SQL finalization must distinguish "no current-run evidence exists"
+from "the current snapshot proves this artifact is stale." Whole-run cleanup
+such as stale table usage, pattern-page staleness, and archive decisions can
+run only when finalization has current-run historic-SQL evidence or an explicit
+override-safe source of equivalent facts.
+
+## Reports and observability
+
+Reports should describe first-class pipeline phases, not historical extension
+points. The isolated-diff summary should include finalization metadata when the
+adapter implements `finalize?()`: whether it ran, finalization commit SHA,
+touched paths, touched semantic-layer sources, changed wiki page keys,
+warnings, descriptive finalization actions, and source-specific result payload.
+
+Saved-memory counts should come from work-unit, reconciliation, and
+finalization memory actions plus touched artifact reporting. Finalization
+actions are reporting/provenance records for writes that already happened in
+the integration worktree; they are not a second write channel. There should be
+no special `postProcessorSavedMemoryCounts` or `postProcessor` report body.
+Memory-flow phases should use `finalization` instead of `post_processor`.
+
+The runner owns provenance for finalization. Adapters return touched artifacts
+and optional descriptive actions, but they do not call the provenance port.
+When finalization actions include valid `rawPaths`, the runner folds them into
+the normal provenance plan using the current `sourceKey`, `syncId`, raw content
+hashes, artifact kind, artifact key, target connection, and action type. The
+finalization phase and commit SHA belong in trace/report metadata; they should
+not be fabricated inside adapter-written files.
+
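The raw-path rule for provenance can be sketched as a filter the runner applies before planning provenance rows. `DescriptiveActionSketch` and `provenanceEligibleActions` are hypothetical names, and the real `MemoryAction` type likely carries more fields. Eviction `rawPaths` are not modeled here.

```typescript
// Hypothetical subset of a descriptive finalization action.
interface DescriptiveActionSketch {
  artifactKey: string;
  actionType: string;
  rawPaths?: string[];
}

// Keeps only actions whose rawPaths all exist in the current raw snapshot.
// Actions without rawPaths remain report-only and create no provenance rows.
function provenanceEligibleActions(
  actions: DescriptiveActionSketch[],
  currentSnapshotPaths: string[],
): DescriptiveActionSketch[] {
  return actions.filter((action) => {
    const paths = action.rawPaths ?? [];
    return paths.length > 0 && paths.every((p) => currentSnapshotPaths.indexOf(p) !== -1);
  });
}
```

Actions that fail the filter can still appear in reports and saved-memory counts; the sketch only decides which ones feed the provenance plan.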
+Traces must make finalization useful for postmortems. At minimum, record
+`finalization_started`, `finalization_committed`, `finalization_skipped`, and
+`finalization_failed` events with source key, touched paths, warnings, and
+error summaries.
+
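A possible shape for these trace events is sketched below. The event names come from this design. The record fields and the `finalizationTrace` helper are assumptions, as is the choice to emit `finalization_skipped` when the adapter has no `finalize()` hook (modeled here as an `undefined` outcome).

```typescript
type FinalizationTraceName =
  | "finalization_started"
  | "finalization_committed"
  | "finalization_skipped"
  | "finalization_failed";

// Hypothetical trace record carrying the minimum fields listed above.
interface FinalizationTraceRecord {
  name: FinalizationTraceName;
  sourceKey: string;
  touchedPaths: string[];
  warnings: string[];
  errorSummaries: string[];
}

interface FinalizationOutcomeSketch {
  touchedPaths: string[];
  warnings: string[];
  errors: string[];
}

function emptyRecord(name: FinalizationTraceName, sourceKey: string): FinalizationTraceRecord {
  return { name, sourceKey, touchedPaths: [], warnings: [], errorSummaries: [] };
}

// Builds the trace events for one finalization phase.
function finalizationTrace(
  sourceKey: string,
  outcome: FinalizationOutcomeSketch | undefined,
): FinalizationTraceRecord[] {
  if (outcome === undefined) {
    return [emptyRecord("finalization_skipped", sourceKey)];
  }
  const failed = outcome.errors.length > 0;
  return [
    emptyRecord("finalization_started", sourceKey),
    {
      name: failed ? "finalization_failed" : "finalization_committed",
      sourceKey,
      touchedPaths: outcome.touchedPaths,
      warnings: outcome.warnings,
      errorSummaries: outcome.errors,
    },
  ];
}
```

For example, `finalizationTrace("historic-sql", undefined)` yields a single `finalization_skipped` record in this sketch, while an outcome with errors yields `finalization_started` followed by `finalization_failed`.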
+## Failure handling
+
+Finalization failures are ingestion failures. If `finalize?()` returns errors,
+throws, writes unauthorized targets, or causes final gates to fail, the runner
+marks the run failed and leaves the main project worktree unchanged.
+
+Finalization should run after reconciliation because it may need to inspect the
+accepted work-unit and reconciliation result. Final gates should run after
+finalization because finalization writes durable project artifacts.
+
+Finalization must not be used to repair arbitrary integration conflicts or
+rerun agent work. Conflict repair remains part of artifact-aware integration and
+reconciliation.
+
+## Acceptance criteria
+
+The implementation is complete when these conditions are true:
+
+- No production runtime wiring references `deps.postProcessors`.
+- `IngestBundlePostProcessorPort` and `HistoricSqlProjectionPostProcessor` are
+  removed from source exports and package export tests.
+- `SourceAdapter.finalize?()` exists with typed context and result objects.
+- The runner invokes `finalize?()` after reconciliation and before final gates.
+- Finalization changes are committed in the integration worktree and included
+  in target-policy checks, final gates, reports, traces, and provenance inputs.
+- Override replay passes explicit override metadata to finalization and leaves
+  `workUnitOutcomes` empty when work units are skipped. Tests prove that
+  historic-SQL finalization does not mark artifacts stale or archive them
+  because current-run evidence is missing.
+- `wiki_sl_ref_repair` remains a runner-owned step after finalization and
+  before final gates, consumes finalization touched sources, and has its writes
+  covered by target-policy checks and final gates.
+- Finalization `actions` are not re-applied by the runner; they are included
+  only in reporting, saved-memory counts, and provenance planning when their
+  raw-path attribution is valid.
+- Historic SQL uses adapter-owned finalization for whole-run projection
+  maintenance.
+- Tests cover a successful finalization, a finalization failure, unauthorized
+  finalization target rejection, override replay finalization behavior,
+  wiki-SL-ref repair placement, and historic-SQL projection behavior without
+  runner-level post-processors.