fix(context): merge overlay columns onto manifest columns by name (#94)

* fix(context): merge overlay columns onto manifest columns by name

  composeOverlay was appending overlay columns to the manifest column list, producing duplicate entries when dbt/metabase overlays declared a column just to attach descriptions. The duplicates carried no `type`, so the pydantic SourceDefinition rejected them at semantic-query time and broke `ktx sl query` for every overlay-backed measure.

  Now overlay columns match base columns by name (case-insensitive): same-name entries merge onto the manifest (overlay fields win, type/role fall back to the base, descriptions merge per source key) and only new names append.

* refactor(sl): split overlay columns from column_overrides and enforce TS/Python wire contract

  Overlay sources now have two distinct collections: `columns:` for computed columns (requiring `expr` + `type`) and `column_overrides:` for metadata patches to inherited manifest columns. Composing or loading an overlay that mixes the two, or references an unknown column, fails with a typed error.

  Introduce `ResolvedSemanticLayerSource` / `resolvedSourceSchema` / `toResolvedWire` as the strict shape sent to the Python engine, and add a schema contract test that diffs Zod against the Pydantic JSON schema dumped by `python -m semantic_layer dump-schema`. `SourceDefinition` is now `extra="forbid"` on the Python side.

  `loadAllSources` surfaces per-file load errors instead of swallowing them, so validation/query paths can report manifest shard parse failures.

* fix(context): make scan description generation resilient and quiet

  A transient sampleTable failure during ingest used to take out every table in a connection: generateTableDescription returned a hardcoded 'Table not found' string into descriptions.ai, and KtxDescriptionGenerator was constructed without a logger, so the failure left no trail anywhere.

  - sampleTable / sampleColumn calls retry 3x with 200/400/800ms backoff, honouring KtxScanContext.signal via a new KtxAbortedError.
  - On retry exhaustion or missing capability, table generation falls back to a metadata-only prompt built from column name / native type / comment / rawDescriptions. The column path follows the same rule: call the LLM when any of samples or rawDescriptions are available; skip only when both are absent.
  - Logger is now threaded from KtxScanContext into the generator. Failures emit structured KtxScanWarning entries (new description_fallback_used code, plus existing sampling_failed / enrichment_failed / connector_capability_missing). ktx scan groups warnings by code so a batch of identical failures collapses to one summary line plus sample.
  - Returns null on failure instead of the 'Table not found' sentinel; the manifest writer's existing guard already skips empty descriptions, so schema YAML no longer carries misleading text. SCAN_MANAGED_DESCRIPTION_KEYS already strips stale 'ai' on merge, so existing YAML clears on next run.

  Also suppress AI SDK v6 'system in messages' warning: pull system messages out of KtxMessageBuilder.wrapSimple's output via a new splitKtxSystemMessages helper and pass them top-level to generateText (preserves cacheControl providerOptions on the SystemModelMessage). Agent-runner's local splitSystemPromptMessages dedupes onto the shared helper.

* test(docs): align examples-docs assertions with revamped docs

  PR #103 (setup/guide doc revamp) reworded several CLI examples and connection labels; the assertions in scripts/examples-docs.test.mjs still referenced the pre-revamp wording and were failing in CI on main.

  Update the regexes to match the post-revamp content:

  - drop the `--json` flag from the sl-query example expectation
  - move the `Driver:` / `Status: ok` probe to the connection reference, which is where that output now lives (driver id is lowercase `postgres`, not the display name `PostgreSQL`)
  - drop the obsolete `Install \`uv\`...` troubleshooting line
  - accept `<connectionId>` everywhere; the docs no longer use the hyphenated `<connection-id>` form
  - match the `warehouse` connection id used in the quickstart instead of the `postgres-warehouse` id only used in the README and setup ref

* fix(sl): skip TS/Python schema contract test when uv is unavailable

  The TypeScript checks CI job does not install uv or Python, so the module-level `execFileSync('uv', ...)` in schemas.contract.test.ts threw ENOENT and failed the suite. Wrap the schema dump in a try/catch and guard the describe block with `describe.skipIf` so the test skips in environments without uv. Local dev and any CI job that has uv on PATH still runs the cross-language contract assertion.
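The retry policy described for sampleTable / sampleColumn can be illustrated with a minimal Python sketch. The real implementation lives on the TypeScript side and is not part of this diff, so everything here is an assumption: `retry_with_backoff`, `AbortedError` (a stand-in for KtxAbortedError), and the reading of "retry 3x" as one initial attempt plus three retries, with waits of 200, 400 and 800 ms between them. Returning `None` on exhaustion mirrors the "returns null on failure" behaviour, which lets the caller fall back to a metadata-only prompt.

```python
import threading
import time
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


class AbortedError(Exception):
    """Hypothetical stand-in for KtxAbortedError."""


def retry_with_backoff(
    call: Callable[[], T],
    delays_ms: Sequence[int] = (200, 400, 800),
    abort: Optional[threading.Event] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[T]:
    """Run `call`, retrying once per entry in `delays_ms`.

    Returns the first successful result, or None when every attempt fails.
    Raises AbortedError if `abort` is set before an attempt starts.
    """
    attempts = len(delays_ms) + 1  # assumed: initial try plus one retry per delay
    for attempt in range(attempts):
        if abort is not None and abort.is_set():
            raise AbortedError("scan aborted")
        try:
            return call()
        except Exception:
            # Wait only if another attempt follows.
            if attempt < len(delays_ms):
                sleep(delays_ms[attempt] / 1000)
    return None
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting; a caller that receives `None` would then build the metadata-only prompt instead of writing a sentinel string.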
2026-07-25 12:01:03 +02:00 · 2026-05-15 02:11:04 +02:00 · 2026-05-15 02:11:04 +02:00 · cb8902f1e5
commit cb8902f1e5
parent 6bc8d200ea
56 changed files with 1650 additions and 237 deletions
--- a/python/ktx-sl/semantic_layer/__main__.py
+++ b/python/ktx-sl/semantic_layer/__main__.py
@@ -1,3 +1,22 @@
-from semantic_layer.cli import main
+from __future__ import annotations

-main()
+import json
+import sys
+
+from semantic_layer.cli import main as cli_main
+from semantic_layer.models import SourceDefinition
+
+
+def dump_schema() -> None:
+    json.dump(
+        SourceDefinition.model_json_schema(), sys.stdout, indent=2, sort_keys=True
+    )
+    sys.stdout.write("\n")
+
+
+if __name__ == "__main__":
+    if len(sys.argv) > 1 and sys.argv[1] in {"dump-schema", "schema"}:
+        sys.argv.pop(1)
+        dump_schema()
+    else:
+        cli_main()
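One way a consumer of the dumped schema might check field parity is sketched below. This is not the contract test from the commit (that test is TypeScript and is not shown here); `diff_field_names`, `SAMPLE_DUMP`, and the choice to compare only top-level property names are assumptions made for illustration. The sample is a hand-written fragment, not real output of `python -m semantic_layer dump-schema`.

```python
import json

# Hand-written fragment shaped like a JSON Schema object; the real dump
# would contain many more fields and nested definitions.
SAMPLE_DUMP = """
{
  "additionalProperties": false,
  "properties": {"name": {}, "table": {}, "sql": {}, "columns": {}},
  "title": "SourceDefinition",
  "type": "object"
}
"""


def diff_field_names(json_schema: dict, ts_fields: set) -> dict:
    """Compare top-level property names against a set of TS-side field names."""
    py_fields = set(json_schema.get("properties", {}))
    return {
        "missing_in_ts": sorted(py_fields - ts_fields),
        "missing_in_python": sorted(ts_fields - py_fields),
    }


if __name__ == "__main__":
    schema = json.loads(SAMPLE_DUMP)
    # Hypothetical TS-side field list, for illustration only.
    print(diff_field_names(schema, {"name", "table", "columns", "column_overrides"}))
```

With `extra="forbid"` on the Python model, a field that exists only on the sending side would be rejected at validation time, so a parity check of this kind is meant to catch the drift before a query is issued.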
--- a/python/ktx-sl/semantic_layer/loader.py
+++ b/python/ktx-sl/semantic_layer/loader.py
@@ -87,18 +87,23 @@ class SourceLoader:
                sources[name] = SourceDefinition(**data)
            else:
                # Overlay — validate and compose with matching manifest entry
-                errors = validate_overlay(data)
-                if errors:
-                    raise ValueError(
-                        f"Invalid overlay '{name}' in {path}: {'; '.join(errors)}"
-                    )
                base = sources.get(name)
                if base:
+                    errors = validate_overlay(data, {c.name for c in base.columns})
+                    if errors:
+                        raise ValueError(
+                            f"Invalid overlay '{name}' in {path}: {'; '.join(errors)}"
+                        )
                    (
                        sources[name],
                        description_sources[name],
                    ) = self._compose(base, data, description_sources.get(name))
                else:
+                    errors = validate_overlay(data)
+                    if errors:
+                        raise ValueError(
+                            f"Invalid overlay '{name}' in {path}: {'; '.join(errors)}"
+                        )
                    logger.warning(
                        "Orphan overlay '%s' in %s: no matching manifest entry, skipping",
                        name,
@@ -149,12 +154,55 @@ class SourceLoader:
                description_sources or None,
            )

-        # Filter columns
+        excluded = set(overlay.get("exclude_columns", []))
+        overrides = overlay.get("column_overrides", [])
+        override_names = {override.get("name") for override in overrides}
+        conflicts = sorted(name for name in override_names if name in excluded)
+        if conflicts:
+            raise ValueError(
+                "column_overrides conflict with exclude_columns: "
+                + ", ".join(conflicts)
+            )
+
+        base_by_name = {column.name: column for column in base.columns}
+
+        for override in overrides:
+            name = override.get("name")
+            base_column = base_by_name.get(name)
+            if base_column is None:
+                raise ValueError(
+                    f"column '{name}' in column_overrides does not exist on manifest source '{base.name}'"
+                )
+
        excluded = set(overlay.get("exclude_columns", []))
        source.columns = [c for c in source.columns if c.name not in excluded]

-        # Append computed columns (overlay columns with expr)
+        columns_by_name = {column.name: column for column in source.columns}
+
+        for override in overrides:
+            name = override["name"]
+            base_column = base_by_name[name]
+            merged = base_column.model_dump(mode="python", exclude_none=True)
+            base_descriptions = merged.get("descriptions") or {}
+            override_data = dict(override)
+            override_descriptions = override_data.get("descriptions") or {}
+            merged.update(override_data)
+            if base_descriptions or override_descriptions:
+                merged["descriptions"] = {
+                    **base_descriptions,
+                    **override_descriptions,
+                }
+            columns_by_name[name] = SourceColumn(**merged)
+        source.columns = list(columns_by_name.values())
+
+        # Append computed columns. Manifest column names cannot be reused here;
+        # use column_overrides for metadata patches.
        for col in overlay.get("columns", []):
+            name = col.get("name")
+            if name in base_by_name:
+                raise ValueError(
+                    f"column '{name}' in columns patches a manifest column on '{base.name}' — move it to 'column_overrides:'"
+                )
            source.columns.append(SourceColumn(**col))

        # Set measures
@@ -181,6 +229,11 @@ class SourceLoader:
        ]
        source.joins = manifest_joins + new_joins

+        if not source.table and not source.sql:
+            raise ValueError("resolved source must have 'table' or 'sql'")
+        if source.table and source.sql:
+            raise ValueError("'table' and 'sql' are mutually exclusive")
+
        return source, (description_sources or None)

    def _validate_cross_references(self, sources: dict[str, SourceDefinition]) -> None:
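The column handling in the `_compose` changes above can be condensed into a standalone sketch that uses plain dicts instead of the `SourceColumn` model. `compose_columns` is a hypothetical helper written for illustration, not a function in the loader; it matches names exactly, as the dict lookups in the diff do, and shortens the error messages.

```python
def compose_columns(base_columns: list, overlay: dict) -> list:
    """Apply exclude_columns, column_overrides and computed columns, in that order."""
    excluded = set(overlay.get("exclude_columns", []))
    overrides = overlay.get("column_overrides", [])
    base_by_name = {col["name"]: col for col in base_columns}

    # An override for a column that is also excluded is contradictory.
    conflicts = sorted(o["name"] for o in overrides if o["name"] in excluded)
    if conflicts:
        raise ValueError(
            "column_overrides conflict with exclude_columns: " + ", ".join(conflicts)
        )

    # Start from the manifest columns that survive exclusion (order preserved).
    merged_by_name = {
        name: dict(col) for name, col in base_by_name.items() if name not in excluded
    }

    for override in overrides:
        name = override["name"]
        if name not in base_by_name:
            raise ValueError(f"column '{name}' in column_overrides does not exist")
        base = base_by_name[name]
        merged = {**base, **override}  # override fields win
        # Descriptions merge per source key rather than being replaced wholesale.
        descriptions = {
            **(base.get("descriptions") or {}),
            **(override.get("descriptions") or {}),
        }
        if descriptions:
            merged["descriptions"] = descriptions
        merged_by_name[name] = merged

    result = list(merged_by_name.values())
    for col in overlay.get("columns", []):
        # Computed columns may not reuse a manifest column name.
        if col["name"] in base_by_name:
            raise ValueError(f"column '{col['name']}' belongs in 'column_overrides:'")
        result.append(dict(col))
    return result
```

Because the override replaces the entry under its existing key, the patched column keeps its manifest position and its inherited `type`, and computed columns land at the end of the list.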
--- a/python/ktx-sl/semantic_layer/manifest.py
+++ b/python/ktx-sl/semantic_layer/manifest.py
@@ -143,7 +143,9 @@ class Manifest(BaseModel):
 # ── Projection ──────────────────────────────────────────────────────


-def validate_overlay(data: dict) -> list[str]:
+def validate_overlay(
+    data: dict, manifest_column_names: set[str] | None = None
+) -> list[str]:
    """Validate that overlay data doesn't contain structural fields.

    Returns a list of error messages (empty if valid).
@@ -162,11 +164,26 @@ def validate_overlay(data: dict) -> list[str]:
            errors.append(
                f"Overlay column '{col.get('name', '?')}' must use 'descriptions'"
            )
-        if "type" in col and "expr" not in col:
+        if "expr" not in col:
            errors.append(
-                f"Overlay column '{col.get('name', '?')}' specifies 'type' without 'expr' "
-                f"(structural types are inherited from manifest — only computed columns may specify a type)"
+                f"Overlay column '{col.get('name', '?')}' in 'columns' must define "
+                f"'expr' and 'type' (use 'column_overrides' to patch manifest columns)"
            )
+        if "type" not in col:
+            errors.append(
+                f"Overlay column '{col.get('name', '?')}' in 'columns' must define "
+                f"'type' and 'expr' (use 'column_overrides' to patch manifest columns)"
+            )
+    for col in data.get("column_overrides", []):
+        name = col.get("name", "?")
+        if "description" in col:
+            errors.append(f"Column override '{name}' must use 'descriptions'")
+        if "type" in col:
+            errors.append(f"Column override '{name}' must not contain 'type'")
+        if "expr" in col:
+            errors.append(f"Column override '{name}' must not contain 'expr'")
+        if manifest_column_names is not None and name not in manifest_column_names:
+            errors.append(f"Column override '{name}' does not match a manifest column")
    return errors


--- a/python/ktx-sl/semantic_layer/models.py
+++ b/python/ktx-sl/semantic_layer/models.py
@@ -3,7 +3,7 @@ from __future__ import annotations
 from enum import Enum
 from typing import Any, Literal

-from pydantic import BaseModel, Field, model_validator
+from pydantic import BaseModel, ConfigDict, Field, model_validator


 # ── Source Definition Models ──────────────────────────────────────────
@@ -105,6 +105,8 @@ class DefaultTimeDimensionDbt(BaseModel):


 class SourceDefinition(BaseModel):
+    model_config = ConfigDict(extra="forbid")
+
    name: str
    description: str | None = None
    descriptions: dict[str, str] | None = None
@@ -123,6 +125,8 @@ class SourceDefinition(BaseModel):
    def validate_source(self) -> SourceDefinition:
        if self.description is None:
            self.description = _resolve_description_map(self.descriptions)
+        if not self.table and not self.sql:
+            raise ValueError("resolved source must have 'table' or 'sql'")
        if self.table and self.sql:
            raise ValueError("'table' and 'sql' are mutually exclusive")
        if not self.grain:
--- a/python/ktx-sl/tests/test_loader.py
+++ b/python/ktx-sl/tests/test_loader.py
@@ -148,11 +148,21 @@ class TestLoaderEdgeCases:
            with open(Path(tmpdir) / "test.yaml", "w") as f:
                yaml.dump(data, f)
            loader = SourceLoader(tmpdir)
-            try:
-                sources = loader.load_all()
-                assert "test" in sources
-            except Exception:
-                pass
+            with pytest.raises(Exception, match="unknown_field"):
+                loader.load_all()
+
+    def test_source_requires_table_or_sql(self):
+        with tempfile.TemporaryDirectory() as tmpdir:
+            data = {
+                "name": "test",
+                "grain": ["id"],
+                "columns": [{"name": "id", "type": "number"}],
+            }
+            with open(Path(tmpdir) / "test.yaml", "w") as f:
+                yaml.dump(data, f)
+            loader = SourceLoader(tmpdir)
+            with pytest.raises(Exception, match="table.*sql"):
+                loader.load_file(Path(tmpdir) / "test.yaml")

    def test_subdirectory_sources(self):
        with tempfile.TemporaryDirectory() as tmpdir:
--- a/python/ktx-sl/tests/test_manifest.py
+++ b/python/ktx-sl/tests/test_manifest.py
@@ -205,12 +205,15 @@ class TestValidateOverlay:
            "descriptions": {"user": "Revenue-bearing orders"},
            "grain": ["id"],
            "measures": [{"name": "revenue", "expr": "sum(total)"}],
+            "column_overrides": [
+                {"name": "status", "descriptions": {"user": "Order lifecycle status"}}
+            ],
            "columns": [
                {"name": "is_high_value", "expr": "total > 1000", "type": "boolean"}
            ],
            "exclude_columns": ["status"],
        }
-        errors = validate_overlay(data)
+        errors = validate_overlay(data, {"status", "total"})
        assert errors == []

    def test_validate_overlay_rejects_table(self):
@@ -225,14 +228,13 @@ class TestValidateOverlay:
        assert len(errors) == 1
        assert "sql" in errors[0].lower()

-    def test_validate_overlay_rejects_type_without_expr(self):
+    def test_validate_overlay_rejects_column_without_expr(self):
        data = {
            "name": "orders",
            "columns": [{"name": "status", "type": "string"}],
        }
        errors = validate_overlay(data)
        assert len(errors) == 1
-        assert "type" in errors[0].lower()
        assert "expr" in errors[0].lower()

    def test_validate_overlay_allows_type_with_expr(self):
@@ -243,6 +245,33 @@ class TestValidateOverlay:
        errors = validate_overlay(data)
        assert errors == []

+    def test_validate_overlay_rejects_column_override_structural_fields(self):
+        data = {
+            "name": "orders",
+            "column_overrides": [
+                {
+                    "name": "status",
+                    "description": "Status",
+                    "type": "string",
+                    "expr": "status",
+                }
+            ],
+        }
+        errors = validate_overlay(data, {"status"})
+        assert len(errors) == 3
+        assert "descriptions" in errors[0]
+        assert "type" in errors[1]
+        assert "expr" in errors[2]
+
+    def test_validate_overlay_rejects_unknown_column_override(self):
+        data = {
+            "name": "orders",
+            "column_overrides": [{"name": "missing", "descriptions": {"user": "Nope"}}],
+        }
+        errors = validate_overlay(data, {"status"})
+        assert len(errors) == 1
+        assert "does not match" in errors[0]
+

 # ── Two-Tier Loading Tests ─────────────────────────────────────────

@@ -502,6 +531,77 @@ class TestTwoTierLoading:
        assert hv.expr == "total > 1000"
        assert hv.type == "boolean"

+    def test_overlay_column_overrides_patch_manifest_columns(self, tmp_path: Path):
+        schema_dir = tmp_path / "_schema"
+        _write_yaml(schema_dir / "public.yaml", _manifest_tables())
+
+        overlay = {
+            "name": "orders",
+            "column_overrides": [
+                {"name": "status", "descriptions": {"user": "Order lifecycle status"}}
+            ],
+        }
+        _write_yaml(tmp_path / "orders.yaml", overlay)
+        _write_yaml(tmp_path / "customers.yaml", {"name": "customers"})
+
+        loader = SourceLoader(tmp_path)
+        sources = loader.load_all()
+
+        status = next(c for c in sources["orders"].columns if c.name == "status")
+        assert status.type == "string"
+        assert status.description == "Order lifecycle status"
+        assert status.descriptions == {"user": "Order lifecycle status"}
+
+    def test_overlay_rejects_unknown_column_override(self, tmp_path: Path):
+        schema_dir = tmp_path / "_schema"
+        _write_yaml(schema_dir / "public.yaml", _manifest_tables())
+
+        overlay = {
+            "name": "orders",
+            "column_overrides": [
+                {"name": "missing", "descriptions": {"user": "No such column"}}
+            ],
+        }
+        _write_yaml(tmp_path / "orders.yaml", overlay)
+        _write_yaml(tmp_path / "customers.yaml", {"name": "customers"})
+
+        loader = SourceLoader(tmp_path)
+        with pytest.raises(ValueError, match="Column override 'missing'"):
+            loader.load_all()
+
+    def test_overlay_rejects_computed_column_name_collision(self, tmp_path: Path):
+        schema_dir = tmp_path / "_schema"
+        _write_yaml(schema_dir / "public.yaml", _manifest_tables())
+
+        overlay = {
+            "name": "orders",
+            "columns": [{"name": "status", "type": "string", "expr": "status"}],
+        }
+        _write_yaml(tmp_path / "orders.yaml", overlay)
+        _write_yaml(tmp_path / "customers.yaml", {"name": "customers"})
+
+        loader = SourceLoader(tmp_path)
+        with pytest.raises(ValueError, match="move it to 'column_overrides:'"):
+            loader.load_all()
+
+    def test_overlay_rejects_exclude_override_conflict(self, tmp_path: Path):
+        schema_dir = tmp_path / "_schema"
+        _write_yaml(schema_dir / "public.yaml", _manifest_tables())
+
+        overlay = {
+            "name": "orders",
+            "exclude_columns": ["status"],
+            "column_overrides": [
+                {"name": "status", "descriptions": {"user": "Hidden status"}}
+            ],
+        }
+        _write_yaml(tmp_path / "orders.yaml", overlay)
+        _write_yaml(tmp_path / "customers.yaml", {"name": "customers"})
+
+        loader = SourceLoader(tmp_path)
+        with pytest.raises(ValueError, match="conflict with exclude_columns"):
+            loader.load_all()
+
    def test_overlay_measures_set(self, tmp_path: Path):
        schema_dir = tmp_path / "_schema"
        _write_yaml(schema_dir / "public.yaml", _manifest_tables())