ktx/python/ktx-sl/AGENTS.md
2026-05-25 11:09:33 -04:00

6.6 KiB

Semantic Layer Engine

Python semantic layer that generates SQL from structured JSON queries. No from clause - sources are inferred from fully-qualified field names (source.column).

Quick Start

uv run pytest -q                    # run all tests
uv run python -m semantic_layer.cli --help

Testing Corner Cases via CLI

Use --model to pass a self-contained YAML model (list of source definitions) instead of a directory. This lets you test any join topology or edge case without creating files.

1. Create an inline model file

# /tmp/model.yaml - a YAML list of source definitions
- name: orders
  table: public.orders
  grain: [id]
  columns:
    - {name: id, type: number}
    - {name: amount, type: number}
    - {name: status, type: string}
  joins:
    - to: customers
      "on": "customer_id = customers.id"
      relationship: many_to_one
  measures:
    - {name: revenue, expr: "sum(amount)", filter: "status != 'refunded'"}

- name: customers
  table: public.customers
  grain: [id]
  columns:
    - {name: id, type: number}
    - {name: segment, type: string}

2. Run queries against it

# Basic query
uv run python -m semantic_layer.cli --model /tmp/model.yaml \
  -q '{"measures":["sum(orders.amount)"],"dimensions":["customers.segment"]}'

# Pre-defined measure + filter
uv run python -m semantic_layer.cli --model /tmp/model.yaml \
  -q '{"measures":["orders.revenue"],"dimensions":["orders.status"],"filters":["orders.status != '"'"'cancelled'"'"'"]}'

# Show resolved plan alongside SQL
uv run python -m semantic_layer.cli --model /tmp/model.yaml \
  -q '{"measures":["orders.revenue"],"dimensions":["customers.segment"]}' --plan

# Validate without generating SQL
uv run python -m semantic_layer.cli --model /tmp/model.yaml \
  -q '{"measures":["orders.revenue"],"dimensions":["customers.segment"]}' --suggest

3. Test fanout / chasm traps

Add multiple measure sources that fan out from a shared dimension hub:

# Two independent fact tables joining to the same dimension
- name: hub
  table: public.hub
  grain: [id]
  columns: [{name: id, type: number}, {name: segment, type: string}]

- name: fact_a
  table: public.fact_a
  grain: [id]
  columns: [{name: id, type: number}, {name: hub_id, type: number}, {name: val, type: number}]
  joins: [{to: hub, "on": "hub_id = hub.id", relationship: many_to_one}]

- name: fact_b
  table: public.fact_b
  grain: [id]
  columns: [{name: id, type: number}, {name: hub_id, type: number}, {name: val, type: number}]
  joins: [{to: hub, "on": "hub_id = hub.id", relationship: many_to_one}]
# This triggers aggregate locality (separate CTEs per fact table, FULL JOIN)
uv run python -m semantic_layer.cli --model /tmp/chasm.yaml \
  -q '{"measures":["sum(fact_a.val)","sum(fact_b.val)"],"dimensions":["hub.segment"]}'

4. Test derived measures

uv run python -m semantic_layer.cli --model /tmp/model.yaml \
  -q '{"measures":[{"expr":"sum(orders.amount)","name":"total"},{"expr":"count(orders.id)","name":"cnt"},{"expr":"total / cnt","name":"avg_order"}],"dimensions":["customers.segment"]}'

5. Test dialects

uv run python -m semantic_layer.cli --model /tmp/model.yaml \
  -q '{"measures":["sum(orders.amount)"],"dimensions":["customers.segment"]}' --dialect bigquery

6. Useful flags

Flag Purpose
--model FILE Single YAML file with all sources (alternative to --sources DIR)
--plan Show resolved plan + SQL
--plan-only Show plan without SQL
--suggest Validate query, show suggestions on failure
--list-sources Print all sources, columns, measures, joins
--dialect X postgres (default), bigquery, snowflake, duckdb, mysql
--compact SQL without header comment
-q JSON Pass query as JSON string
--json Read JSON query from stdin

Coding Guidelines

Expression handling - always use sqlglot AST, never regex on SQL

  • Parse expressions with sqlglot.parse_one(f"SELECT {expr}") and walk/transform the AST. Never use str.replace(), re.sub(), or string splitting on SQL fragments - these corrupt string literals, aliases, and nested expressions.
  • Quote reserved words first: always call quote_reserved_identifiers(expr) before passing to sqlglot.parse_one(). Column/source names like group, key, order will fail to parse otherwise.
  • Use the parse cache in parser.py (ExpressionParser._parse_as_select()) for read-only AST walks. Direct sqlglot.parse_one() calls are fine when you need to .transform() the tree.
  • Regex is fine for non-SQL tasks: sanitizing alias names, masking string literals before parse, etc. The rule is: don't use regex to interpret SQL structure.

Error handling

  • Never use bare except Exception: pass. At minimum add logger.debug(...) so failures are observable. Prefer catching sqlglot.errors.ParseError specifically.
  • Regex fallback paths in generator.py exist for edge cases where sqlglot can't parse user-provided SQL sources. These are acceptable as last-resort fallbacks with logging, not as primary code paths.

SQL generation strategy

  • Write postgres, transpile on output. All SQL is generated as postgres dialect. _transpile() converts to the target dialect at the very end. Never add dialect-specific SQL generation logic.
  • f-strings for SQL skeleton (SELECT/FROM/JOIN/GROUP BY) are fine and readable. Use sqlglot AST only for expression-level transformations (substitution, function translation, filter rewriting).
  • Don't build SQL via sqlglot node construction (exp.Select().from_(...)). It's harder to read and debug than f-strings for structural SQL.

Testing

  • Run uv run pytest -q after every change. All tests must pass.
  • Test CLI queries with --model /tmp/model.yaml for quick iteration on edge cases (see examples above).
  • When adding expression handling logic, test with reserved-word identifiers (group.key, order.select) and string literals containing dots (status = 'group.value').

Project Structure

semantic_layer/
  models.py      # Pydantic data models (sources, queries, plans, results)
  loader.py      # YAML source file loader
  graph.py       # Bidirectional join graph with Dijkstra + Steiner tree
  parser.py      # Expression parser (source refs, aggregate detection)
  planner.py     # 12-step query planning pipeline
  generator.py   # SQL generation (simple path + aggregate locality)
  engine.py      # Orchestrator tying loader/graph/planner/generator
  cli.py         # CLI entry point
sources/
  ecommerce/     # Test fixtures (6 YAML source definitions)
tests/           # 353 tests