ktx/docs-site/content/docs/concepts/cross-database-federation.mdx
Luca Martial a651b82e2f
feat: query_policy semantic-layer-only restricts agents to predefined semantic-layer measures (#334)
* feat(sl): add predefined_measures_only guard to semantic query planning

SemanticQuery gains a predefined_measures_only flag; the planner rejects
any measure resolved with Provenance.COMPOSED (runtime aggregate
expressions and query-time derivations) while predefined measures,
predefined derived chains, dimensions, filters, and segments pass.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* feat(config): add per-connection query_policy to warehouse connections

query_policy: semantic-layer-only | read-only-sql (default) on the
warehouse connection schema, plus a policy module with the raw-SQL
guard, federated member restriction lookup, and the project-level
predicate used to gate sql_execution registration.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* feat(cli): enforce query_policy on raw SQL through one shared executor

ktx sql and the MCP sql_execution tool now share executeProjectRawSql
(resolve, policy check, read-only validation, execute), collapsing
their duplicated validate-then-execute paths. Restricted connections
are rejected before validation; federated raw SQL is rejected when any
member is restricted. sql_execution is not registered when every SQL
connection is restricted, and connection_list marks restricted
connections so agents route to sl_query. executeProjectReadOnlySql
stays generic for ktx-internal SQL (scan, ingest, SL-generated).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* feat(sl): compile queries with predefined_measures_only from query_policy

compileLocalSlQuery injects the flag from the connection's query_policy,
never from caller input, covering both ktx sl query and the MCP
sl_query tool through the daemon compile path.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* docs: document query_policy semantic-layer-only

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* fix(sl): close semantic-layer-only bypasses via filters and federated hint

The predefined_measures_only guard only inspected query.measures, so a
composed aggregate written into `filters` slipped through _classify_filters
into a HAVING clause untouched — letting a restricted agent evaluate
arbitrary aggregates (e.g. threshold-probing `sum(x) BETWEEN a AND b`).
Reject filter clauses that compose an aggregate function; a HAVING that
compares a predefined measure by name (`orders.revenue > 100`) still works.

Also make the federated sl_query error policy-aware: when a member is
restricted, raw federated SQL is disabled too, so stop directing the agent
to `ktx sql -c _ktx_federated` / sql_execution (a guaranteed failure) and
point to per-connection semantic-layer queries instead.

---------

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
Co-authored-by: Andrey Avtomonov <andreybavt@gmail.com>
2026-07-03 08:54:17 +00:00

161 lines
7.3 KiB
Text

---
title: Cross-database federation
description: How ktx federates postgres, mysql, sqlite, and duckdb connections so a single read-only SQL query can join across them without copying data.
---
Cross-database federation lets a single read-only SQL query join tables that
live in different databases. **ktx** achieves this by embedding DuckDB and
using its `ATTACH` mechanism to connect each member database read-only. The
join executes inside DuckDB at query time — live data, no ETL, no copy.
You run federated queries as raw SQL against the `_ktx_federated` connection
(see [Querying the federated connection
directly](#querying-the-federated-connection-directly)). Semantic-layer queries
(`ktx sl query` / the `sl_query` tool) stay per-connection; pointing one at
`_ktx_federated` returns an error telling you to use raw SQL instead.
Federation activates automatically when a `ktx.yaml` file declares two or more
attach-compatible connections. There is nothing to configure and no federation
block to add. With zero or one compatible connection the behavior is unchanged.
## Which connections participate
The v1 federation engine supports four drivers:
| Driver | Participates in federation |
|--------|---------------------------|
| `postgres` | Yes |
| `mysql` | Yes |
| `sqlite` | Yes |
| `duckdb` | Yes |
| `snowflake` | No — standalone connection |
| `bigquery` | No — standalone connection |
| `clickhouse` | No — standalone connection |
| `sqlserver` | No — standalone connection |
Non-participating connections continue to work exactly as they did. They are
queried independently; they do not appear as federation members.
## How it activates
**ktx** inspects the connections in `ktx.yaml` at startup. When it finds two or
more connections whose driver is `postgres`, `mysql`, `sqlite`, or `duckdb`, it
instantiates the DuckDB federation engine and attaches each one read-only.
There is no `federation:` key, no opt-in flag, and no connection-level setting
to enable. The engine is derived entirely from what is already declared.
A minimal `ktx.yaml` that triggers federation:
```yaml
connections:
pg_books:
driver: postgres
url: "postgres://user:pass@localhost:5432/books" # pragma: allowlist secret
sqlite_reviews:
driver: sqlite
path: ./data/reviews.db
```
Two attach-compatible connections are present, so federation is active.
## Table naming in federated queries
Inside a federated query, postgres and mysql tables use a three-part name:
`connectionId.schema.table`. SQLite and DuckDB tables use the two-part form
`connectionId.table`, since ktx addresses both as single-namespace members. In
both cases the connection's `id` field in `ktx.yaml` becomes the catalog name
inside DuckDB.
If a connection `id` is not a bare SQL identifier — for example it contains a
hyphen, like `books-db` — double-quote it in the query the same way DuckDB
quotes any identifier: `"books-db".public.books`. Writing it unquoted
(`books-db.public.books`) is a SQL syntax error, not a federation feature.
For the example above:
- `pg_books.public.books` — the `books` table in the `public` schema of the
postgres connection
- `sqlite_reviews.reviews` — the `reviews` table in the sqlite connection
These fully qualified names are what you write when you query the federated
connection with raw SQL (see [Querying the federated connection
directly](#querying-the-federated-connection-directly)). A source file's own
`table:` field is not prefixed this way — see [Source files keep member-native
table refs](#source-files-keep-member-native-table-refs) below.
## Source names in the federated view
When you list or search semantic-layer sources under the federated connection,
each source's `name` is prefixed with its member connection id — for example
`pg_books.books` and `sqlite_reviews.reviews`. The prefix keeps names unique
when two members own a table with the same name: a `users` table in each of
`pg_app` and `sqlite_app` surfaces as `pg_app.users` and `sqlite_app.users`
rather than colliding on a bare `users`.
## Source files keep member-native table refs
A source file's physical `table:` field is not prefixed with the connection id.
It stays the member-native reference the connector uses on its own —
`public.books` for the postgres member, `reviews` for the sqlite member —
because the same file backs a per-connection semantic-layer query against that
member, which runs on the member's own driver where a `pg_books.` catalog prefix
would point at a database that does not exist. The connection-id prefix is a
DuckDB catalog name that appears only in raw federated SQL; the member prefix on
the source `name` (above) is independent of it.
## Cross-database joins
Write a cross-database join as raw SQL against `_ktx_federated` — see
[Querying the federated connection
directly](#querying-the-federated-connection-directly) below for a runnable
example. DuckDB attaches both members and resolves the join live at query time.
Declaring the join in a source file's `joins:` block is not supported yet. The
semantic layer plans each connection on its own, so a `joins:` entry whose `to:`
points at a table in another member is not resolved across the federation
boundary. Until that lands, express cross-database joins as raw SQL.
## Querying the federated connection directly
The federated connection is addressable by its id,
`_ktx_federated`, anywhere **ktx** runs read-only SQL. The same id works for the
`ktx sql` command and for a data agent calling the `sql_execution` MCP tool, so
both surfaces can run a cross-database query without a source file:
```bash
ktx sql -c _ktx_federated \
"SELECT b.title, avg(r.rating) AS avg_rating
FROM pg_books.public.books b
JOIN sqlite_reviews.reviews r ON b.id = r.book_id
GROUP BY b.title"
```
Table names follow the rules from
[Table naming in federated queries](#table-naming-in-federated-queries):
three-part `connectionId.schema.table` for postgres and mysql, two-part
`connectionId.table` for sqlite and duckdb. The `_ktx_federated` id is virtual —
it is never written to `ktx.yaml` and only exists when two or more attach-compatible
connections are declared. It surfaces in `ktx connection` and in the agent's
connection list so the id is discoverable. Querying a single member database
directly with its own connection id (`ktx sql -c pg_books ...`) is unchanged.
If any member connection sets
[`query_policy: semantic-layer-only`](/docs/configuration/ktx-yaml#query-policy),
raw SQL against `_ktx_federated` is rejected as a whole: a federated query can
touch any member's tables, so one restricted member restricts the federation.
## Federated queries are read-only
DuckDB attaches every member database with read-only access. Federated queries
are `SELECT`/`WITH` only. No writes, no DDL, and no mutations reach any member
database through the federation engine.
## Current limitations
- **Raw SQL joins only.** Cross-database joins are written as raw SQL; declaring
them in a source's `joins:` block and automatic discovery of cross-database
relationships are not available yet. Intra-database relationship discovery for
each member connection is unchanged.
- **postgres, mysql, sqlite, and duckdb only.** Other drivers (snowflake,
bigquery, clickhouse, sqlserver) do not participate in federation in this
version. They remain usable as standalone connections.