apunkt/ktx

mirror of https://github.com/Kaelio/ktx.git synced 2026-07-22 11:51:01 +02:00

ktx is the context layer for analytics agents https://docs.kaelio.com/ktx

Andrey Avtomonov 394a985d2a
fix(snowflake): unblock multi-schema ingest and relationship discovery (#204)

* feat(setup): drop redundant Snowflake schema prompt; fall back to free-text on listSchemas failure

  Snowflake setup previously asked for a single schema as free text, then ran a multiselect against the discovered schemas: two schema questions back-to-back, with the first being only a session bootstrap. The SDK's `schema` is optional, so the bootstrap step is unnecessary.

  - Remove the free-text Snowflake schema prompt; only pass `schema` to snowflake-sdk when one is configured.
  - When `listSchemas()` fails (e.g. role lacks SHOW SCHEMAS), prompt the user for a comma-separated list, persist it as `schema_names`, and use it as both the table-list filter and the multiselect default. Applies to every driver with a scope-discovery spec, not just Snowflake.
  - Update docs to lead with `schema_names`; keep `schema_name` as a documented single-schema shorthand.

* fix(snowflake): keep introspecting when primary-key discovery is denied

  The PK query joins INFORMATION_SCHEMA.TABLE_CONSTRAINTS and INFORMATION_SCHEMA.KEY_COLUMN_USAGE, which require grants the connection role may not have. Previously a 'SQL compilation error: Object ANALYTICS.INFORMATION_SCHEMA.KEY_COLUMN_USAGE does not exist or not authorized' aborted the entire introspect: schemas, columns, and row counts were all discarded over a missing nice-to-have.

  Wrap the constraint query in try/catch, log a one-line warning per schema, and return an empty PK map. Columns end up with primaryKey=false; relationship inference still has FK and profiling to fall back on.

* fix(scan): unblock relationship discovery on Snowflake

  Two adjacent bugs prevented the scan's relationship pipeline from producing any joins on a Snowflake warehouse:

  - relationship-profiling.ts fell through to a default `GROUP_CONCAT` branch for unknown drivers. Snowflake has no GROUP_CONCAT, so every per-table profile query failed with "Unknown function GROUP_CONCAT". Add an explicit Snowflake branch that uses LISTAGG with a literal '\x1f' delimiter (Snowflake requires the delimiter to be a constant, so CHR(31) is rejected).
  - description-generation.ts destructured `connector.sampleTable` and `connector.sampleColumn` into bare locals, losing the `this` binding when the class-method connectors (Snowflake, Postgres, MySQL) were invoked. Every sample call threw "Cannot read properties of undefined (reading 'assertConnection')" and degraded LLM descriptions to metadata-only prompts. Call the methods through the connector instead.

  Without these, even after the primary-key probe is allowed to fail softly, the scan ends up with 0 validated relationships and an empty `joins:` block in every shard YAML.

* test(scan): cover table-ref helpers
* feat(scan): plumb tableScope through live-database introspection port
* feat(scan): apply tableScope during metadata fetch
* feat(scan): enforce table scope at fetch boundary
* feat(scan): pool Snowflake sessions and batch enrichment for faster ingest (#206)
* feat(cli): add RSA key-pair auth option to Snowflake setup wizard

  Extends the interactive Snowflake setup flow with an authentication-method prompt (password vs RSA/JWT key-pair). The RSA branch collects a private-key path (env/file/absolute) and an optional passphrase; the resulting connection config records `authMethod: 'rsa'` with `privateKey` and `passphrase` instead of `password`.

* feat(scan): pool Snowflake sessions
* fix(scan): reuse structural snapshots and cleanup connectors
* feat(scan): parallelize relationship profiling
* feat(scan): batch table description generation
* docs: document Snowflake ingest concurrency knobs
* fix(scan): close Snowflake ingest perf verification gaps
* fix(scan): keep batched description failure bounded
* feat(scan): dispatch query-history probes by connection driver

  Extract historic-sql dialect resolution into a shared helper so the status-project readiness check and the local ingest factory agree on which connections enable query history and which probe to run. The status command now picks the postgres/snowflake/bigquery probe based on the connection's driver instead of always reporting against postgres, which previously caused snowflake connections with queryHistory.enabled to surface a misleading "driver is snowflake" failure.

  Also drops a noisy console.warn from Snowflake primary-key discovery; INFORMATION_SCHEMA.KEY_COLUMN_USAGE is commonly ungranted for read-only roles and the FK + profiling paths handle the empty PK map already.

* fix(llm): allow StructuredOutput tool and raise maxTurns for generateObject

  The Claude Code agent SDK announces an internal pseudo-tool named StructuredOutput in the system/init message whenever outputFormat is set to { type: 'json_schema' }. The runtime's isolation check built its allowedToolIds set only from MCP tool ids and treated StructuredOutput as an unexpected host-injected tool, so every generateObject call threw "Claude Code runtime isolation failed: tools=StructuredOutput ..." and the table-descriptions and relationship-LLM-proposal enrichment stages recorded null output across the board.

  Whitelist StructuredOutput specifically in generateObject's allowedToolIds; the check also enforces missing_tools symmetry, so generateText and runAgentLoop, which do not see StructuredOutput, must not require it.

  generateObject also ran with maxTurns: 1, which the model intermittently breached when it emitted thinking text before the structured response. Raised to 5 to give the schema-bound call enough headroom without allowing unbounded loops.

  The existing tests now exercise the path with an init message that announces StructuredOutput so the regression cannot slip back in.

* chore(scripts): add ktx-reset.sh project-cleanup helper

  Convenience script for repeatable ingest testing: takes a project directory and prunes everything except ktx.yaml and .ktx/secrets/, so the next ktx setup or ktx ingest run starts from a known-clean state.

2026-05-23 10:41:30 +02:00
Name	Last commit	Date
.github	chore(workspace): gate dead-code with knip production mode (#196)	2026-05-21 15:28:58 +02:00
assets	feat(docs-site): refresh nav mascot with SVG and bump size (#101)	2026-05-14 23:45:41 +02:00
docs	chore(workspace): gate dead-code with knip production mode (#196)	2026-05-21 15:28:58 +02:00
docs-site	fix(snowflake): unblock multi-schema ingest and relationship discovery (#204)	2026-05-23 10:41:30 +02:00
examples	chore(workspace): gate dead-code with knip production mode (#196)	2026-05-21 15:28:58 +02:00
packages/cli	fix(snowflake): unblock multi-schema ingest and relationship discovery (#204)	2026-05-23 10:41:30 +02:00
python	fix(snowflake): unblock multi-schema ingest and relationship discovery (#204)	2026-05-23 10:41:30 +02:00
scripts	fix(snowflake): unblock multi-schema ingest and relationship discovery (#204)	2026-05-23 10:41:30 +02:00
website	feat(docs): add Fumadocs site workspace	2026-05-11 01:08:31 -07:00
.gitignore	chore: remove private planning docs (#140)	2026-05-19 14:58:55 +02:00
.pre-commit-config.yaml	chore(workspace): gate dead-code with knip production mode (#196)	2026-05-21 15:28:58 +02:00
.releaserc.cjs	feat: add claude-code llm backend with runtime port (#115)	2026-05-16 12:06:34 +02:00
AGENTS.md	feat(telemetry): anonymous posthog usage telemetry across node cli and python daemon (#205)	2026-05-22 18:18:47 +02:00
biome.json	feat: merge ingest and scan	2026-05-14 01:43:06 +02:00
CLAUDE.md	Initial open-source release	2026-05-10 23:12:26 +02:00
codecov.yml	refactor(release): drop release-policy.json runtime dep and next branch (#180)	2026-05-20 13:53:14 +02:00
conductor.json	[codex] Add Conductor workspace scripts (#2)	2026-05-11 09:55:42 +02:00
CONTRIBUTING.md	chore(community): rewards program, issue templates, and triage workflow (#176)	2026-05-19 19:42:06 -04:00
GEMINI.md	Initial open-source release	2026-05-10 23:12:26 +02:00
knip.json	feat(telemetry): anonymous posthog usage telemetry across node cli and python daemon (#205)	2026-05-22 18:18:47 +02:00
LICENSE	ci: run pre-commit checks in CI (#74)	2026-05-13 19:49:25 +02:00
package.json	chore(workspace): gate dead-code with knip production mode (#196)	2026-05-21 15:28:58 +02:00
pnpm-lock.yaml	feat(telemetry): anonymous posthog usage telemetry across node cli and python daemon (#205)	2026-05-22 18:18:47 +02:00
pnpm-workspace.yaml	fix: resolve dependabot security advisories (#179)	2026-05-20 14:17:29 +02:00
pyproject.toml	ci: add codecov coverage reporting (#82)	2026-05-14 01:13:31 +02:00
README.md	feat(telemetry): anonymous posthog usage telemetry across node cli and python daemon (#205)	2026-05-22 18:18:47 +02:00
release-policy.json	chore(workspace): gate dead-code with knip production mode (#196)	2026-05-21 15:28:58 +02:00
SECURITY.md	chore(community): rewards program, issue templates, and triage workflow (#176)	2026-05-19 19:42:06 -04:00
tsconfig.base.json	perf(setup): speed up conductor setup and make it rerun-safe (#107)	2026-05-15 12:06:37 +02:00
uv.lock	feat(telemetry): anonymous posthog usage telemetry across node cli and python daemon (#205)	2026-05-22 18:18:47 +02:00

README.md

The context layer for data agents

ktx is a self-improving context layer that teaches agents to query your warehouse accurately, drawing on approved metric definitions, joinable columns, and business knowledge that it builds and maintains for you.

Works with PostgreSQL, Snowflake, BigQuery, ClickHouse, MySQL, SQL Server, and SQLite. Integrates with dbt, MetricFlow, LookML, Looker, Metabase, and Notion.

Runs with your own LLM API keys or a Claude Pro/Max subscription, with no extra usage billing from ktx.

Why ktx

General-purpose agents struggle on data tasks. They re-explore your warehouse on every question, invent their own metric logic, and return numbers that don't match approved definitions.

Traditional semantic layers don't fix this. They demand constant manual upkeep and don't absorb the rest of your company's knowledge.

ktx addresses both problems automatically:

Learns from company knowledge. Ingests wiki content, organizes it, removes duplicates, and flags contradictions for human review.
Maps the data stack. Samples tables, captures metadata and usage patterns, detects joinable columns, and annotates sources so agents write better queries.
Builds a semantic layer. Combines raw tables and high-level metrics through a join graph that automatically resolves chasm and fan traps, so agents fetch metrics declaratively instead of rewriting canonical SQL each time.
Serves agents at execution. Exposes CLI and MCP tools with combined full-text and semantic search across wiki and semantic-layer entities.

Agents can run raw SQL when they need it, or compose semantic-layer queries when they want approved metrics with reliable joins.
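
To see why chasm and fan traps matter, consider a small hypothetical schema (the table and column names below are invented for illustration and are not taken from ktx): one order with a shipping fee, two line items, and two payments. The sketch uses SQLite to show how naive joins inflate sums, and one common fix, which is to aggregate each child table before joining. It illustrates the general problem only; it does not describe how ktx's join graph resolves it.

```python
import sqlite3

# Hypothetical toy schema: one parent row with two independent child tables.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE orders (id INTEGER PRIMARY KEY, shipping_fee INTEGER);
CREATE TABLE order_items (order_id INTEGER, amount INTEGER);
CREATE TABLE payments (order_id INTEGER, paid INTEGER);
INSERT INTO orders VALUES (1, 10);
INSERT INTO order_items VALUES (1, 60), (1, 40);
INSERT INTO payments VALUES (1, 70), (1, 30);
""")


def one_row(sql):
    """Run a query and return its single result row as a tuple."""
    return con.execute(sql).fetchone()


# Fan-trap pattern: the order-level fee is repeated once per item row,
# so the true fee of 10 is reported as 20.
fan_naive = one_row("""
    SELECT SUM(o.shipping_fee)
    FROM orders o
    JOIN order_items i ON i.order_id = o.id
""")

# Chasm-trap pattern: two independent one-to-many joins multiply each
# other's rows (2 items x 2 payments = 4 rows), so both sums double.
chasm_naive = one_row("""
    SELECT SUM(i.amount), SUM(p.paid)
    FROM orders o
    JOIN order_items i ON i.order_id = o.id
    JOIN payments p ON p.order_id = o.id
""")

# One common fix: aggregate each child table to one row per order first.
safe = one_row("""
    SELECT o.shipping_fee, i.amount, p.paid
    FROM orders o
    JOIN (SELECT order_id, SUM(amount) AS amount
          FROM order_items GROUP BY order_id) i ON i.order_id = o.id
    JOIN (SELECT order_id, SUM(paid) AS paid
          FROM payments GROUP BY order_id) p ON p.order_id = o.id
""")

print(fan_naive, chasm_naive, safe)  # (20,) (200, 200) (10, 100, 100)
```

Every agent that writes this SQL by hand has to remember the pre-aggregation step; a semantic layer moves that responsibility into the query planner.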

[Figure: ktx ingestion flow from source systems through validation to wiki and semantic-layer outputs]

Agent Setup

Ask an agent such as Claude Code, Codex, Cursor, or OpenCode to install and configure ktx from your project directory:

Follow instructions from
https://docs.kaelio.com/ktx/docs/agents-setup.md
to install and configure ktx

Quick Start

npm install -g @kaelio/ktx
ktx setup
ktx status

ktx setup creates or resumes a local ktx project, configures providers and connections, builds context, and installs agent integration.

Example ktx status output after setup:

ktx project: /home/user/analytics
Project ready: yes
LLM ready: yes (claude-sonnet-4-6)
Embeddings ready: yes (text-embedding-3-small)
Databases configured: yes (warehouse)
Context sources configured: yes (dbt_main)
ktx context built: yes
Agent integration ready: yes (codex:project)
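
For scripting, the readiness lines above can be turned into a simple pass/fail check. The sketch below assumes the plain-text layout shown in the example (`<label>: yes` or `<label>: no`, with an optional detail in parentheses); the real output may differ between versions. `parse_status` and `failing_checks` are hypothetical helpers, not part of ktx.

```python
import re

# Sample text copied from the example above; in practice this would be the
# captured stdout of `ktx status`.
SAMPLE = """\
ktx project: /home/user/analytics
Project ready: yes
LLM ready: yes (claude-sonnet-4-6)
Embeddings ready: yes (text-embedding-3-small)
Databases configured: yes (warehouse)
Context sources configured: yes (dbt_main)
ktx context built: yes
Agent integration ready: yes (codex:project)
"""

# Matches "<label>: yes" or "<label>: no", with an optional "(detail)" suffix.
LINE = re.compile(
    r"^(?P<label>[^:]+):\s+(?P<state>yes|no)(?:\s+\((?P<detail>.*)\))?$"
)


def parse_status(text):
    """Return {label: (is_ready, detail_or_None)} for every yes/no line."""
    checks = {}
    for raw in text.splitlines():
        match = LINE.match(raw.strip())
        if match:  # other lines, such as "ktx project: <path>", are skipped
            checks[match["label"]] = (match["state"] == "yes", match["detail"])
    return checks


def failing_checks(text):
    """List the labels whose state is not "yes"."""
    return [label for label, (ok, _) in parse_status(text).items() if not ok]


print(failing_checks(SAMPLE))  # []
```

A CI step could fail the build whenever `failing_checks` returns a non-empty list.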

Telemetry

ktx collects anonymous usage telemetry from interactive CLI runs to improve setup, command reliability, and data-agent workflows. See Telemetry for the event catalog, privacy details, and opt-out options.

Common Commands

Command	Purpose
`ktx setup`	Create, resume, or update a ktx project
`ktx status`	Check project readiness
`ktx connection`	List configured connections
`ktx connection test`	Test every configured connection
`ktx connection test <id>`	Test one connection
`ktx ingest`	Build context for every configured connection
`ktx ingest <id>`	Build context for one connection
`ktx ingest --text "..."`	Capture free-form notes into memory
`ktx ingest --file notes.md --connection-id <id>`	Capture a text file into memory
`ktx sl`	List semantic sources
`ktx sl "revenue"`	Search semantic sources
`ktx sl validate <source> --connection-id <id>`	Validate a semantic source
`ktx sl query --measure <measure> --format sql`	Compile semantic-layer SQL
`ktx sql --connection <id> "select 1"`	Execute read-only SQL
`ktx wiki`	List local wiki pages
`ktx wiki "revenue definition"`	Search local wiki pages
`ktx mcp`	Show MCP daemon status
`ktx mcp start`	Start the local MCP server for agent clients

Project resolution defaults to KTX_PROJECT_DIR, then the nearest ktx.yaml, then the current directory. Pass --project-dir <path> when scripting.
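
The resolution order can be sketched as a small function. This is an illustration of the documented order only, not ktx's actual code: `resolve_project_dir` is a hypothetical name, "nearest ktx.yaml" is assumed to mean a search upward through parent directories, and the `--project-dir` flag is not modeled.

```python
import os
from pathlib import Path


def resolve_project_dir(cwd=None, env=None):
    """Hypothetical helper mirroring the documented resolution order."""
    env = os.environ if env is None else env
    start = Path(cwd) if cwd is not None else Path.cwd()

    # 1. An explicit KTX_PROJECT_DIR takes priority.
    override = env.get("KTX_PROJECT_DIR")
    if override:
        return Path(override)

    # 2. Otherwise look for the nearest ktx.yaml, starting in the current
    #    directory and walking up through its parents (assumed direction).
    for candidate in (start, *start.parents):
        if (candidate / "ktx.yaml").is_file():
            return candidate

    # 3. Fall back to the starting directory itself.
    return start
```

Because the result depends on both the environment and the working directory, scripts that pass an explicit path avoid surprises when run from a different location.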

Project Layout

my-project/
├── ktx.yaml                         # Project configuration
├── semantic-layer/<connection-id>/  # YAML semantic sources
├── wiki/global/                     # Shared business context
├── wiki/user/<user-id>/             # User-scoped notes
├── raw-sources/<connection-id>/     # Ingest artifacts and reports
└── .ktx/                            # Local state and secrets, git-ignored

Commit ktx.yaml, semantic-layer/, and wiki/. Keep .ktx/ local.

Agent Usage

Install the ktx integration for Claude Code, Claude Desktop, Codex, Cursor, OpenCode, and generic .agents clients:

ktx setup --agents

Pass --target <target> to install or repair one specific integration.

A typical agent workflow combines wiki and semantic-layer search before querying:

ktx sl "revenue" --json
ktx wiki "refund policy" --json
ktx sl query --connection-id warehouse --measure orders.revenue --format sql

During setup, choose "Ask data questions with ktx MCP" for agent clients. Choose "Ask data questions + manage ktx with CLI commands" when an operator agent also needs pinned ktx admin commands.

After setup, ktx prints a "Required before using agents" section with the exact commands to run. If the output includes ktx mcp start --project-dir ..., run it before opening your agent. Claude Desktop uses its own launcher and prints separate skill upload steps under .ktx/agents/claude/.

Workspace Layout

Path	Purpose
`packages/cli`	TypeScript CLI package and published npm package source
`packages/cli/src/context`	Core context engine
`packages/cli/src/llm`	LLM and embedding providers
`packages/cli/src/connectors`	Database scan connectors
`python/ktx-sl`	Semantic-layer query planning
`python/ktx-daemon`	Portable compute service

Development

git clone https://github.com/kaelio/ktx.git
cd ktx
pnpm install
uv sync --all-groups
pnpm run build
pnpm run check

Use the development CLI locally:

pnpm run setup:dev
pnpm run link:dev
ktx-dev --help

ktx is a pnpm + uv workspace:

TypeScript packages live in packages/*
CLI source lives in packages/cli
Python runtime source lives in python/ktx-sl and python/ktx-daemon
Public docs live in docs-site/content/docs

Useful checks:

pnpm run type-check
pnpm run test
pnpm run dead-code
uv run pytest -q

Docs

Community

Slack — ask questions, share what you're building, and chat with maintainers and other users.
GitHub Issues — report bugs and request features.
Contributing guide — set up the repo, run tests, and open a PR.

See Community & Support for the full guide on where to ask what.

License

ktx is licensed under the Apache License, Version 2.0. See LICENSE.