The context layer for data agents
Quickstart · CLI Reference · Agent Setup · Slack
Built and maintained by Kaelio
ktx is a self-improving context layer that teaches agents how to query your warehouse accurately - from approved metric definitions, joinable columns, and business knowledge it builds and maintains for you.
Note
Run ktx with your own LLM API keys or a local agent sign-in — a Claude Pro/Max subscription through Claude Code, or your local Codex authentication. No extra usage billing from ktx.
Why ktx
General-purpose agents struggle on data tasks. They re-explore your warehouse on every question, invent their own metric logic, and return numbers that don't match approved definitions.
Traditional semantic layers don't fix this. They demand constant manual upkeep and don't absorb the rest of your company's knowledge.
ktx does both, automatically:
- Learns from company knowledge. Ingests wiki content, organizes it, removes duplicates, and flags contradictions for human review.
- Maps the data stack. Samples tables, captures metadata and usage patterns, detects joinable columns, and annotates sources so agents write better queries.
- Builds a semantic layer. Combines raw tables and high-level metrics through a join graph that automatically resolves chasm and fan traps, so agents fetch metrics declaratively instead of rewriting canonical SQL each time.
- Serves agents at execution. Exposes CLI and MCP tools with combined full-text and semantic search across wiki and semantic-layer entities.
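To make the fan-trap point concrete, here is a minimal sketch of the problem a join graph has to guard against. The data, type names, and functions are invented for illustration and are not taken from ktx. It shows only the general technique of aggregating the "many" side before joining, not how ktx implements it.

```typescript
// Hypothetical example data: one order can have several shipments.
interface Order { orderId: number; amount: number }
interface Shipment { orderId: number; boxes: number }

const orders: Order[] = [
  { orderId: 1, amount: 100 },
  { orderId: 2, amount: 50 },
];
const shipments: Shipment[] = [
  { orderId: 1, boxes: 1 },
  { orderId: 1, boxes: 2 },
  { orderId: 2, boxes: 1 },
];

// Fan trap: join first, then sum. Order 1 matches two shipment rows,
// so its amount is counted twice.
function naiveRevenue(os: Order[], ss: Shipment[]): number {
  let total = 0;
  for (const o of os) {
    for (const s of ss) {
      if (s.orderId === o.orderId) total += o.amount;
    }
  }
  return total;
}

// Safer: collapse shipments to one row per order, then join.
// Each order amount is now counted exactly once.
function safeTotals(os: Order[], ss: Shipment[]): { revenue: number; boxes: number } {
  const boxesByOrder = new Map<number, number>();
  for (const s of ss) {
    boxesByOrder.set(s.orderId, (boxesByOrder.get(s.orderId) ?? 0) + s.boxes);
  }
  let revenue = 0;
  let boxes = 0;
  for (const o of os) {
    revenue += o.amount;
    boxes += boxesByOrder.get(o.orderId) ?? 0;
  }
  return { revenue, boxes };
}

console.log(naiveRevenue(orders, shipments)); // 250 (inflated)
console.log(safeTotals(orders, shipments)); // { revenue: 150, boxes: 4 }
```

The true revenue here is 150. The naive join reports 250 because the one-to-many relationship duplicates order rows before the sum runs.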
How ktx compares
| Capability | General-purpose agent | Traditional semantic layer | ktx |
|---|---|---|---|
| Builds warehouse context automatically | — | — | ✓ |
| Detects joinable columns + resolves fan/chasm traps | — | Manual | ✓ |
| Approved, reusable metric definitions | — | ✓ | ✓ |
| Absorbs wiki / Notion / team knowledge | — | — | ✓ |
| Flags contradictions across sources | — | — | ✓ |
| Ships CLI + MCP for agent execution | Partial | — | ✓ |
| Read-only by design | n/a | n/a | ✓ |
Who is ktx for
Use ktx if you:
- Want agents like Claude Code, Codex, Cursor, or OpenCode to query your warehouse with approved metric definitions
- Have business knowledge scattered across dbt, Looker, Metabase, Notion, and team wikis
- Need agents to reuse canonical SQL instead of inventing it on every prompt
Skip ktx if you:
- Don't have a SQL warehouse; ktx sits on top of one
- Only need one ad-hoc query; `psql` or a notebook will do
Works with PostgreSQL, Snowflake, BigQuery, ClickHouse, MySQL, SQL Server, and SQLite. Integrates with dbt, MetricFlow, LookML, Looker, Metabase, and Notion.
Quick Start
npm install -g @kaelio/ktx
ktx setup
ktx status
ktx setup creates or resumes a local ktx project, configures providers
and connections, builds context, and installs agent integration.
Example ktx status after setup:
ktx project: /home/user/analytics
Project ready: yes
LLM ready: yes (claude-sonnet-4-6)
Embeddings ready: yes (text-embedding-3-small)
Databases configured: yes (warehouse)
Context sources configured: yes (dbt_main)
ktx context built: yes
Agent integration ready: yes (codex:project)
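If you want to gate a script on readiness, one option is to parse that output. The sketch below assumes the `label: yes (detail)` shape shown in the example and assumes a not-ready line would read `no`. Neither is documented as a stable format, and `parseStatus` and `StatusLine` are hypothetical names, not part of ktx.

```typescript
// Hypothetical helper: turn example-style `ktx status` text into structured rows.
interface StatusLine {
  label: string;
  ready: boolean;
  detail: string | undefined;
}

function parseStatus(output: string): StatusLine[] {
  const rows: StatusLine[] = [];
  for (const line of output.split("\n")) {
    // Matches "LLM ready: yes (claude-sonnet-4-6)" and "Project ready: yes".
    // Assumption: a failing check prints "no" in the same position.
    const m = /^(.+?):\s+(yes|no)(?:\s+\((.+)\))?\s*$/.exec(line.trim());
    if (!m) continue; // skips lines such as "ktx project: /home/user/analytics"
    rows.push({ label: m[1] ?? "", ready: m[2] === "yes", detail: m[3] });
  }
  return rows;
}

// Labels of every check that is not ready yet.
function notReady(output: string): string[] {
  return parseStatus(output)
    .filter((row) => !row.ready)
    .map((row) => row.label);
}
```

A script could call `notReady` on the captured output and stop when the returned list is non-empty.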
Tip
Already using an agent? Ask Claude Code, Codex, Cursor, or OpenCode from your project directory:
Run npx skills add Kaelio/ktx --skill ktx and use the ktx skill to install and configure ktx in this project.
Important
If `ktx status` prints `ktx mcp start --project-dir ...`, run it before opening your agent client.
Upgrading
Re-run the global install with the @latest tag:
npm install -g @kaelio/ktx@latest
First commands
| Command | Purpose |
|---|---|
| `ktx setup` | Create, resume, or update a ktx project |
| `ktx status` | Check project readiness |
| `ktx ingest` | Build context for every configured connection |
| `ktx sl "revenue"` | Search semantic sources |
| `ktx wiki "refund policy"` | Search local wiki pages |
| `ktx mcp start` | Start the MCP server for agent clients |
See the CLI Reference for every command, flag, and option.
Project Layout
my-project/
├── ktx.yaml # Project configuration
├── semantic-layer/<connection-id>/ # YAML semantic sources
├── wiki/global/ # Shared business context
├── wiki/user/<user-id>/ # User-scoped notes
├── raw-sources/<connection-id>/ # Ingest artifacts and reports
└── .ktx/ # Local state and secrets, git-ignored
Commit ktx.yaml, semantic-layer/, and wiki/. Keep .ktx/ local.
Project resolution defaults to KTX_PROJECT_DIR, then the nearest ktx.yaml,
then the current directory. Pass --project-dir <path> when scripting.
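The lookup order can be sketched as a small pure function. This is an illustrative sketch, not ktx's code: `resolveProjectDir`, `parentDir`, and the `hasKtxYaml` probe are hypothetical names. It also assumes that "nearest" means walking upward from the current directory and that an explicit `--project-dir` value wins over everything else. Paths are treated as POSIX-style for brevity.

```typescript
// Hypothetical probe: returns true when `dir` contains a ktx.yaml file.
type ConfigProbe = (dir: string) => boolean;

// Parent of a POSIX-style absolute path ("/a/b" -> "/a", "/a" -> "/").
function parentDir(dir: string): string {
  const trimmed = dir.length > 1 && dir.endsWith("/") ? dir.slice(0, -1) : dir;
  const cut = trimmed.lastIndexOf("/");
  return cut <= 0 ? "/" : trimmed.slice(0, cut);
}

function resolveProjectDir(
  env: Record<string, string | undefined>,
  cwd: string,
  hasKtxYaml: ConfigProbe,
  explicitDir?: string,
): string {
  // Assumption: an explicit --project-dir overrides the defaults.
  if (explicitDir) return explicitDir;

  // 1. KTX_PROJECT_DIR
  const fromEnv = env["KTX_PROJECT_DIR"];
  if (fromEnv) return fromEnv;

  // 2. Nearest ktx.yaml, searching from cwd up to the filesystem root.
  let dir = cwd;
  while (true) {
    if (hasKtxYaml(dir)) return dir;
    const parent = parentDir(dir);
    if (parent === dir) break; // reached "/"
    dir = parent;
  }

  // 3. Fall back to the current directory.
  return cwd;
}
```

Taking the environment and the file probe as parameters keeps the function deterministic, which is the same reason to pass `--project-dir <path>` in scripts.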
FAQ
- Does ktx send my schema or query results to a hosted service? No. ktx runs locally. The only data leaving your machine is what you send to the LLM provider you configured.
- Which LLM backends are supported? Anthropic API, Google Vertex AI, AI Gateway, the local Claude Code session through the Claude Agent SDK, and your local Codex authentication through the Codex SDK. See LLM configuration.
- How is ktx different from a dbt or MetricFlow semantic layer? ktx ingests those layers and combines them with raw-table introspection and wiki content. Agents get one searchable surface instead of three disconnected ones - and ktx flags contradictions across sources.
- Does ktx need a running server? There is no hosted service. The local MCP daemon runs on demand via `ktx mcp start` when an agent client needs it.
- Is my warehouse safe? Yes. Connections are read-only - ktx never writes to your database.
Docs
Community
- Slack — ask questions, share what you're building, and chat with maintainers.
- GitHub Issues — report bugs and request features.
- Contributing — set up the repo, run tests, and open a PR.
Development
git clone https://github.com/kaelio/ktx.git
cd ktx
pnpm install
uv sync --all-groups
pnpm run build
pnpm run check
ktx is a pnpm + uv workspace:
| Path | Purpose |
|---|---|
| `packages/cli` | TypeScript CLI and published npm package source |
| `packages/cli/src/context` | Core context engine |
| `packages/cli/src/llm` | LLM and embedding providers |
| `packages/cli/src/connectors` | Database scan connectors |
| `python/ktx-sl` | Semantic-layer query planning |
| `python/ktx-daemon` | Portable compute service |
Local development CLI:
pnpm run setup:dev
pnpm run link:dev
ktx-dev --help
Useful checks:
pnpm run type-check
pnpm run test
pnpm run dead-code
uv run pytest -q
Telemetry
ktx collects privacy-conscious usage telemetry to understand installs and
improve setup, command reliability, and data-agent workflows. Catalog telemetry
events do not record file paths, hostnames, SQL, schema names, table names,
column names, error messages, raw environment values, or argv. Error reports use
PostHog Error Tracking and can include stack frames and raw error messages,
which may contain local file paths or the local username in those paths.
ktx redacts secrets, credentials, database URLs, auth headers, argv, raw
environment values, SQL text, row data, and user-typed prompt or MCP argument
text from the explicit $exception payload. See
Telemetry for the event
catalog and opt-out options.
License
ktx is licensed under the Apache License, Version 2.0. See LICENSE.