mirror of
https://github.com/Kaelio/ktx.git
synced 2026-07-01 08:59:39 +02:00
10 commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
2afab61417
|
feat(connectors): add MongoDB connector (#305) (#310)
* refactor(connectors): split KtxDialect into core and KtxSqlDialect
Separate the dialect contract into a driver-agnostic core (display/ref formatting and type mapping) and a SQL-only extension (query generators). The catalog and entity-details paths resolve the core dialect for any snapshot driver, so it must stay free of SQL generation; this is the prerequisite refactor for adding non-SQL primary sources.
- KtxDialect keeps type, formatDisplayRef, parseDisplayRef, columnDisplayTablePartCount, mapDataType, mapToDimensionType
- KtxSqlDialect extends it with quoteIdentifier, formatTableName, and the query/sample/statistics generators; the 7 SQL dialects implement it
- add getSqlDialectForDriver for SQL drivers; the 7 connectors and the relationship-benchmark harness consume it
- thread the relationship pipeline (profiling/validation/composite/discovery) as KtxSqlDialect | null so a non-SQL source skips coverage SQL and its candidates stay in review; local-enrichment builds the SQL dialect only when the connector advertises readOnlySql
Pure extraction: no behavior change for the existing 7 drivers.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* feat(connectors): add MongoDB connector for issue #305
Add a read-only MongoDB connector that treats a database as a primary context source: collections map to tables and inferred top-level fields to columns. MongoDB is the first non-SQL source (readOnlySql: false), so ktx sql and metric compilation do not apply, but its collections flow through ingest, descriptions, and relationship discovery.
- schema-inference: infer a flat column schema from the most recent sample_size documents (by _id desc, or order_by for non-ObjectId keys). Union BSON types per field, mark multi-type fields mixed (string), keep sub-documents/arrays as a single opaque json column, derive nullability from presence, treat _id as the primary key
- connector: KtxMongoDbScanConnector behind an injectable client seam; strictly read-only (find/listCollections/estimatedDocumentCount only), no executeReadOnly; resolves env:/file: via resolveKtxConfigReference
- core-only KtxMongoDbDialect and a live-database introspection adapter
- wire the mongodb driver: driver union, dialect registry, driver registration (scopeConfigKey databases), mongodbConnectionSchema, connection-drivers, normalizeDriver, the live-database route, and the ktx setup picker. ktx sql is refused by the read-only SQL capability gate
- tests: schema inference, connector snapshot via a fake client, dialect, driver-schema parsing, and the ktx sql rejection
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* docs(integrations): document the MongoDB primary source
Add a MongoDB section to the primary-sources reference: connection config (url, databases, enabled_tables, sample_size, order_by), mongodb+srv/TLS/Atlas notes, the schema-inference explainer, a features matrix, and the non-SQL caveat. Update the frontmatter and connection field reference.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* fix(connectors): address review blockers on the MongoDB connector
- introspect: skip estimatedDocumentCount for views. The count command is rejected on a MongoDB view (CommandNotSupportedOnView), so counting a view aborted introspect for the whole connection; compute estimatedRows only for real collections, as ClickHouse does.
- sl: refuse a semantic-layer query against a non-SQL connection instead of defaulting it to the Postgres dialect. compileLocalSlQuery (the shared CLI + MCP path) now rejects a driver with no SQL dialect via the new isSqlQueryableDriver authority, keeping MongoDB context-only per issue #305.
- tests: cover input.tableScope and the empty-scope skip for the Mongo connector (the scan layer does not post-filter), the view no-count path, and the ktx sl query refusal for a mongodb connection.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* polish(mongodb): compute sampled nullCount and document sampling caveats
Address the non-blocking review notes:
- sampleColumn now counts null/absent values over the sampled window instead of returning nullCount: null, since the documents are already in hand
- warn that a custom order_by must be indexed (an unindexed sort hits MongoDB's in-memory sort limit on large collections) in the connection schema and docs
- note that sampled values for nested fields are stringified, not faithfully serialized, so the json opacity is deliberate
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* docs(examples): add a MongoDB connector example
A manual, container-backed example mirroring examples/postgres-historic:
- docker-compose.yml + init/seed.js seed a representative dataset (nested documents, arrays, a Decimal128, a mixed-type field, a nullable field, an ObjectId reference, and a view) on first container start
- scripts/smoke.sh + introspect-smoke.mjs assert the connector's inferred schema with no LLM credentials (the same introspection entry point ktx ingest's database-schema stage uses, including the view-no-count path)
- README.md documents the smoke and a full keyless ktx ingest run (claude-code LLM + managed sentence-transformers embeddings)
Works with Docker Compose or podman compose. Verified end to end.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore: ignore examples/** in knip to fix dead-code false positives
The MongoDB connector example files (examples/mongodb/init/seed.js and examples/mongodb/scripts/introspect-smoke.mjs) are used at runtime but were flagged as unused by knip. Add examples/** to the ignore array, matching the existing .context/** entry.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_0114qQV8fJ5a5ME3XbMVRzbL
* fix(mongodb): refuse non-SQL connections before SQL analysis
`ktx sql` and the MCP sql_execution tool resolved a SQL-analysis dialect (falling back to Postgres for a non-SQL driver) and ran read-only validation before the connector capability gate refused the connection. For a MongoDB connection that spun up the parser/daemon and produced Postgres parser diagnostics instead of a clean non-SQL refusal.
Route both entry points through a shared assertSqlQueryableConnection guard before dialect selection, mirroring compileLocalSlQuery. The federated duckdb path has no driver and is exempted at each call site. Add CLI and MCP regression tests asserting validation/connector work never starts for a MongoDB connection.
* fix(mongodb): pass CI gates (dialect boundary, secrets, setup test)
Three latent failures in the connector surfaced once CI ran on the branch:
- connector.ts imported the concrete KtxMongoDbDialect, which the connector dialect-import boundary forbids. Route it through getDialectForDriver('mongodb') and widen inferKtxMongoCollectionColumns to the base KtxDialect (it only uses mapDataType/mapToDimensionType).
- detect-secrets flagged a test ObjectId hex and the mongodb+srv example URL; annotate both with allowlist pragmas.
- the "shows every supported database" setup test omitted the new MongoDB option.
---------
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-authored-by: Luca Martial <48870843+luca-martial@users.noreply.github.com>
Co-authored-by: Luca Martial <lucamrtl@gmail.com>
Co-authored-by: Andrey Avtomonov <andreybavt@gmail.com>
|
||
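The schema-inference rules listed in the MongoDB connector commit above (union of types per field, multi-type fields marked mixed, nested values kept as one opaque json column, nullability from presence, `_id` as primary key) can be sketched roughly as follows. This is only an illustration over plain JavaScript objects, not the connector's code: `inferColumns`, `InferredColumn`, and `valueKind` are hypothetical names, real BSON types such as ObjectId or Decimal128 are not modeled, and reporting an all-null field as `unknown` is an assumption.

```typescript
// Hypothetical sketch: infer a flat column schema from already-sampled documents.
type Doc = Record<string, unknown>;

interface InferredColumn {
  name: string;
  type: string; // one observed kind, "json" for nested values, "mixed", or "unknown"
  nullable: boolean; // true when the field is null or absent in any sampled document
  primaryKey: boolean; // only _id is treated as the primary key
}

// Classify one value; sub-documents and arrays collapse to an opaque "json" kind.
function valueKind(value: unknown): string {
  if (value instanceof Date) return "date";
  if (Array.isArray(value) || (typeof value === "object" && value !== null)) return "json";
  return typeof value; // "string", "number", "boolean", ...
}

function inferColumns(docs: Doc[]): InferredColumn[] {
  const kinds = new Map<string, Set<string>>(); // field -> kinds seen
  const present = new Map<string, number>(); // field -> count of non-null values

  for (const doc of docs) {
    for (const field of Object.keys(doc)) {
      if (!kinds.has(field)) kinds.set(field, new Set<string>());
      const value = doc[field];
      if (value === null || value === undefined) continue; // treated as absent
      kinds.get(field)!.add(valueKind(value));
      present.set(field, (present.get(field) ?? 0) + 1);
    }
  }

  return Array.from(kinds.entries()).map(([name, seen]) => ({
    name,
    // Assumption: a field that was only ever null is reported as "unknown".
    type: seen.size === 1 ? Array.from(seen)[0] : seen.size === 0 ? "unknown" : "mixed",
    nullable: (present.get(name) ?? 0) < docs.length,
    primaryKey: name === "_id",
  }));
}
```

A field that appears with two different kinds across the sample comes back as `mixed`, which the commit message says is then mapped to a string column.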
|
|
494618ab14
|
feat: add codex llm backend for ktx runtime work (#253)
* feat: add codex sdk runner foundation
* feat: parse codex runtime events
* feat: expose codex runtime mcp tools
* feat: add codex llm runtime
* feat: wire codex llm backend
* test: avoid Array.fromAsync in codex runner test
* docs: document codex llm backend
* fix: tighten codex runtime config ownership
* fix: use codex sdk env and thread options
* fix: parse codex sdk event shapes
* test: add codex backend live smoke
* docs: clarify codex backend isolation
* fix: drive codex loop metrics from mcp events
* fix: enforce codex local step budget
* docs: disclose codex isolation limits
* fix: count all codex agent steps and stream step callbacks live
The agent-loop step budget only counted completed mcp_tool_call items, so built-in command_execution steps (which the public Codex SDK/CLI surface can still expose) never decremented the budget, letting ingest/reconciliation run past stepBudget until Codex stopped on its own. onStepFinish was also replayed only after the whole stream drained, so live work_unit_step / reconciliation progress appeared stuck until the Codex process exited.
collectEvents is now the single live step accumulator: it counts every completed agent-action item via a shared isCompletedAgentStep predicate (command_execution, mcp_tool_call, file_change, web_search), fires onStepFinish as each step completes, and enforces the budget on that broader count. A no-tool turn still counts as one step. toolFailures stays MCP-specific, since a non-zero command exit is normal agent exploration, not a loop failure.
* test: align ingest llm-guard assertions with codex backend
The skip-llm ingest guard message now lists codex as a valid backend and mentions a Claude Code/Codex session plus a codex setup hint, but this slow suite test still asserted the pre-codex wording. Update it to match the production message (already covered by the local-bundle-runtime unit test) and add the codex setup-line assertion.
* fix: treat codex error:null tool calls as success
The Codex SDK serializes error: null on successful mcp_tool_call items, so the failure check (item.error !== undefined) flagged every successful tool call as failed with the empty-payload default "Codex turn failed". This killed every ingest work unit under the codex backend before it could produce a patch. Key on status === 'failed' (authoritative, always set) and only treat a populated error object as a failure. Add a regression test built from a verbatim real-SDK event capture.
* fix: default codex backend to gpt-5.5 and report real probe errors
The previous default gpt-5.3-codex is an API-key-only model that the OpenAI API rejects under ChatGPT-account (subscription) auth, so codex status/setup failed with a misleading "authentication is not usable" message even though auth was fine.
- Default codex model is now gpt-5.5 (works on both subscription and API-key auth); the curated setup picker offers gpt-5.5 / gpt-5.4 / gpt-5.4-mini and keeps free-form entry for account-specific ids (e.g. gpt-5.3-codex-spark).
- runCodexAuthProbe now distinguishes "model not available" from an auth failure and surfaces the real API error: collectEvents retains stream events when the SDK throws on a non-zero exit, and the API error JSON envelope is unwrapped to its human-readable message.
- The Codex isolation warning now renders inside the clack setup frame.
- Docs updated to gpt-5.5 with a note that *-codex ids require API-key auth.
* fix: require llm.models.default in status and match codex probe remediation
Status reported a project ready when a non-none LLM backend was configured without llm.models.default, but the runtime (resolveModelSlots) hard-requires it, so ingest/scan/memory threw after `ktx status` said the project was usable. buildLlmStatus now fails for any non-none backend missing models.default and no longer invents a fallback model for claude-code/codex.
Codex probe failures now carry a category-matched fix: a model-access failure steers the user at llm.models.default instead of the auth/install remediation. runCodexAuthProbe returns the fix and status consumes it; the message stays self-sufficient so setup output is unchanged.
Docs: README now lists the codex backend and local Codex auth; ktx-setup.mdx states --llm-model only accepts codex/default or gpt-*/codex-* ids.
Repaired four doctor fixtures that configured a backend without models.default (the now-correctly-blocked config) and added coverage for the new behavior.
|
||
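The step-counting and failure-detection rules from the codex commit above can be illustrated with a small sketch. The `AgentItem` shape, the `summarizeSteps` helper, and the choice to treat any item that is no longer in progress as completed are assumptions made for this example; the real Codex SDK event types and the actual `collectEvents` implementation are not reproduced here.

```typescript
// Hypothetical item shape, loosely modeled on the description in the commit message.
interface AgentItem {
  type: string;
  status: "in_progress" | "completed" | "failed";
  error?: { message: string } | null;
}

// The four agent-action item types named in the commit message.
const STEP_TYPES = new Set(["command_execution", "mcp_tool_call", "file_change", "web_search"]);

// Assumption: an item counts as a completed step once it is no longer in progress.
function isCompletedAgentStep(item: AgentItem): boolean {
  return STEP_TYPES.has(item.type) && item.status !== "in_progress";
}

// Failures stay MCP-specific; `error: null` on a successful call is not a failure.
function isFailedToolCall(item: AgentItem): boolean {
  if (item.type !== "mcp_tool_call") return false;
  if (item.status === "failed") return true;
  return item.error !== null && item.error !== undefined;
}

function summarizeSteps(
  items: AgentItem[],
  stepBudget: number,
  onStepFinish?: (item: AgentItem) => void,
): { steps: number; toolFailures: number } {
  let steps = 0;
  let toolFailures = 0;
  for (const item of items) {
    if (!isCompletedAgentStep(item)) continue;
    steps += 1;
    if (isFailedToolCall(item)) toolFailures += 1;
    onStepFinish?.(item); // fired as each step completes, not after the stream drains
    if (steps >= stepBudget) break; // enforce the budget on the broader count
  }
  // A turn without any tool use still counts as one step.
  return { steps: Math.max(steps, 1), toolFailures };
}
```

Keying the failure check on `status` first is what keeps a successful call that carries `error: null` from being counted as failed.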
|
|
56985b7e09
|
test: split cli tests from source tree (#216)
* feat(cli): define full warehouse dialect contract
* test(cli): keep dialect edge tests focused
* fix(cli): stabilize dialect contract foundation
* refactor(connectors): own read-only query preparation
* refactor(connectors): resolve dialects through registry
* refactor(connectors): keep concrete dialect classes internal
* chore(workspace): enforce dialect import boundary
* refactor(cli): resolve relationship dialect at scan boundary
* refactor(cli): use dialect display parsing for entity details
* refactor(cli): use dialect display parsing for warehouse catalog
* refactor(cli): use dialect SQL in relationship workflows
* test(cli): verify solid dialect scan workflow closure
* test: split cli tests from source tree
* refactor(cli): standardize BigQuery scope listing
* feat(sqlite): implement connector scope listing
* test(connectors): cover required table listing
* feat(cli): add warehouse driver registry
* refactor(setup): route scope discovery through driver registry
* refactor(cli): route local query execution through driver registry
* refactor(historic-sql): route dialect support through driver registry
* refactor(cli): test warehouse connections through driver registry
* fix(cli): close driver registry type export gaps
* Improve setup daemon diagnostics
* refactor(setup): centralize rail-prefixed diagnostics + query-history fallback
Extract errorMessage, writePrefixedLines, and flushPrefixedBufferedCommandOutput
into clack.ts so the setup wizard, managed daemons, and embedding/agent steps
share one rail-formatted writer. setup-databases.ts also adds a
"disable query history and retry" option when the schema-context build fails
and query history is the likely culprit, surfaced via a new
failed-query-history-unavailable status.
* fix(cli): carry catalog through the picker so BigQuery/Snowflake/SQL Server scope filters match
The setup picker's KtxTableListEntry was a 2-level { schema, name }, so
qualifiedTableId always wrote db.name into enabled_tables. When BigQuery,
Snowflake, or SQL Server later ran fast ingest, their introspect step filtered
the scope set with scopedTableNames(scope, { catalog: projectId|database, db })
— catalog was non-null on the introspect side but null in the scope refs, so
every entry was rejected, the live-database adapter staged zero table files,
and detect() failed with 'Adapter "live-database" did not recognize fetched
source output'.
Align the picker boundary with the canonical 3-level KtxTableRef:
- Add catalog: string | null to KtxTableListEntry.
- BigQuery/Snowflake/SQL Server listTables populate catalog from the
resolved projectId / database; Postgres/MySQL/ClickHouse/SQLite set null.
- qualifiedTableId emits catalog.schema.name when catalog is non-null
(resolveEnabledTables already accepts the 3-part shape) and
schemasFromEnabledTables now goes through parseDottedTableEntry so it
recovers the schema correctly from both 2-part and 3-part entries.
- Export parseDottedTableEntry from enabled-tables.ts (@internal) for picker
reuse.
Update listTables expectations in all seven connector tests and the setup /
picker test fixtures. Add a picker regression test that covers the
catalog-bearing round-trip (save + refine).
* fix(cli): allow debug telemetry under opt-out env
|
||
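The catalog fix described above (a 2-level `{ schema, name }` entry losing the catalog, so BigQuery/Snowflake/SQL Server scope filters never matched) can be shown with a simplified sketch. The functions below reuse names from the commit message for readability, but they are stand-ins written for this example, not the project's implementations, and they assume identifiers contain no dots or quoting.

```typescript
// Simplified stand-in for the picker entry with the added catalog field.
interface TableListEntry {
  catalog: string | null;
  schema: string;
  name: string;
}

// Emit catalog.schema.name when a catalog is known, schema.name otherwise.
function qualifiedTableId(entry: TableListEntry): string {
  return entry.catalog !== null
    ? `${entry.catalog}.${entry.schema}.${entry.name}`
    : `${entry.schema}.${entry.name}`;
}

// Accept both the 2-part and 3-part shapes (assumes no dots inside identifiers).
function parseDottedTableEntry(entry: string): TableListEntry {
  const parts = entry.split(".");
  if (parts.length === 3) return { catalog: parts[0], schema: parts[1], name: parts[2] };
  if (parts.length === 2) return { catalog: null, schema: parts[0], name: parts[1] };
  throw new Error(`expected schema.name or catalog.schema.name, got "${entry}"`);
}

// Recover the distinct schemas from saved enabled_tables entries.
function schemasFromEnabledTables(entries: string[]): string[] {
  return Array.from(new Set(entries.map((e) => parseDottedTableEntry(e).schema)));
}
```

With the 3-part form saved, the schema is the middle segment rather than the first one, which is why schema recovery has to go through the parser instead of splitting on the first dot.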
|
|
b0dd13ce7c
|
feat(telemetry): anonymous posthog usage telemetry across node cli and python daemon (#205)
* feat: add telemetry phase 1
* feat: add node telemetry event catalog
* feat: add telemetry event helpers
* feat: emit setup and connection telemetry
* feat: emit connection and stack telemetry
* feat: emit ingest and scan telemetry
* feat: emit query telemetry
* feat: emit sampled mcp telemetry
* docs: expand telemetry event catalog
* feat: add telemetry schema sync artifact
* feat: pass telemetry project id to semantic daemon
* feat: add daemon telemetry foundation
* feat: emit semantic daemon telemetry
* feat: emit daemon lifecycle telemetry
* docs: document full telemetry event catalog
* feat(telemetry): dim first-run notice
* feat(telemetry): show first-run notice before command output
* feat(telemetry): wire ktx PostHog project for live ingestion
* docs(telemetry): drop posthog project name and host from storage section
* docs(telemetry): trim to general overview and disclaimer
* docs(agents): add short telemetry guidelines
* feat(telemetry): enable posthog geoip enrichment
* docs(telemetry): drop ip-geoip note from public overview
* refactor(telemetry): drop no-op groupIdentify, rely on capture groups field
* fix(telemetry): respect CI kill switch in python daemon identity
* fix(sql): route table-count analysis to existing analyze-batch endpoint
* fix(telemetry): emit install_first_run from notice path and derive flagsPresent from commander
* fix(telemetry): read package info via getKtxCliPackageInfo to satisfy boundary check
* fix(telemetry): make python identity env={} bypass os.environ and unset CI in tests
* fix(telemetry): unset CI kill switch in cli-program-telemetry tests
|
||
|
|
2366b00301
|
chore(workspace): gate dead-code with knip production mode (#196)
* refactor(workspace): relocate @ktx/llm source into packages/cli/src/llm
* refactor(workspace): rewrite @ktx/llm imports to relative paths
* refactor(workspace): fold internal packages into cli
* chore(workspace): gate dead-code with knip production mode
Turn on production-mode knip plus an autofix run in pre-commit and the `pnpm dead-code` script, document the `/** @internal */` convention for test-only exports in AGENTS.md, annotate test-only exports across the CLI with that JSDoc, and drop dead exports/wrappers the new gate surfaced (e.g. `cli-project.ts`, `lookerRuntimeSourceToFileAdapterSource`, `createLocalScanEnrichmentProvidersFromConfig`, `PGLITE_OWNER_PROCESS_BACKEND_CAPABILITIES`, stale type re-exports). Replace the loose `ignoreIssues` allowlist in `knip.json` with explicit production entries so cross-package barrel leaks are caught.
* refactor(cli): delete internal barrel index.ts files
The 34 `index.ts` re-export barrels inside `packages/cli/src/` were holdovers from the pre-fold multi-workspace structure. Post-fold-in they served no production purpose: external consumers go through the single package main entry, and in-repo callers mostly imported through them only because the path was short. Internally, knip flagged most barrel re-exports as production-dead (only reached via tests).
This change:
- Deletes every internal barrel except `packages/cli/src/index.ts` (the published package entry).
- Rewrites ~270 source/test files to import each name directly from the file that defines it.
- Moves `tools/warehouse-verification/index.ts` to `create-warehouse-verification-tools.ts` (the function it defined locally) and updates its single consumer.
- Renames `search/backend-conformance.ts` → `.test-utils.ts` to match the existing test-helper file convention.
- Deletes 13 dead test-only chains (dbt-descriptions/*, live-database/extracted-schema, live-database/structural-sync, relationship-* feedback/review chain) plus their tests and a cascading orphan integration test.
- Updates test mocks that pointed at deleted barrel paths (notion-client, connector barrels in scan/local-scan-connectors tests) to mock the source files instead.
- Points the maintainer benchmark script (`scripts/relationship-benchmark-report.mjs`) at source files instead of `dist/context/scan/index.js`.
- Drops the barrel `!` entries from `knip.json`; adds explicit production entries only for the benchmark code reached via dist by the maintainer script.
Net: 413 files changed, ~1.2k insertions, ~9.4k deletions. `pnpm run dead-code` (Biome + knip default + knip production) and `pnpm run type-check` are clean; 2277 tests pass.
* refactor(workspace): rename @ktx/cli to @kaelio/ktx and pack it directly
Promote the CLI workspace package to the public name `@kaelio/ktx` and drop the separate `scripts/build-public-npm-package.mjs` wrapper. The CLI package is now publishable in place (`publishConfig.access: public`, `provenance: true`), so artifact packing uses `pnpm pack` against `packages/cli/` instead of assembling a parallel package tree. Updates all workspace filter invocations, docs, tests, and release readiness checks to reference the new package name, and folds the tarball-name helper into `scripts/public-npm-release-metadata.mjs`.
* docs: align "agent clients" and "data agents" terminology
Replace "client agents" with "agent clients" and "database agents" with "data agents" across AGENTS.md, README.md, the docs-site copy, and the matching setup-agents test description, matching the canonical vocabulary in docs/terminology.md. Also moves packages/cli/tsconfig.json's tsBuildInfoFile from node_modules/.cache/ to dist/.tsbuildinfo so incremental builds survive node_modules reinstalls.
* refactor(release): single source of truth for package version
Make packages/cli/package.json the single source of truth for the @kaelio/ktx version. publicNpmPackageVersion() now reads it directly, so artifact filenames, release-readiness checks, and the Python wheel version all derive from one field. The duplicate release-policy.json.publicNpmPackageVersion is removed. Previously the two fields could drift: tarballs were named kaelio-ktx-0.4.1.tgz while internally containing @kaelio/ktx@0.0.0-private.
- update-public-release-version.mjs rewrites both Python pyproject.toml files (ktx-daemon, ktx-sl) alongside the npm package.jsons, normalizing the version for PEP 440 (e.g. 0.1.0-rc.2 -> 0.1.0rc2).
- semantic-release-config.cjs adds the two pyproject.toml files to @semantic-release/git assets so the release commit back to main carries every version source in lockstep.
- The six "?? '0.0.0-private'" fallback literals across the CLI are replaced with "?? getKtxCliPackageInfo().version", and createDefaultKtxMcpServer makes its version arg required.
- docs/release.md describes the actual commit-back model: the dev tree always reflects the most recent release; no sentinel pin to maintain.
Verified: pnpm run artifacts:build now produces kaelio-ktx-0.4.1.tgz and kaelio_ktx-0.4.1-py3-none-any.whl with @kaelio/ktx@0.4.1 inside. Full type-check, dead-code, and 2287 vitests + 173 script tests pass.
* refactor(cli): inject embedding provider resolution and detect sentence-transformers runtime
Make resolveProjectEmbeddingProvider and runtimeIo injectable in ingest and scan command entrypoints so tests can stub them, and teach resolvePublicIngestRuntimeRequirements to flag the local-embeddings runtime feature when ktx.yaml selects sentence-transformers.
* chore(cli): mark buildLocalStatsStatus and LocalStatsStatus as @internal
Both symbols are consumed only by status-project.test.ts. Annotating with /** @internal */ keeps knip's production-mode check clean without changing runtime behavior.
* fix(cli): use real package metadata in print-command-tree
The stubbed package name embedded a forbidden product identifier that tripped the boundary check in CI. Read the metadata from package.json instead; this keeps the rendered tree unchanged and removes a duplicate source of truth.
* feat(cli): show embedding coverage in `ktx status`, drop duplicate disk counts
Inline `(N embedded)` next to the Wiki scope counts and Semantic-layer source counts, computed with `SUM(embedding_json IS NOT NULL)` over `knowledge_pages` and `local_sl_sources`. Rename the "Knowledge" label to "Wiki" (canonical per `docs/terminology.md`) and rename the matching `localStats.knowledgePages` field to `localStats.wikiPages`.
Drop `wiki=N md` and `semantic-layer=N yaml` from the Disk row, since those duplicated the per-surface rows above. Disk now reports only actual byte usage (db, cache, raw-sources). The unused `wikiGlobalMarkdownCount` / `semanticLayerYamlCount` fields, the `isMarkdownEntry` / `isYamlEntry` helpers, and the `filter` arg on `summarizeDir` are removed.
|
||
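The PEP 440 normalization mentioned in the release refactor above (0.1.0-rc.2 -> 0.1.0rc2) can be sketched as below. `toPep440` is a hypothetical helper written for this example, not the code in update-public-release-version.mjs, and it only handles plain `x.y.z` versions with an optional `alpha`, `beta`, or `rc` pre-release tag followed by a number.

```typescript
// Hypothetical helper: convert a semver-style version into its PEP 440 spelling.
function toPep440(version: string): string {
  const match = /^(\d+\.\d+\.\d+)(?:-(alpha|beta|rc)\.(\d+))?$/.exec(version);
  if (!match) {
    throw new Error(`unsupported version format: ${version}`);
  }
  const [, release, label, serial] = match;
  if (label === undefined) return release; // final release, nothing to rewrite

  // PEP 440 spells pre-release segments as a, b, and rc with no separators.
  const pepLabel = label === "alpha" ? "a" : label === "beta" ? "b" : "rc";
  return `${release}${pepLabel}${serial}`;
}
```

Final releases pass through unchanged, so the npm and Python artifacts share the same version string whenever there is no pre-release tag.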
|
|
56a967278a
|
chore(docs-site): add dev shortcut and fix hero heading clipping (#190)
* chore(docs-site): add dev shortcut and fix hero heading clipping
- Add `pnpm docs` script that frees port 3000 then runs the docs-site dev server, so the docs preview is one command away.
- Bump hero heading line-height to 1.2 and add 0.15em bottom padding so the gradient text-clip no longer cuts off descenders.
- Sync auto-generated next-env.d.ts to the current Next types path.
* fix(ci): unblock CI on docs-font branch
- Add lsof to knip ignoreBinaries so the new `pnpm docs` script (which uses `lsof -ti:3000` to free port 3000) does not trip the Unlisted binaries check.
- Make CLI version assertions read @ktx/cli/package.json at runtime instead of hardcoding 0.0.0-private. The 0.4.0 release commit on main bumped the package version, breaking 18 hardcoded test cases in index.test.ts and admin-reindex.test.ts; reading the version dynamically keeps the suite green across future version bumps.
* fix ci release version fixtures
|
||
|
|
9c07038368
|
ci: simplify ktx release flow (#149)
|
||
|
|
b565e44a22
|
feat: add claude-code llm backend with runtime port (#115)
* docs: revise claude-code ingest backend spec
* docs: keep claude-code spec focused on ingest
* docs: expand claude-code spec to full llm parity
* Refine claude-code backend spec after adversarial review iteration 1
* Refine claude-code backend spec after adversarial review iteration 2
* Refine claude-code backend spec after adversarial review iteration 3
* feat: recognize claude-code llm backend
* feat: add ktx llm runtime port
* feat: add claude-code llm runtime
* feat: route non-agent llm calls through runtime
* feat: run ingest agents through llm runtime
* feat: support claude-code setup and status
* test: verify claude-code backend runtime
* docs: add claude-code backend v1 runtime plan
* fix: close claude-code runtime isolation checks
* fix: warn on claude-code prompt caching during setup
* chore: verify claude-code v1 closure
* docs: add claude-code backend v1 isolation closure plan
* fix: update claude-code ingest setup guidance
* docs: add claude-code backend v1 ingest guidance closure plan
* docs: align claude-code isolation spec with sdk metadata
* test: cover claude-code host discovery metadata
* fix: tolerate claude-code host discovery metadata
* docs: clarify claude-code host discovery metadata
* docs: add claude-code auth-probe isolation fix plan
* chore: prepare kaelio ktx rc1 release
* chore: add semantic release workflow
* fix: unblock ci checks
* chore(release): 0.1.0-rc.1
* feat: add Claude Code model selection to setup
* fix: keep git maintenance attached in local repos
|
||
|
|
0a261fe8a4
|
ci: add codecov coverage reporting (#82)
* ci: add codecov coverage reporting
* ci: fix codecov and secret scan checks
* ci: fix smoke and artifact checks
|
||
|
|
bcb0d2f8f7
|
chore: add TypeScript dead-code checks (#60)
* chore: add TypeScript dead-code checks
* chore: trim stale Knip ignores
* Fix CI smoke and artifact checks
|