ktx/packages/cli/src/text-ingest.ts
Andrey Avtomonov 2366b00301
chore(workspace): gate dead-code with knip production mode (#196)
* refactor(workspace): relocate @ktx/llm source into packages/cli/src/llm

* refactor(workspace): rewrite @ktx/llm imports to relative paths

* refactor(workspace): fold internal packages into cli

* chore(workspace): gate dead-code with knip production mode

Turn on production-mode knip plus an autofix run in pre-commit and the
`pnpm dead-code` script, document the `/** @internal */` convention for
test-only exports in AGENTS.md, annotate test-only exports across the
CLI with that JSDoc, and drop dead exports/wrappers the new gate
surfaced (e.g. `cli-project.ts`, `lookerRuntimeSourceToFileAdapterSource`,
`createLocalScanEnrichmentProvidersFromConfig`,
`PGLITE_OWNER_PROCESS_BACKEND_CAPABILITIES`, stale type re-exports).
Replace the loose `ignoreIssues` allowlist in `knip.json` with explicit
production entries so cross-package barrel leaks are caught.

* refactor(cli): delete internal barrel index.ts files

The 34 `index.ts` re-export barrels inside `packages/cli/src/` were
holdovers from the pre-fold multi-workspace structure. Post-fold-in they
served no production purpose: external consumers go through the single
package main entry, and in-repo callers mostly imported through them
only because the path was short. Internally, knip flagged most barrel
re-exports as production-dead (only reached via tests).

This change:
- Deletes every internal barrel except `packages/cli/src/index.ts`
  (the published package entry).
- Rewrites ~270 source/test files to import each name directly from
  the file that defines it.
- Moves `tools/warehouse-verification/index.ts` to
  `create-warehouse-verification-tools.ts` (the function it defined
  locally) and updates its single consumer.
- Renames `search/backend-conformance.ts` → `.test-utils.ts` to match
  the existing test-helper file convention.
- Deletes 13 dead test-only chains (dbt-descriptions/*,
  live-database/extracted-schema, live-database/structural-sync,
  relationship-* feedback/review chain) plus their tests and a
  cascading orphan integration test.
- Updates test mocks that pointed at deleted barrel paths
  (notion-client, connector barrels in scan/local-scan-connectors
  tests) to mock the source files instead.
- Points the maintainer benchmark script
  (`scripts/relationship-benchmark-report.mjs`) at source files
  instead of `dist/context/scan/index.js`.
- Drops the barrel `!` entries from `knip.json`; adds explicit
  production entries only for the benchmark code reached via dist by
  the maintainer script.

Net: 413 files changed, ~1.2k insertions, ~9.4k deletions.

`pnpm run dead-code` (Biome + knip default + knip production) and
`pnpm run type-check` are clean; 2277 tests pass.

* refactor(workspace): rename @ktx/cli to @kaelio/ktx and pack it directly

Promote the CLI workspace package to the public name `@kaelio/ktx` and
drop the separate `scripts/build-public-npm-package.mjs` wrapper. The
CLI package is now publishable in place (`publishConfig.access: public`,
`provenance: true`), so artifact packing uses `pnpm pack` against
`packages/cli/` instead of assembling a parallel package tree.

Updates all workspace filter invocations, docs, tests, and release
readiness checks to reference the new package name, and folds the
tarball-name helper into `scripts/public-npm-release-metadata.mjs`.

* docs: align "agent clients" and "data agents" terminology

Replace "client agents" with "agent clients" and "database agents" with
"data agents" across AGENTS.md, README.md, the docs-site copy, and the
matching setup-agents test description, matching the canonical
vocabulary in docs/terminology.md.

Also moves packages/cli/tsconfig.json's tsBuildInfoFile from
node_modules/.cache/ to dist/.tsbuildinfo so incremental builds survive
node_modules reinstalls.

* refactor(release): single source of truth for package version

Make packages/cli/package.json the single source of truth for the
@kaelio/ktx version. publicNpmPackageVersion() now reads it directly,
so artifact filenames, release-readiness checks, and the Python wheel
version all derive from one field. The duplicate
release-policy.json.publicNpmPackageVersion is removed.

Previously the two fields could drift: tarballs were named
kaelio-ktx-0.4.1.tgz while internally containing
@kaelio/ktx@0.0.0-private.

- update-public-release-version.mjs rewrites both Python pyproject.toml
  files (ktx-daemon, ktx-sl) alongside the npm package.jsons,
  normalizing the version for PEP 440 (e.g. 0.1.0-rc.2 -> 0.1.0rc2).
- semantic-release-config.cjs adds the two pyproject.toml files to
  @semantic-release/git assets so the release commit back to main
  carries every version source in lockstep.
- The six "?? '0.0.0-private'" fallback literals across the CLI are
  replaced with "?? getKtxCliPackageInfo().version", and
  createDefaultKtxMcpServer makes its version arg required.
- docs/release.md describes the actual commit-back model: the dev tree
  always reflects the most recent release; no sentinel pin to
  maintain.
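
The PEP 440 normalization mentioned in the list above can be sketched as follows. This is a hypothetical helper, not the code in `update-public-release-version.mjs`: the name `toPep440Version` and the set of handled pre-release tags (`alpha`, `beta`, `rc`) are assumptions made for illustration.

```typescript
// Hypothetical helper: maps a semver-style version to a PEP 440 version.
// Only the x.y.z and x.y.z-<tag>.<n> shapes are handled in this sketch.
const PRE_RELEASE_TAGS: Record<string, string> = { alpha: 'a', beta: 'b', rc: 'rc' };

function toPep440Version(version: string): string {
  const match = /^(\d+\.\d+\.\d+)(?:-(alpha|beta|rc)\.(\d+))?$/.exec(version);
  if (match === null) {
    throw new Error(`Unsupported version format: ${version}`);
  }
  const release = match[1] ?? '';
  const tag = match[2];
  const serial = match[3];
  // Final releases have no pre-release segment and pass through unchanged.
  if (tag === undefined || serial === undefined) {
    return release;
  }
  return `${release}${PRE_RELEASE_TAGS[tag]}${serial}`;
}

console.log(toPep440Version('0.1.0-rc.2')); // 0.1.0rc2
console.log(toPep440Version('0.4.1')); // 0.4.1
```

Rejecting unknown shapes with an error, rather than passing them through, is one way to keep the npm and Python versions from drifting silently.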

Verified: pnpm run artifacts:build now produces
kaelio-ktx-0.4.1.tgz and kaelio_ktx-0.4.1-py3-none-any.whl with
@kaelio/ktx@0.4.1 inside. Full type-check, dead-code, and
2287 vitests + 173 script tests pass.

* refactor(cli): inject embedding provider resolution and detect sentence-transformers runtime

Make resolveProjectEmbeddingProvider and runtimeIo injectable in ingest and
scan command entrypoints so tests can stub them, and teach
resolvePublicIngestRuntimeRequirements to flag the local-embeddings runtime
feature when ktx.yaml selects sentence-transformers.

* chore(cli): mark buildLocalStatsStatus and LocalStatsStatus as @internal

Both symbols are consumed only by status-project.test.ts. Annotating with
/** @internal */ keeps knip's production-mode check clean without changing
runtime behavior.

* fix(cli): use real package metadata in print-command-tree

The stubbed package name embedded a forbidden product identifier that
tripped the boundary check in CI. Read the metadata from package.json
instead — keeps the rendered tree unchanged and removes a duplicate
source of truth.

* feat(cli): show embedding coverage in `ktx status`, drop duplicate disk counts

Inline `(N embedded)` next to the Wiki scope counts and Semantic-layer
source counts, computed with `SUM(embedding_json IS NOT NULL)` over
`knowledge_pages` and `local_sl_sources`. Rename the "Knowledge" label to
"Wiki" (canonical per `docs/terminology.md`) and rename the matching
`localStats.knowledgePages` field to `localStats.wikiPages`.
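
As a rough in-memory illustration of the coverage count described above: the row shape, function names, and output layout below are assumptions, and only the `embedding_json` column name and the `(N embedded)` suffix come from this message.

```typescript
// Hypothetical row shape; only the embedding_json column name comes from the text above.
interface EmbeddableRow {
  embedding_json: string | null;
}

// Roughly mirrors SELECT COUNT(*), SUM(embedding_json IS NOT NULL) over a table.
function embeddingCoverage(rows: EmbeddableRow[]): { total: number; embedded: number } {
  const embedded = rows.filter((row) => row.embedding_json !== null).length;
  return { total: rows.length, embedded };
}

// Hypothetical formatter for the inline "(N embedded)" suffix.
function formatCoverage(coverage: { total: number; embedded: number }): string {
  return `${coverage.total} (${coverage.embedded} embedded)`;
}

const sampleRows: EmbeddableRow[] = [
  { embedding_json: '[0.1, 0.2]' },
  { embedding_json: null },
  { embedding_json: '[0.3, 0.4]' },
];
console.log(formatCoverage(embeddingCoverage(sampleRows))); // 3 (2 embedded)
```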

Drop `wiki=N md` and `semantic-layer=N yaml` from the Disk row — those
duplicated the per-surface rows above. Disk now reports only actual byte
usage (db, cache, raw-sources). The unused `wikiGlobalMarkdownCount` /
`semanticLayerYamlCount` fields, the `isMarkdownEntry` / `isYamlEntry`
helpers, and the `filter` arg on `summarizeDir` are removed.
2026-05-21 15:28:58 +02:00


import { readFile as fsReadFile } from 'node:fs/promises';
import { basename, resolve } from 'node:path';
import { createLocalProjectMemoryIngest } from './context/memory/local-memory.js';
import type { MemoryAgentInput } from './context/memory/types.js';
import type { MemoryIngestStatus } from './context/memory/memory-runs.js';
import { loadKtxProject, type KtxLocalProject } from './context/project/project.js';
import type { KtxCliIo } from './cli-runtime.js';
import { createRepainter, initViewState, renderContextBuildView, type ContextBuildTargetState } from './context-build-view.js';
import { formatDuration } from './demo-metrics.js';
import type { KtxPublicIngestPlanTarget } from './public-ingest.js';

export interface KtxTextIngestArgs {
  projectDir: string;
  texts: string[];
  files: string[];
  connectionId?: string;
  userId: string;
  json: boolean;
  failFast: boolean;
}

/** @internal */
export interface TextMemoryIngestPort {
  ingest(input: MemoryAgentInput): Promise<{ runId: string }>;
  waitForRun(runId: string): Promise<void>;
  status(runId: string): Promise<MemoryIngestStatus | null>;
}

interface TextIngestItem {
  label: string;
  content: string;
}

interface TextIngestResult {
  label: string;
  runId: string | null;
  status: 'done' | 'error';
  captured: MemoryIngestStatus['captured'];
  commitHash: string | null;
  error: string | null;
}

export interface KtxTextIngestDeps {
  loadProject?: (options: { projectDir: string }) => Promise<KtxLocalProject>;
  createMemoryIngest?: (project: KtxLocalProject) => TextMemoryIngestPort;
  readFile?: (path: string) => Promise<string>;
  readStdin?: () => Promise<string>;
  now?: () => number;
}

/** Maximum length, in code points, of the preview used to label an inline text item. */
const INLINE_TEXT_LABEL_MAX_LENGTH = 50;
/** Matches ANSI CSI escape sequences (for example color codes) so previews can drop them. */
const ANSI_ESCAPE_PATTERN = /\x1B\[[0-?]*[ -/]*[@-~]/g;

function defaultCreateMemoryIngest(project: KtxLocalProject): TextMemoryIngestPort {
  return createLocalProjectMemoryIngest(project);
}

/** Reads all of stdin as UTF-8 text. */
async function defaultReadStdin(): Promise<string> {
  const chunks: string[] = [];
  process.stdin.setEncoding('utf-8');
  for await (const chunk of process.stdin) {
    chunks.push(String(chunk));
  }
  return chunks.join('');
}

async function defaultReadFile(path: string): Promise<string> {
  return await fsReadFile(path, 'utf-8');
}

function emptyCaptured(): MemoryIngestStatus['captured'] {
  return { wiki: [], sl: [], xrefs: [] };
}

/** Strips ANSI escapes, replaces control characters with spaces, collapses whitespace, and trims. */
function normalizedTextPreview(content: string): string {
  return content
    .replace(ANSI_ESCAPE_PATTERN, '')
    .replace(/[\u0000-\u001f\u007f-\u009f]/g, ' ')
    .replace(/\s+/g, ' ')
    .trim();
}

/** Truncates by code point, reserving three characters for the trailing "...". */
function truncateLabel(label: string, maxLength = INLINE_TEXT_LABEL_MAX_LENGTH): string {
  const chars = Array.from(label);
  if (chars.length <= maxLength) {
    return label;
  }
  return `${chars.slice(0, maxLength - 3).join('').trimEnd()}...`;
}

function quoteInlineTextLabel(label: string): string {
  return JSON.stringify(label);
}

/** Appends " (2)", " (3)", ... until the label is unused, truncating the base so the suffix fits. */
function makeUniqueLabel(label: string, usedLabels: Set<string>): string {
  if (!usedLabels.has(label)) {
    return label;
  }
  for (let index = 2; ; index++) {
    const suffix = ` (${index})`;
    const candidate = `${truncateLabel(label, INLINE_TEXT_LABEL_MAX_LENGTH - suffix.length)}${suffix}`;
    if (!usedLabels.has(candidate)) {
      return candidate;
    }
  }
}

/** Labels inline text by its quoted preview, falling back to `text-N` when the preview is empty. */
function textLabel(content: string, index: number, usedLabels: Set<string>): string {
  const preview = normalizedTextPreview(content);
  const baseLabel = preview.length > 0 ? quoteInlineTextLabel(truncateLabel(preview)) : `text-${index + 1}`;
  return makeUniqueLabel(baseLabel, usedLabels);
}

/** Wraps unquoted labels in double quotes; labels that already start with a quote pass through. */
function artifactReference(label: string): string {
  return label.startsWith('"') ? label : `"${label}"`;
}

/** Returns `stdin` for the first stdin item and `stdin-N` for later ones. */
function stdinLabel(items: TextIngestItem[]): string {
  if (!items.some((item) => item.label === 'stdin')) {
    return 'stdin';
  }
  return `stdin-${items.filter((item) => item.label.startsWith('stdin')).length + 1}`;
}

async function loadItems(args: KtxTextIngestArgs, deps: KtxTextIngestDeps): Promise<TextIngestItem[]> {
  const items: TextIngestItem[] = [];
  const usedTextLabels = new Set<string>();
  args.texts.forEach((content, index) => {
    const label = textLabel(content, index, usedTextLabels);
    usedTextLabels.add(label);
    items.push({ label, content });
  });

  const readFile = deps.readFile ?? defaultReadFile;
  const readStdin = deps.readStdin ?? defaultReadStdin;
  for (const file of args.files) {
    if (file === '-') {
      items.push({ label: stdinLabel(items), content: await readStdin() });
    } else {
      const path = resolve(file);
      items.push({ label: basename(path), content: await readFile(path) });
    }
  }
  return items;
}

function validateItems(items: TextIngestItem[], io: KtxCliIo): boolean {
  if (items.length === 0) {
    io.stderr.write('Provide at least one text item with --text, a file path, or - for stdin.\n');
    return false;
  }
  for (const item of items) {
    if (item.content.trim().length === 0) {
      io.stderr.write(`Text item "${item.label}" is empty.\n`);
      return false;
    }
  }
  return true;
}

function makeTarget(label: string): KtxPublicIngestPlanTarget {
  return {
    connectionId: label,
    driver: 'text',
    operation: 'source-ingest',
    debugCommand: '',
    steps: ['memory-update'],
  };
}

function allTargets(state: ReturnType<typeof initViewState>): ContextBuildTargetState[] {
  return [...state.primarySources, ...state.contextSources];
}

function renderTextIngestView(state: ReturnType<typeof initViewState>, styled: boolean): string {
  return renderContextBuildView(state, {
    styled,
    title: 'Ingesting text memory',
    contextGroupLabel: 'Texts',
    sourceIngestRunningText: 'capturing...',
    completedItemName: { singular: 'text', plural: 'texts' },
  });
}

function summarizeCaptured(captured: MemoryIngestStatus['captured']): string {
  const parts = [
    `wiki=${captured.wiki.length}`,
    `sl=${captured.sl.length}`,
    `xrefs=${captured.xrefs.length}`,
  ];
  return parts.join(', ');
}

function resultFromStatus(label: string, status: MemoryIngestStatus): TextIngestResult {
  return {
    label,
    runId: status.runId,
    status: status.status === 'done' ? 'done' : 'error',
    captured: status.captured,
    commitHash: status.commitHash,
    error: status.error,
  };
}

function errorResult(label: string, runId: string | null, error: unknown): TextIngestResult {
  return {
    label,
    runId,
    status: 'error',
    captured: emptyCaptured(),
    commitHash: null,
    error: error instanceof Error ? error.message : String(error),
  };
}

function writeJsonResult(args: KtxTextIngestArgs, results: TextIngestResult[], io: KtxCliIo): void {
  io.stdout.write(
    `${JSON.stringify(
      {
        status: results.some((result) => result.status === 'error') ? 'failed' : 'done',
        projectDir: args.projectDir,
        connectionId: args.connectionId ?? null,
        results,
      },
      null,
      2,
    )}\n`,
  );
}

function writePlainFailures(results: TextIngestResult[], io: KtxCliIo): void {
  const failures = results.filter((result) => result.status === 'error');
  if (failures.length === 0) {
    return;
  }
  io.stdout.write('\nFailed text items:\n');
  for (const result of failures) {
    io.stdout.write(` ${result.label}: ${result.error ?? 'failed'}\n`);
  }
}

export async function runKtxTextIngest(
  args: KtxTextIngestArgs,
  io: KtxCliIo,
  deps: KtxTextIngestDeps = {},
): Promise<number> {
  const items = await loadItems(args, deps);
  if (!validateItems(items, io)) {
    return 1;
  }

  const project = await (deps.loadProject ?? loadKtxProject)({ projectDir: args.projectDir });
  const memoryIngest = (deps.createMemoryIngest ?? defaultCreateMemoryIngest)(project);
  const now = deps.now ?? (() => Date.now());
  const batchId = now();
  const state = initViewState(items.map((item) => makeTarget(item.label)));
  const targets = allTargets(state);
  const isTTY = io.stdout.isTTY === true && args.json !== true;
  const repainter = isTTY ? createRepainter(io) : null;
  const results: TextIngestResult[] = [];
  state.startedAt = now();

  const paint = () => repainter?.paint(renderTextIngestView(state, true));
  paint();

  let spinnerInterval: ReturnType<typeof setInterval> | null = null;
  if (repainter) {
    spinnerInterval = setInterval(() => {
      const current = now();
      state.frame++;
      state.totalElapsedMs = state.startedAt === null ? 0 : current - state.startedAt;
      for (const target of targets) {
        if (target.status === 'running' && target.startedAt !== null) {
          target.elapsedMs = current - target.startedAt;
        }
      }
      paint();
    }, 140);
  }

  try {
    for (let index = 0; index < items.length; index++) {
      const item = items[index]!;
      const target = targets[index]!;
      target.status = 'running';
      target.startedAt = now();
      target.detailLine = 'capturing...';
      target.progressUpdatedAtMs = target.startedAt;
      paint();

      let runId: string | null = null;
      let result: TextIngestResult;
      try {
        const ingestInput: MemoryAgentInput = {
          userId: args.userId,
          chatId: `cli-text-ingest-${batchId}-${index + 1}`,
          userMessage: `Ingest external text artifact ${artifactReference(item.label)} into KTX memory.`,
          assistantMessage: item.content.trim(),
          ...(args.connectionId ? { connectionId: args.connectionId } : {}),
          sourceType: 'external_ingest',
        };
        const ingest = await memoryIngest.ingest(ingestInput);
        runId = ingest.runId;
        await memoryIngest.waitForRun(runId);
        const status = await memoryIngest.status(runId);
        if (!status) {
          throw new Error(`Memory ingest run "${runId}" was not found.`);
        }
        result = resultFromStatus(item.label, status);
      } catch (error) {
        result = errorResult(item.label, runId, error);
      }

      results.push(result);
      target.elapsedMs = now() - (target.startedAt ?? now());
      target.detailLine = null;
      target.status = result.status === 'done' ? 'done' : 'failed';
      target.summaryText = result.status === 'done' ? summarizeCaptured(result.captured) : null;
      target.failureText = result.status === 'error' ? result.error : null;
      paint();

      if (result.status === 'error' && args.failFast) {
        break;
      }
    }
  } finally {
    if (spinnerInterval) {
      clearInterval(spinnerInterval);
    }
  }

  if (state.startedAt !== null) {
    state.totalElapsedMs = now() - state.startedAt;
  }

  if (args.json) {
    writeJsonResult(args, results, io);
  } else if (repainter) {
    repainter.paint(renderTextIngestView(state, true));
    writePlainFailures(results, io);
  } else {
    io.stdout.write(renderTextIngestView(state, false));
    writePlainFailures(results, io);
  }

  if (!args.json && results.length > 0) {
    const duration = state.totalElapsedMs > 0 ? ` in ${formatDuration(state.totalElapsedMs)}` : '';
    const outcome = results.some((result) => result.status === 'error') ? 'finished with failures' : 'finished';
    io.stdout.write(`Text memory ingest ${outcome}${duration}.\n`);
  }

  return results.some((result) => result.status === 'error') ? 1 : 0;
}
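
As a worked illustration of how inline-text labels are truncated and de-duplicated, the standalone sketch below copies the logic of `truncateLabel` and `makeUniqueLabel` from this file and runs it on sample inputs. The sample strings and the `MAX_LABEL_LENGTH` name are made up for this example.

```typescript
const MAX_LABEL_LENGTH = 50; // mirrors INLINE_TEXT_LABEL_MAX_LENGTH

function truncateLabel(label: string, maxLength = MAX_LABEL_LENGTH): string {
  const chars = Array.from(label);
  if (chars.length <= maxLength) {
    return label;
  }
  return `${chars.slice(0, maxLength - 3).join('').trimEnd()}...`;
}

function makeUniqueLabel(label: string, usedLabels: Set<string>): string {
  if (!usedLabels.has(label)) {
    return label;
  }
  for (let index = 2; ; index++) {
    const suffix = ` (${index})`;
    const candidate = `${truncateLabel(label, MAX_LABEL_LENGTH - suffix.length)}${suffix}`;
    if (!usedLabels.has(candidate)) {
      return candidate;
    }
  }
}

// Three identical inline texts (sample data) receive distinct labels.
const used = new Set<string>();
const labels: string[] = [];
for (const text of ['"weekly revenue"', '"weekly revenue"', '"weekly revenue"']) {
  const label = makeUniqueLabel(text, used);
  used.add(label);
  labels.push(label);
}
console.log(labels);
// [ '"weekly revenue"', '"weekly revenue" (2)', '"weekly revenue" (3)' ]

// A 60-character label is cut to 47 characters plus "...".
console.log(truncateLabel('x'.repeat(60)).length); // 50
```

Because the suffix is budgeted inside the 50-character limit, a long duplicated label loses a few more characters of its preview rather than growing past the limit.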