ktx/scripts/pglite-hybrid-search-spike.mjs
Andrey Avtomonov 2366b00301
chore(workspace): gate dead-code with knip production mode (#196)
* refactor(workspace): relocate @ktx/llm source into packages/cli/src/llm

* refactor(workspace): rewrite @ktx/llm imports to relative paths

* refactor(workspace): fold internal packages into cli

* chore(workspace): gate dead-code with knip production mode

Turn on production-mode knip plus an autofix run in pre-commit and the
`pnpm dead-code` script, document the `/** @internal */` convention for
test-only exports in AGENTS.md, annotate test-only exports across the
CLI with that JSDoc, and drop dead exports/wrappers the new gate
surfaced (e.g. `cli-project.ts`, `lookerRuntimeSourceToFileAdapterSource`,
`createLocalScanEnrichmentProvidersFromConfig`,
`PGLITE_OWNER_PROCESS_BACKEND_CAPABILITIES`, stale type re-exports).
Replace the loose `ignoreIssues` allowlist in `knip.json` with explicit
production entries so cross-package barrel leaks are caught.

* refactor(cli): delete internal barrel index.ts files

The 34 `index.ts` re-export barrels inside `packages/cli/src/` were
holdovers from the pre-fold multi-workspace structure. After the fold-in
they served no production purpose: external consumers go through the
single package main entry, and in-repo callers mostly imported through
them only because the path was short. knip also flagged most barrel
re-exports as production-dead (reached only via tests).

This change:
- Deletes every internal barrel except `packages/cli/src/index.ts`
  (the published package entry).
- Rewrites ~270 source/test files to import each name directly from
  the file that defines it.
- Moves `tools/warehouse-verification/index.ts` to
  `create-warehouse-verification-tools.ts` (the function it defined
  locally) and updates its single consumer.
- Renames `search/backend-conformance.ts` → `.test-utils.ts` to match
  the existing test-helper file convention.
- Deletes 13 dead test-only chains (dbt-descriptions/*,
  live-database/extracted-schema, live-database/structural-sync,
  relationship-* feedback/review chain) plus their tests and a
  cascading orphan integration test.
- Updates test mocks that pointed at deleted barrel paths
  (notion-client, connector barrels in scan/local-scan-connectors
  tests) to mock the source files instead.
- Points the maintainer benchmark script
  (`scripts/relationship-benchmark-report.mjs`) at source files
  instead of `dist/context/scan/index.js`.
- Drops the barrel `!` entries from `knip.json`; adds explicit
  production entries only for the benchmark code reached via dist by
  the maintainer script.

Net: 413 files changed, ~1.2k insertions, ~9.4k deletions.

`pnpm run dead-code` (Biome + knip default + knip production) and
`pnpm run type-check` are clean; 2277 tests pass.

* refactor(workspace): rename @ktx/cli to @kaelio/ktx and pack it directly

Promote the CLI workspace package to the public name `@kaelio/ktx` and
drop the separate `scripts/build-public-npm-package.mjs` wrapper. The
CLI package is now publishable in place (`publishConfig.access: public`,
`provenance: true`), so artifact packing uses `pnpm pack` against
`packages/cli/` instead of assembling a parallel package tree.

Updates all workspace filter invocations, docs, tests, and release
readiness checks to reference the new package name, and folds the
tarball-name helper into `scripts/public-npm-release-metadata.mjs`.

* docs: align "agent clients" and "data agents" terminology

Replace "client agents" with "agent clients" and "database agents" with
"data agents" across AGENTS.md, README.md, the docs-site copy, and the
matching setup-agents test description, matching the canonical
vocabulary in docs/terminology.md.

Also moves packages/cli/tsconfig.json's tsBuildInfoFile from
node_modules/.cache/ to dist/.tsbuildinfo so incremental builds survive
node_modules reinstalls.

* refactor(release): single source of truth for package version

Make packages/cli/package.json the single source of truth for the
@kaelio/ktx version. publicNpmPackageVersion() now reads it directly,
so artifact filenames, release-readiness checks, and the Python wheel
version all derive from one field. The duplicate
release-policy.json.publicNpmPackageVersion is removed.

Previously the two fields could drift: tarballs were named
kaelio-ktx-0.4.1.tgz while internally containing
@kaelio/ktx@0.0.0-private.
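
A minimal sketch of deriving the artifact name from a single version field is shown here. The helper name `npmTarballName` is hypothetical and is not taken from the repository; only the filename shape (`kaelio-ktx-0.4.1.tgz`) comes from this commit message.

```javascript
// Hypothetical helper (not the repository's implementation): build the tarball
// filename from a package name and version. Scoped names drop the leading "@"
// and replace "/" with "-".
function npmTarballName(packageName, version) {
  const flattened = packageName.replace(/^@/, '').replace(/\//g, '-');
  return `${flattened}-${version}.tgz`;
}

// Both values are assumed to come from the same package.json, so the filename
// and the packaged version cannot drift apart.
const packageJson = { name: '@kaelio/ktx', version: '0.4.1' };
console.log(npmTarballName(packageJson.name, packageJson.version));
// prints "kaelio-ktx-0.4.1.tgz"
```

Reading both the name and the version from one object is the point of the sketch: there is no second field that could hold a stale value.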

- update-public-release-version.mjs rewrites both Python pyproject.toml
  files (ktx-daemon, ktx-sl) alongside the npm package.jsons,
  normalizing the version for PEP 440 (e.g. 0.1.0-rc.2 -> 0.1.0rc2).
- semantic-release-config.cjs adds the two pyproject.toml files to
  @semantic-release/git assets so the release commit back to main
  carries every version source in lockstep.
- The six "?? '0.0.0-private'" fallback literals across the CLI are
  replaced with "?? getKtxCliPackageInfo().version", and
  createDefaultKtxMcpServer makes its version arg required.
- docs/release.md describes the actual commit-back model: the dev tree
  always reflects the most recent release; no sentinel pin to
  maintain.
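
The PEP 440 normalization mentioned in the list above can be sketched as follows. This is an illustration under stated assumptions, not the code in `update-public-release-version.mjs`: the function name `toPep440Version` is hypothetical, and the sketch assumes only `alpha`, `beta`, and `rc` pre-release labels with a single numeric counter.

```javascript
// Hypothetical sketch: convert a semver-style version to a PEP 440 version.
// Assumes pre-releases look like "-alpha.N", "-beta.N", or "-rc.N".
function toPep440Version(version) {
  const match = /^(\d+\.\d+\.\d+)(?:-(alpha|beta|rc)\.(\d+))?$/.exec(version);
  if (!match) {
    throw new Error(`Unsupported version format: ${version}`);
  }
  const [, release, label, counter] = match;
  if (!label) {
    return release; // Final releases are already valid PEP 440.
  }
  // PEP 440 spells pre-release segments as "a", "b", and "rc".
  const pepLabel = { alpha: 'a', beta: 'b', rc: 'rc' }[label];
  return `${release}${pepLabel}${counter}`;
}

console.log(toPep440Version('0.1.0-rc.2')); // prints "0.1.0rc2"
console.log(toPep440Version('0.4.1')); // prints "0.4.1"
```

Rejecting unknown formats with an error, rather than passing them through, keeps an unexpected tag from being written into `pyproject.toml` unnoticed.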

Verified: pnpm run artifacts:build now produces
kaelio-ktx-0.4.1.tgz and kaelio_ktx-0.4.1-py3-none-any.whl with
@kaelio/ktx@0.4.1 inside. Full type-check, dead-code, and
2287 vitests + 173 script tests pass.

* refactor(cli): inject embedding provider resolution and detect sentence-transformers runtime

Make resolveProjectEmbeddingProvider and runtimeIo injectable in ingest and
scan command entrypoints so tests can stub them, and teach
resolvePublicIngestRuntimeRequirements to flag the local-embeddings runtime
feature when ktx.yaml selects sentence-transformers.

* chore(cli): mark buildLocalStatsStatus and LocalStatsStatus as @internal

Both symbols are consumed only by status-project.test.ts. Annotating with
/** @internal */ keeps knip's production-mode check clean without changing
runtime behavior.

* fix(cli): use real package metadata in print-command-tree

The stubbed package name embedded a forbidden product identifier that
tripped the boundary check in CI. Read the metadata from package.json
instead, which keeps the rendered tree unchanged and removes a duplicate
source of truth.

* feat(cli): show embedding coverage in `ktx status`, drop duplicate disk counts

Inline `(N embedded)` next to the Wiki scope counts and Semantic-layer
source counts, computed with `SUM(embedding_json IS NOT NULL)` over
`knowledge_pages` and `local_sl_sources`. Rename the "Knowledge" label to
"Wiki" (canonical per `docs/terminology.md`) and rename the matching
`localStats.knowledgePages` field to `localStats.wikiPages`.

Drop `wiki=N md` and `semantic-layer=N yaml` from the Disk row, since
those duplicated the per-surface rows above. Disk now reports only actual
byte usage (db, cache, raw-sources). The unused `wikiGlobalMarkdownCount` /
`semanticLayerYamlCount` fields, the `isMarkdownEntry` / `isYamlEntry`
helpers, and the `filter` arg on `summarizeDir` are removed.
2026-05-21 15:28:58 +02:00


import { readdir, readFile, realpath, rm, stat, writeFile, mkdtemp } from 'node:fs/promises';
import { createRequire } from 'node:module';
import { tmpdir } from 'node:os';
import { dirname, join, relative, resolve } from 'node:path';
import { performance } from 'node:perf_hooks';
import { fileURLToPath } from 'node:url';

const require = createRequire(import.meta.url);
const scriptDir = dirname(fileURLToPath(import.meta.url));
const ktxRoot = resolve(scriptDir, '..');
const docsDir = join(ktxRoot, 'docs');
const reportPath = join(docsDir, 'hybrid-search-pglite-spike.md');

// Runs `fn` and returns its value with the elapsed wall-clock time in
// milliseconds, rounded to two decimals.
async function timed(label, fn) {
  const started = performance.now();
  const value = await fn();
  const durationMs = Number((performance.now() - started).toFixed(2));
  return { label, durationMs, value };
}

// Recursively sums file sizes under `path`. Entries that are neither files
// nor directories count as 0 bytes.
async function directoryBytes(path) {
  const entry = await stat(path);
  if (entry.isFile()) {
    return entry.size;
  }
  if (!entry.isDirectory()) {
    return 0;
  }
  const children = await readdir(path);
  const childSizes = await Promise.all(children.map((child) => directoryBytes(join(path, child))));
  return childSizes.reduce((sum, size) => sum + size, 0);
}

// Walks up from the package's resolved entry point until it finds the
// package.json whose `name` matches `packageName`.
async function resolvePackageJson(packageName) {
  let currentDir = dirname(require.resolve(packageName));
  while (currentDir !== dirname(currentDir)) {
    const packageJsonPath = join(currentDir, 'package.json');
    try {
      const packageJson = JSON.parse(await readFile(packageJsonPath, 'utf8'));
      if (packageJson.name === packageName) {
        return { packageJsonPath, packageJson };
      }
    } catch (error) {
      // A missing package.json at this level is expected; keep walking up.
      if (error?.code !== 'ENOENT') {
        throw error;
      }
    }
    currentDir = dirname(currentDir);
  }
  throw new Error(`Could not resolve package.json for ${packageName}`);
}

// Reports the installed version, the path relative to the ktx root, and the
// on-disk size of a dependency.
async function packageInfo(packageName) {
  const { packageJsonPath, packageJson } = await resolvePackageJson(packageName);
  const packageDir = await realpath(dirname(packageJsonPath));
  return {
    name: packageName,
    version: packageJson.version,
    path: relative(ktxRoot, packageDir),
    bytes: await directoryBytes(packageDir),
  };
}

// Opens (or creates) a PGlite database at `dataDir` with the vector and
// pg_trgm extensions loaded, then ensures the spike schema and indexes exist.
async function createDb(PGlite, vector, pg_trgm, dataDir) {
  const db = await PGlite.create({
    dataDir,
    extensions: {
      vector,
      pg_trgm,
    },
  });
  await db.exec(`
    CREATE EXTENSION IF NOT EXISTS vector;
    CREATE EXTENSION IF NOT EXISTS pg_trgm;

    CREATE TABLE IF NOT EXISTS spike_documents (
      id TEXT PRIMARY KEY,
      search_text TEXT NOT NULL,
      metadata JSONB NOT NULL DEFAULT '{}'::jsonb,
      embedding vector(3) NOT NULL
    );
    CREATE INDEX IF NOT EXISTS spike_documents_fts_idx
      ON spike_documents
      USING GIN (to_tsvector('english', search_text));
    CREATE INDEX IF NOT EXISTS spike_documents_vector_idx
      ON spike_documents
      USING ivfflat (embedding vector_cosine_ops)
      WITH (lists = 1);

    CREATE TABLE IF NOT EXISTS spike_dictionary_values (
      connection_id TEXT NOT NULL,
      source_name TEXT NOT NULL,
      column_name TEXT NOT NULL,
      value TEXT NOT NULL,
      PRIMARY KEY (connection_id, source_name, column_name, value)
    );
    CREATE INDEX IF NOT EXISTS spike_dictionary_values_trgm_idx
      ON spike_dictionary_values
      USING GIN (value gin_trgm_ops);
  `);
  return db;
}

// Inserts three documents with 3-dimensional embeddings and three dictionary
// values. Both statements are idempotent, so reseeding is safe.
async function seed(db) {
  await db.query(
    `
      INSERT INTO spike_documents (id, search_text, metadata, embedding)
      VALUES
        ($1, $2, $3::jsonb, $4::vector),
        ($5, $6, $7::jsonb, $8::vector),
        ($9, $10, $11::jsonb, $12::vector)
      ON CONFLICT (id) DO UPDATE
        SET search_text = EXCLUDED.search_text,
            metadata = EXCLUDED.metadata,
            embedding = EXCLUDED.embedding
    `,
    [
      'warehouse/orders',
      'orders paid revenue refund status customer',
      JSON.stringify({ connectionId: 'warehouse', sourceName: 'orders' }),
      JSON.stringify([1, 0, 0]),
      'finance/orders',
      'orders finance bookings gross margin',
      JSON.stringify({ connectionId: 'finance', sourceName: 'orders' }),
      JSON.stringify([0.72, 0.28, 0]),
      'warehouse/customers',
      'customers accounts lifecycle region',
      JSON.stringify({ connectionId: 'warehouse', sourceName: 'customers' }),
      JSON.stringify([0, 1, 0]),
    ],
  );
  await db.query(`
    INSERT INTO spike_dictionary_values (connection_id, source_name, column_name, value)
    VALUES
      ('warehouse', 'orders', 'status', 'refunded'),
      ('warehouse', 'orders', 'status', 'paid'),
      ('warehouse', 'customers', 'region', 'emea')
    ON CONFLICT DO NOTHING
  `);
}

// Closes the database if the instance exposes a `close` method.
async function closeDb(db) {
  if (typeof db.close === 'function') {
    await db.close();
  }
}

async function main() {
  // Time the dynamic imports separately so module load cost shows up in the report.
  const importTimer = await timed('dynamic import @electric-sql/pglite', async () => {
    const [{ PGlite }, { vector }, { pg_trgm }] = await Promise.all([
      import('@electric-sql/pglite'),
      import('@electric-sql/pglite/vector'),
      import('@electric-sql/pglite/contrib/pg_trgm'),
    ]);
    return { PGlite, vector, pg_trgm };
  });
  const { PGlite, vector, pg_trgm } = importTimer.value;
  const tempDir = await mkdtemp(join(tmpdir(), 'ktx-pglite-report-'));
  const dataDir = join(tempDir, 'pgdata');
  let db;
  let reopened;

  try {
    const createTimer = await timed('create persistent PGlite database and load extensions', async () => {
      db = await createDb(PGlite, vector, pg_trgm, dataDir);
      return true;
    });
    const seedTimer = await timed('seed hybrid search fixture', async () => seed(db));

    // Probe 1: full-text search ranked with ts_rank_cd.
    const ftsTimer = await timed('Postgres FTS query', () =>
      db.query(
        `
          SELECT id
          FROM spike_documents
          WHERE to_tsvector('english', search_text) @@ websearch_to_tsquery('english', $1)
          ORDER BY ts_rank_cd(to_tsvector('english', search_text), websearch_to_tsquery('english', $1)) DESC, id ASC
          LIMIT 1
        `,
        ['paid orders'],
      ),
    );

    // Probe 2: nearest neighbour by cosine distance (`<=>`), reported as similarity.
    const vectorTimer = await timed('pgvector cosine query', () =>
      db.query(
        `
          SELECT id, 1 - (embedding <=> $1::vector) AS similarity
          FROM spike_documents
          ORDER BY embedding <=> $1::vector, id ASC
          LIMIT 1
        `,
        [JSON.stringify([1, 0, 0])],
      ),
    );

    // Probe 3: fuzzy dictionary lookup ranked by trigram similarity.
    const trigramTimer = await timed('pg_trgm dictionary query', () =>
      db.query(
        `
          SELECT connection_id || '/' || source_name AS id, value, similarity(value, $1) AS score
          FROM spike_dictionary_values
          WHERE similarity(value, $1) > 0
          ORDER BY score DESC, id ASC, value ASC
          LIMIT 1
        `,
        ['refund'],
      ),
    );

    // Probe 4: four reads issued concurrently against the same instance.
    const sameInstanceTimer = await timed('same instance parallel reads', () =>
      Promise.all(Array.from({ length: 4 }, () => db.query('SELECT COUNT(*)::int AS count FROM spike_documents'))),
    );

    // Probe 5: try to open the same data directory a second time while the
    // first instance is still open, and record whether that succeeds.
    let secondOpenStatus = 'opened';
    let secondOpenMessage = 'Second direct opener executed SELECT 1.';
    let second;
    try {
      second = await createDb(PGlite, vector, pg_trgm, dataDir);
      await second.query('SELECT 1');
    } catch (error) {
      secondOpenStatus = 'blocked';
      secondOpenMessage = error instanceof Error ? error.message : String(error);
    } finally {
      if (second) {
        await closeDb(second);
      }
    }

    // Close the first instance, then reopen to confirm the rows persisted.
    await closeDb(db);
    db = undefined;
    const reopenTimer = await timed('reopen persistent PGlite database', async () => {
      reopened = await createDb(PGlite, vector, pg_trgm, dataDir);
      return reopened.query('SELECT COUNT(*)::int AS count FROM spike_documents');
    });

    const packages = await Promise.all([
      packageInfo('@electric-sql/pglite'),
      packageInfo('@electric-sql/pglite-socket'),
    ]);

    const result = {
      generatedAt: new Date().toISOString(),
      node: process.version,
      packages,
      timingsMs: {
        import: importTimer.durationMs,
        createAndExtensions: createTimer.durationMs,
        seed: seedTimer.durationMs,
        ftsQuery: ftsTimer.durationMs,
        vectorQuery: vectorTimer.durationMs,
        trigramQuery: trigramTimer.durationMs,
        sameInstanceParallelReads: sameInstanceTimer.durationMs,
        reopen: reopenTimer.durationMs,
      },
      topResults: {
        fts: ftsTimer.value.rows[0]?.id ?? null,
        vector: vectorTimer.value.rows[0]?.id ?? null,
        trigram: trigramTimer.value.rows[0]?.id ?? null,
        persistedRowCount: reopenTimer.value.rows[0]?.count ?? null,
      },
      concurrency: {
        sameInstanceReadCounts: sameInstanceTimer.value.map((queryResult) => queryResult.rows[0]?.count ?? null),
        secondDirectOpenStatus: secondOpenStatus,
        secondDirectOpenMessage: secondOpenMessage,
      },
    };

    const totalPackageBytes = packages.reduce((sum, pkg) => sum + pkg.bytes, 0);
    const recommendation =
      secondOpenStatus === 'opened'
        ? 'Prototype a PGlite backend behind an explicit owner process or socket before exposing CLI plus MCP concurrent access.'
        : 'Use a socket or owner-process architecture for any PGlite backend prototype because direct second opener access was blocked.';

    const markdown = `# Hybrid Search PGlite Spike

Generated: ${result.generatedAt}

## Summary

PGlite loaded in Node ${result.node}, enabled vector and pg_trgm extensions, executed Postgres FTS, pgvector cosine ranking, pg_trgm dictionary ranking, and reopened a persistent filesystem database.

Recommendation: ${recommendation}

## Package Footprint

| Package | Version | Approx bytes | Resolved path |
| --- | --- | ---: | --- |
${packages.map((pkg) => `| \`${pkg.name}\` | \`${pkg.version}\` | ${pkg.bytes} | \`${pkg.path}\` |`).join('\n')}

Total measured package bytes: ${totalPackageBytes}

## Timings

| Probe | Duration ms |
| --- | ---: |
${Object.entries(result.timingsMs)
  .map(([name, ms]) => `| ${name} | ${ms} |`)
  .join('\n')}

## Search Feature Results

| Probe | Top result |
| --- | --- |
| Postgres FTS | \`${result.topResults.fts}\` |
| pgvector cosine | \`${result.topResults.vector}\` |
| pg_trgm dictionary | \`${result.topResults.trigram}\` |
| Reopened persisted row count | \`${result.topResults.persistedRowCount}\` |

## Concurrency Observation

Same-instance parallel read counts: \`${result.concurrency.sameInstanceReadCounts.join(', ')}\`

Second direct opener status: \`${result.concurrency.secondDirectOpenStatus}\`

Second direct opener message:

\`\`\`text
${result.concurrency.secondDirectOpenMessage}
\`\`\`

## Decision

The SQLite backend remains the production default. The next PGlite step, if approved, is an owner-process or socket-backed prototype that reuses the existing \`SearchBackendCapabilities\` and backend conformance helpers without changing the public CLI surface.
`;

    await writeFile(reportPath, markdown);
    process.stdout.write(`Wrote ${relative(process.cwd(), reportPath)}\n`);
    process.stdout.write(JSON.stringify(result, null, 2));
    process.stdout.write('\n');
  } finally {
    // Close whichever handles are still open, then remove the temporary data directory.
    if (db) {
      await closeDb(db);
    }
    if (reopened) {
      await closeDb(reopened);
    }
    await rm(tempDir, { recursive: true, force: true });
  }
}

main().catch((error) => {
  console.error(error);
  process.exitCode = 1;
});