2026-05-10 23:12:26 +02:00
import { readdir, readFile, realpath, rm, stat, writeFile, mkdtemp } from 'node:fs/promises';
import { createRequire } from 'node:module';
import { tmpdir } from 'node:os';
import { dirname, join, relative, resolve } from 'node:path';
import { performance } from 'node:perf_hooks';
import { fileURLToPath } from 'node:url';

const require = createRequire(import.meta.url);
const scriptDir = dirname(fileURLToPath(import.meta.url));
const ktxRoot = resolve(scriptDir, '..');
const docsDir = join(ktxRoot, 'docs');
const reportPath = join(docsDir, 'hybrid-search-pglite-spike.md');

async function timed(label, fn) {
  const started = performance.now();
  const value = await fn();
  const durationMs = Number((performance.now() - started).toFixed(2));
  return { label, durationMs, value };
}

async function directoryBytes(path) {
  const entry = await stat(path);
  if (entry.isFile()) {
    return entry.size;
  }
  if (!entry.isDirectory()) {
    return 0;
  }
  const children = await readdir(path);
  const childSizes = await Promise.all(children.map((child) => directoryBytes(join(path, child))));
  return childSizes.reduce((sum, size) => sum + size, 0);
}

// Walk up from the package's resolved entry point until a package.json
// whose "name" matches the requested package is found.
async function resolvePackageJson(packageName) {
  let currentDir = dirname(require.resolve(packageName));
  while (currentDir !== dirname(currentDir)) {
    const packageJsonPath = join(currentDir, 'package.json');
    try {
      const packageJson = JSON.parse(await readFile(packageJsonPath, 'utf8'));
      if (packageJson.name === packageName) {
        return { packageJsonPath, packageJson };
      }
    } catch (error) {
      // A missing package.json at this level is expected; keep walking up.
      if (error?.code !== 'ENOENT') {
        throw error;
      }
    }
    currentDir = dirname(currentDir);
  }
  throw new Error(`Could not resolve package.json for ${packageName}`);
}

async function packageInfo(packageName) {
  const { packageJsonPath, packageJson } = await resolvePackageJson(packageName);
  const packageDir = await realpath(dirname(packageJsonPath));
  return {
    name: packageName,
    version: packageJson.version,
    path: relative(ktxRoot, packageDir),
    bytes: await directoryBytes(packageDir),
  };
}

async function createDb(PGlite, vector, pg_trgm, dataDir) {
  const db = await PGlite.create({
    dataDir,
    extensions: {
      vector,
      pg_trgm,
    },
  });
  await db.exec(`
    CREATE EXTENSION IF NOT EXISTS vector;
    CREATE EXTENSION IF NOT EXISTS pg_trgm;
    CREATE TABLE IF NOT EXISTS spike_documents (
      id TEXT PRIMARY KEY,
      search_text TEXT NOT NULL,
      metadata JSONB NOT NULL DEFAULT '{}'::jsonb,
      embedding vector(3) NOT NULL
    );
    CREATE INDEX IF NOT EXISTS spike_documents_fts_idx
      ON spike_documents
      USING GIN (to_tsvector('english', search_text));
    CREATE INDEX IF NOT EXISTS spike_documents_vector_idx
      ON spike_documents
      USING ivfflat (embedding vector_cosine_ops)
      WITH (lists = 1);
    CREATE TABLE IF NOT EXISTS spike_dictionary_values (
      connection_id TEXT NOT NULL,
      source_name TEXT NOT NULL,
      column_name TEXT NOT NULL,
      value TEXT NOT NULL,
      PRIMARY KEY (connection_id, source_name, column_name, value)
    );
    CREATE INDEX IF NOT EXISTS spike_dictionary_values_trgm_idx
      ON spike_dictionary_values
      USING GIN (value gin_trgm_ops);
  `);
  return db;
}

async function seed(db) {
  await db.query(
    `
      INSERT INTO spike_documents (id, search_text, metadata, embedding)
      VALUES
        ($1, $2, $3::jsonb, $4::vector),
        ($5, $6, $7::jsonb, $8::vector),
        ($9, $10, $11::jsonb, $12::vector)
      ON CONFLICT (id) DO UPDATE
      SET search_text = EXCLUDED.search_text,
          metadata = EXCLUDED.metadata,
          embedding = EXCLUDED.embedding
    `,
    [
      'warehouse/orders',
      'orders paid revenue refund status customer',
      JSON.stringify({ connectionId: 'warehouse', sourceName: 'orders' }),
      JSON.stringify([1, 0, 0]),
      'finance/orders',
      'orders finance bookings gross margin',
      JSON.stringify({ connectionId: 'finance', sourceName: 'orders' }),
      JSON.stringify([0.72, 0.28, 0]),
      'warehouse/customers',
      'customers accounts lifecycle region',
      JSON.stringify({ connectionId: 'warehouse', sourceName: 'customers' }),
      JSON.stringify([0, 1, 0]),
    ],
  );
  await db.query(`
    INSERT INTO spike_dictionary_values (connection_id, source_name, column_name, value)
    VALUES
      ('warehouse', 'orders', 'status', 'refunded'),
      ('warehouse', 'orders', 'status', 'paid'),
      ('warehouse', 'customers', 'region', 'emea')
    ON CONFLICT DO NOTHING
  `);
}

async function closeDb(db) {
  if (typeof db.close === 'function') {
    await db.close();
  }
}

async function main() {
  const importTimer = await timed('dynamic import @electric-sql/pglite', async () => {
    const [{ PGlite }, { vector }, { pg_trgm }] = await Promise.all([
      import('@electric-sql/pglite'),
      import('@electric-sql/pglite/vector'),
      import('@electric-sql/pglite/contrib/pg_trgm'),
    ]);
    return { PGlite, vector, pg_trgm };
  });
  const { PGlite, vector, pg_trgm } = importTimer.value;
  const tempDir = await mkdtemp(join(tmpdir(), 'ktx-pglite-report-'));
  const dataDir = join(tempDir, 'pgdata');
  let db;
  let reopened;
  try {
    const createTimer = await timed('create persistent PGlite database and load extensions', async () => {
      db = await createDb(PGlite, vector, pg_trgm, dataDir);
      return true;
    });
    const seedTimer = await timed('seed hybrid search fixture', async () => seed(db));
    const ftsTimer = await timed('Postgres FTS query', () =>
      db.query(
        `
          SELECT id
          FROM spike_documents
          WHERE to_tsvector('english', search_text) @@ websearch_to_tsquery('english', $1)
          ORDER BY ts_rank_cd(to_tsvector('english', search_text), websearch_to_tsquery('english', $1)) DESC, id ASC
          LIMIT 1
        `,
        ['paid orders'],
      ),
    );
    const vectorTimer = await timed('pgvector cosine query', () =>
      db.query(
        `
          SELECT id, 1 - (embedding <=> $1::vector) AS similarity
          FROM spike_documents
          ORDER BY embedding <=> $1::vector, id ASC
          LIMIT 1
        `,
        [JSON.stringify([1, 0, 0])],
      ),
    );
    const trigramTimer = await timed('pg_trgm dictionary query', () =>
      db.query(
        `
          SELECT connection_id || '/' || source_name AS id, value, similarity(value, $1) AS score
          FROM spike_dictionary_values
          WHERE similarity(value, $1) > 0
          ORDER BY score DESC, id ASC, value ASC
          LIMIT 1
        `,
        ['refund'],
      ),
    );
    const sameInstanceTimer = await timed('same instance parallel reads', () =>
      Promise.all(Array.from({ length: 4 }, () => db.query('SELECT COUNT(*)::int AS count FROM spike_documents'))),
    );
    // Probe whether a second PGlite instance can open the same data directory
    // while the first instance is still open.
    let secondOpenStatus = 'opened';
    let secondOpenMessage = 'Second direct opener executed SELECT 1.';
    let second;
    try {
      second = await createDb(PGlite, vector, pg_trgm, dataDir);
      await second.query('SELECT 1');
    } catch (error) {
      secondOpenStatus = 'blocked';
      secondOpenMessage = error instanceof Error ? error.message : String(error);
    } finally {
      if (second) {
        await closeDb(second);
      }
    }
    // Close the first instance before reopening so the reopen probe measures
    // persistence rather than concurrent access.
    await closeDb(db);
    db = undefined;
    const reopenTimer = await timed('reopen persistent PGlite database', async () => {
      reopened = await createDb(PGlite, vector, pg_trgm, dataDir);
      return reopened.query('SELECT COUNT(*)::int AS count FROM spike_documents');
    });
    const packages = await Promise.all([
      packageInfo('@electric-sql/pglite'),
      packageInfo('@electric-sql/pglite-socket'),
    ]);
    const result = {
      generatedAt: new Date().toISOString(),
      node: process.version,
      packages,
      timingsMs: {
        import: importTimer.durationMs,
        createAndExtensions: createTimer.durationMs,
        seed: seedTimer.durationMs,
        ftsQuery: ftsTimer.durationMs,
        vectorQuery: vectorTimer.durationMs,
        trigramQuery: trigramTimer.durationMs,
        sameInstanceParallelReads: sameInstanceTimer.durationMs,
        reopen: reopenTimer.durationMs,
      },
      topResults: {
        fts: ftsTimer.value.rows[0]?.id ?? null,
        vector: vectorTimer.value.rows[0]?.id ?? null,
        trigram: trigramTimer.value.rows[0]?.id ?? null,
        persistedRowCount: reopenTimer.value.rows[0]?.count ?? null,
      },
      concurrency: {
        sameInstanceReadCounts: sameInstanceTimer.value.map((queryResult) => queryResult.rows[0]?.count ?? null),
        secondDirectOpenStatus: secondOpenStatus,
        secondDirectOpenMessage: secondOpenMessage,
      },
    };
    const totalPackageBytes = packages.reduce((sum, pkg) => sum + pkg.bytes, 0);
    const recommendation =
      secondOpenStatus === 'opened'
        ? 'Prototype a PGlite backend behind an explicit owner process or socket before exposing CLI plus MCP concurrent access.'
        : 'Use a socket or owner-process architecture for any PGlite backend prototype because direct second opener access was blocked.';
    const markdown = `# Hybrid Search PGlite Spike

Generated: ${result.generatedAt}

## Summary

PGlite loaded in Node ${result.node}, enabled vector and pg_trgm extensions, executed Postgres FTS, pgvector cosine ranking, pg_trgm dictionary ranking, and reopened a persistent filesystem database.

Recommendation: ${recommendation}

## Package Footprint

| Package | Version | Approx bytes | Resolved path |
| --- | --- | ---: | --- |
${packages.map((pkg) => `| \`${pkg.name}\` | \`${pkg.version}\` | ${pkg.bytes} | \`${pkg.path}\` |`).join('\n')}

Total measured package bytes: ${totalPackageBytes}

## Timings

| Probe | Duration ms |
| --- | ---: |
${Object.entries(result.timingsMs)
  .map(([name, ms]) => `| ${name} | ${ms} |`)
  .join('\n')}

## Search Feature Results

| Probe | Top result |
| --- | --- |
| Postgres FTS | \`${result.topResults.fts}\` |
| pgvector cosine | \`${result.topResults.vector}\` |
| pg_trgm dictionary | \`${result.topResults.trigram}\` |
| Reopened persisted row count | \`${result.topResults.persistedRowCount}\` |

## Concurrency Observation

Same-instance parallel read counts: \`${result.concurrency.sameInstanceReadCounts.join(', ')}\`

Second direct opener status: \`${result.concurrency.secondDirectOpenStatus}\`

Second direct opener message:

\`\`\`text
${result.concurrency.secondDirectOpenMessage}
\`\`\`

## Decision

The SQLite backend remains the production default. The next PGlite step, if approved, is an owner-process or socket-backed prototype that reuses the existing \`SearchBackendCapabilities\` and backend conformance helpers without changing the public CLI surface.
`;
    await writeFile(reportPath, markdown);
    process.stdout.write(`Wrote ${relative(process.cwd(), reportPath)}\n`);
    process.stdout.write(JSON.stringify(result, null, 2));
    process.stdout.write('\n');
  } finally {
    if (db) {
      await closeDb(db);
    }
    if (reopened) {
      await closeDb(reopened);
    }
    await rm(tempDir, { recursive: true, force: true });
  }
}

main().catch((error) => {
  console.error(error);
  process.exitCode = 1;
});