diff --git a/docs/content/docs/cli-reference/ktx-setup.mdx b/docs/content/docs/cli-reference/ktx-setup.mdx
index 77f8b359..943033b5 100644
--- a/docs/content/docs/cli-reference/ktx-setup.mdx
+++ b/docs/content/docs/cli-reference/ktx-setup.mdx
@@ -90,7 +90,8 @@ ktx setup [options]
 | `--enable-historic-sql` | Enable Historic SQL when the selected database supports it | `false` |
 | `--disable-historic-sql` | Disable Historic SQL for the selected database | `false` |
 | `--historic-sql-window-days ` | Historic SQL query-history window in days | — |
-| `--historic-sql-min-calls ` | Postgres `pg_stat_statements` minimum calls floor | — |
+| `--historic-sql-min-executions ` | Minimum executions for a Historic SQL template | — |
+| `--historic-sql-min-calls ` | Alias for `--historic-sql-min-executions` for one release | — |
 | `--historic-sql-service-account-pattern ` | Historic SQL service-account regex; repeatable | — |
 | `--historic-sql-redaction-pattern ` | Historic SQL SQL-literal redaction regex; repeatable | — |
 
diff --git a/docs/content/docs/integrations/primary-sources.mdx b/docs/content/docs/integrations/primary-sources.mdx
index 8f8c1391..be71cba0 100644
--- a/docs/content/docs/integrations/primary-sources.mdx
+++ b/docs/content/docs/integrations/primary-sources.mdx
@@ -76,8 +76,11 @@ PostgreSQL Historic SQL mines real query patterns from `pg_stat_statements`. Thi
 
 ```yaml
 historicSql:
-  minCalls: 5 # Minimum call count to include a query template
-  maxTemplatesPerRun: 5000
+  enabled: true
+  dialect: postgres
+  minExecutions: 5
+  filters:
+    dropTrivialProbes: true
 ```
 
 ### Dialect notes
@@ -134,18 +137,27 @@ For multiple schemas:
 | Foreign keys | No | Not available in Snowflake |
 | Row count estimates | Yes | From `INFORMATION_SCHEMA.TABLES.ROW_COUNT` |
 | Column statistics | No | — |
-| Historic SQL | Configurable | Query-history settings can be stored; local CLI Historic SQL ingest currently uses the Postgres path |
+| Historic SQL | Yes | Via `SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY` when enabled |
 | Table sampling | Yes | — |
 
 ### Historic SQL
 
-Snowflake Historic SQL settings describe how query history should be sampled when that runtime path is available.
+Snowflake Historic SQL reads aggregated query-history templates from
+`SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY` and feeds the same unified staged
+artifact shape as Postgres and BigQuery.
 
 ```yaml
 historicSql:
+  enabled: true
+  dialect: snowflake
   windowDays: 90
+  minExecutions: 5
+  filters:
+    dropTrivialProbes: true
+    serviceAccounts:
+      patterns: ['^svc_']
+      mode: exclude
   redactionPatterns: []
-  serviceAccountUserPatterns: []
 ```
 
 ### Dialect notes
@@ -200,18 +212,27 @@ The project ID is extracted automatically from the service account JSON file.
 | Foreign keys | No | Not available in BigQuery |
 | Row count estimates | Yes | From table metadata |
 | Column statistics | No | — |
-| Historic SQL | Configurable | Query-history settings can be stored; local CLI Historic SQL ingest currently uses the Postgres path |
+| Historic SQL | Yes | Via region-scoped `INFORMATION_SCHEMA.JOBS_BY_PROJECT` when enabled |
 | Table sampling | Yes | — |
 
 ### Historic SQL
 
-BigQuery Historic SQL settings describe how `INFORMATION_SCHEMA.JOBS_BY_PROJECT` should be sampled when that runtime path is available.
+BigQuery Historic SQL reads aggregated query-history templates from
+region-scoped `INFORMATION_SCHEMA.JOBS_BY_PROJECT` and feeds the same unified
+staged artifact shape as Postgres and Snowflake.
 
 ```yaml
 historicSql:
+  enabled: true
+  dialect: bigquery
   windowDays: 90
+  minExecutions: 5
+  filters:
+    dropTrivialProbes: true
+    serviceAccounts:
+      patterns: ['@bot\\.']
+      mode: exclude
   redactionPatterns: []
-  serviceAccountUserPatterns: []
 ```
 
 ### Dialect notes
diff --git a/examples/README.md b/examples/README.md
index 134b4c24..7b463bef 100644
--- a/examples/README.md
+++ b/examples/README.md
@@ -29,10 +29,10 @@ warehouse credential.
 
 ## postgres-historic
 
-`postgres-historic/` is a manual Docker-backed smoke for Postgres
-historic-SQL ingest via `pg_stat_statements`. It verifies setup, first-run
-baseline creation, delta-only follow-up ingest, and reset handling without
-requiring a managed Postgres service.
+`postgres-historic/` is a manual Docker-backed smoke for Postgres historic-SQL
+ingest via `pg_stat_statements`. It verifies setup, unified Historic SQL artifacts,
+managed daemon batch SQL analysis, and no-WorkUnit idempotency for unchanged
+bucketed table and pattern inputs.
 
 ## package-artifacts
 
diff --git a/examples/postgres-historic/README.md b/examples/postgres-historic/README.md
index f97d4b9b..9e5e6ad4 100644
--- a/examples/postgres-historic/README.md
+++ b/examples/postgres-historic/README.md
@@ -1,13 +1,17 @@
 # Postgres Historic SQL Example
 
-This example is a manual smoke for Postgres historic-SQL ingest through
-`pg_stat_statements`. It starts Postgres 14 with the extension preloaded,
-generates query workload under separate users, runs `ktx setup` with
-`--enable-historic-sql`, and verifies three local ingest runs:
+This example is a manual smoke for the redesigned Postgres historic-SQL ingest
+path through `pg_stat_statements`. It starts Postgres 14 with the extension
+preloaded, generates query workload under separate users, runs `ktx setup` with
+`--enable-historic-sql`, and verifies the unified staged artifacts:
 
-- first run creates a fresh PGSS baseline
-- second run emits only positive deltas
-- reset run treats `pg_stat_statements_reset()` as a fresh baseline
+- `manifest.json`
+- `tables/*.json`
+- `patterns-input.json`
+
+The smoke also runs the same workload twice and verifies the second stage-only
+run has `workUnitCount: 0`, which proves unchanged bucketed table and pattern
+inputs do not schedule LLM work.
 
 ## Prerequisites
 
@@ -36,8 +40,9 @@ Set `KTX_POSTGRES_HISTORIC_KEEP_DOCKER=1` to leave the container running after
 the script exits.
 
 The smoke validates the historic-SQL raw snapshot path without requiring LLM
-credentials. It uses KTX's local stage-only ingest API after `ktx setup` so the
-PGSS baseline and delta behavior can be checked independently from curation.
+credentials. It uses KTX's local stage-only ingest API after `ktx setup`, so the
+deterministic reader, batch SQL parser, stable artifact writer, and diff-based
+WorkUnit planning are checked independently from curation.
 
 ## Manual Commands
 
@@ -64,7 +69,7 @@ node packages/cli/dist/bin.js --project-dir /tmp/ktx-postgres-historic setup \
   --database-url env:WAREHOUSE_DATABASE_URL \
   --database-schema public \
   --enable-historic-sql \
-  --historic-sql-min-calls 2 \
+  --historic-sql-min-executions 2 \
   --yes \
   --no-input
 ```
@@ -75,11 +80,16 @@ node packages/cli/dist/bin.js --project-dir /tmp/ktx-postgres-historic setup \
 pnpm run ktx -- dev doctor --project-dir /tmp/ktx-postgres-historic --no-input
 ```
 
-The installed CLI form is `ktx dev doctor --project-dir
-/tmp/ktx-postgres-historic --no-input`. Expected output includes `PASS Postgres
-Historic SQL (warehouse)` when `pg_stat_statements` is installed,
-`pg_read_all_stats` is granted, tracking is enabled, and
-`pg_stat_statements.max` is at least 5000.
+The installed CLI form is:
+
+```bash
+ktx dev doctor --project-dir /tmp/ktx-postgres-historic --no-input
+```
+
+Expected output includes `PASS Postgres Historic SQL (warehouse)` when
+`pg_stat_statements` is installed, `pg_read_all_stats` is granted, and tracking
+is enabled. A low `pg_stat_statements.max` value is reported as an informational
+note, not a warning.
 
 Run local historic-SQL ingest:
 
@@ -92,7 +102,7 @@ pnpm run ktx -- dev ingest run --project-dir /tmp/ktx-postgres-historic \
   --no-input
 ```
 
-The full `dev ingest run` path also runs curation work units, so it requires a
+The full `dev ingest run` path also runs curation WorkUnits, so it requires a
 configured LLM provider.
 
 Inspect the latest manifest:
@@ -101,9 +111,10 @@ Inspect the latest manifest:
 find /tmp/ktx-postgres-historic/raw-sources/warehouse/historic-sql -name manifest.json | sort | tail -n 1
 ```
 
-The manifest should have `dialect: "postgres"`, `degraded: true`,
-`baselineFirstRun: true` on the first run, and populated `pgServerVersion` and
-`statsResetAt`.
+The manifest should have `source: "historic-sql"`, `dialect: "postgres"`,
+positive `snapshotRowCount`, positive `touchedTableCount`, numeric
+`parseFailures`, `warnings`, and `probeWarnings`. The same directory should
+contain `patterns-input.json` and one `tables/*.json` file per touched table.
 
 ## Troubleshooting
 
@@ -111,8 +122,8 @@ The manifest should have `dialect: "postgres"`, `degraded: true`,
   `CREATE EXTENSION pg_stat_statements;` both happened in the `analytics`
   database.
 - Missing grants: confirm `GRANT pg_read_all_stats TO ktx_reader;`.
-- Empty templates: rerun `scripts/generate-workload.sh base` and keep
-  `--historic-sql-min-calls 2` for the smoke.
+- Empty snapshot: rerun `scripts/generate-workload.sh base` and keep
+  `--historic-sql-min-executions 2` for the smoke.
 - SQL-analysis failures: run `pnpm run ktx -- runtime doctor` from the KTX
   repository root and confirm `uv`, the bundled Python wheel, and the managed
   runtime all pass.