diff --git a/docs/spider2-dbt-benchmark.md b/docs/spider2-dbt-benchmark.md
new file mode 100644
index 00000000..cb619916
--- /dev/null
+++ b/docs/spider2-dbt-benchmark.md
@@ -0,0 +1,544 @@
+# Spider2-DBT × KTX benchmarking — handoff
+
+This document is the state of the Spider2-DBT benchmarking experiment as of
+2026-05-18. It is written so that a fresh agent can pick up the work,
+particularly after adding a DuckDB scan connector to KTX.
+
+---
+
+## 1. What we are benchmarking
+
+[Spider2-SQL](https://spider2-sql.github.io/) is an ICLR 2025 oral benchmark
+for "real-world enterprise text-to-SQL workflows". It has three tracks:
+
+- **Spider2.0-Snow** — 547 examples, Snowflake.
+- **Spider2.0-Lite** — 547 examples, BigQuery / Snowflake / SQLite.
+- **Spider2.0-DBT** — 68 examples, **DuckDB-backed dbt projects**.
+
+We are participating in the **DBT track**. Public baselines:
+
+| Method          | Spider2-SQL |
+|-----------------|-------------|
+| GPT-4o          | ~10%        |
+| o1-preview      | ~17%        |
+| Top published   | ~30–40%     |
+
+Repo: `https://github.com/xlang-ai/Spider2`, the DBT track is under
+`spider2-dbt/`.
+
+### Task format
+
+Each instance is a self-contained dbt project (`dbt_project.yml`,
+`profiles.yml`, `models/`, sometimes `seeds/`, `macros/`, `dbt_packages/`)
+plus a `.duckdb` file pre-loaded with raw source tables. The instruction is
+a single underspecified natural-language sentence, e.g.:
+
+> "Complete the project of this database to show the metrics of each traffic
+> source, I believe every touchpoint in the conversion path is equally
+> important, please choose the most suitable attribution method."
+
+The agent must edit/add models and run `dbt build` until the warehouse
+contains the required tables. **All 68 instances are evaluated with
+`duckdb_match`**: the official evaluator diffs specific columns of specific
+tables in the agent's DuckDB against a gold DuckDB. A pass is row-set match
+on `condition_cols` for each `condition_tab`.
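The pass criterion can be pictured with a small sketch. This is not the official `duckdb_match` from `eval_utils.py`: it assumes the rows have already been fetched as tuples, that `condition_cols` holds zero-based column indices, and that values are compared exactly and as a multiset (the real evaluator may normalise or tolerate differences that this sketch does not). The names `project` and `rows_match` and the sample rows are made up for illustration.

```python
from collections import Counter
from typing import Iterable, Sequence


def project(rows: Iterable[Sequence], cols: Sequence[int]) -> Counter:
    """Keep only the columns at the given indices and count each projected row."""
    return Counter(tuple(row[i] for i in cols) for row in rows)


def rows_match(pred_rows, gold_rows, condition_cols) -> bool:
    """True when both tables hold the same rows on condition_cols, ignoring order."""
    return project(pred_rows, condition_cols) == project(gold_rows, condition_cols)


# Illustrative rows only; not taken from any Spider2 instance.
gold = [("2024-01", "google", 12.5), ("2024-01", "bing", 8.0)]
pred_ok = [("2024-01", "bing", 8.0), ("2024-01", "google", 12.5)]    # same rows, other order
pred_bad = [("2024-01", "bing", 8.0), ("2024-01", "google", 12.49)]  # one value differs

print(rows_match(pred_ok, gold, [0, 1, 2]))   # True
print(rows_match(pred_bad, gold, [0, 1, 2]))  # False
print(rows_match(pred_bad, gold, [0, 1]))     # True: the differing column is not checked
```

The last call shows why a structurally correct table can still fail: only the listed columns are compared, but every value in them has to agree.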
+
+`spider2-dbt.jsonl` (instructions) and `evaluation_suite/gold/spider2_eval.jsonl`
+(evaluator config + gold DuckDBs) are both clone-time artifacts.
+
+---
+
+## 2. On-disk layout
+
+Everything benchmark-related lives outside this repo at
+`/Users/klo-dev/work/spider2-ktx/`:
+
+```
+/Users/klo-dev/work/spider2-ktx/
+├── .venv/                          # Python 3.11 (uv-managed)
+├── Spider2/                        # cloned `git clone xlang-ai/Spider2`
+│   └── spider2-dbt/
+│       ├── examples/               # 69 dirs, 68 are in spider2-dbt.jsonl
+│       │   ├── playbook001/dbt_project.yml ...
+│       │   └── ...
+│       ├── examples/spider2-dbt.jsonl   # 68 instance instructions
+│       ├── evaluation_suite/
+│       │   ├── evaluate.py         # official scorer
+│       │   ├── eval_utils.py       # duckdb_match, table_match, ...
+│       │   └── gold/               # gold .duckdb per instance
+│       └── setup.py                # unpacks DBT_start_db.zip + dbt_gold.zip
+├── orchestrator.py                 # main runner (see §4)
+├── agent_prompt.md                 # system prompt written by orchestrator
+├── work/                           # per-instance workspaces (ktx + dbt)
+│   └── /
+│       ├── ktx.yaml                # generated by orchestrator
+│       ├── .ktx/                   # ktx state (sqlite, git, cache)
+│       ├── wiki/global/*.md        # OUTPUT of `ktx ingest dbt_project`
+│       ├── semantic-layer/         # empty today (no DuckDB connector)
+│       └── dbt/                    # copy of Spider2/spider2-dbt/examples/
+│           ├── dbt_project.yml
+│           ├── profiles.yml
+│           ├── models/...
+│           └── .duckdb
+├── results/                        # submission folder
+│   ├── results_metadata.jsonl
+│   └── /.duckdb
+└── logs//
+    ├── ktx-init.log
+    ├── ktx-ingest.log
+    ├── claude.log                  # stderr from the sub-agent
+    └── claude-stream.jsonl         # full structured trace
+```
+
+The two source-data zips (~1 GB) were pulled with `gdown` from the Drive
+IDs in `Spider2/spider2-dbt/setup.py` and then `setup.py` was run to unpack
+them in place. No need to re-do that step.
+
+---
+
+## 3. Current ktx.yaml per instance
+
+Generated by `orchestrator.write_ktx_yaml()`. Same template for every
+workspace, with the source_dir absolute path swapped in:
+
+```yaml
+connections:
+  dbt_project:
+    driver: dbt
+    source_dir: /Users/klo-dev/work/spider2-ktx/work//dbt
+storage:
+  state: sqlite
+  search: sqlite-fts5
+  git:
+    auto_commit: false
+    author: ktx
+llm:
+  provider:
+    backend: claude-code        # uses local Claude Code OAuth — no API key
+  models:
+    default: sonnet
+    triage: haiku
+    candidateExtraction: sonnet
+    curator: sonnet
+    reconcile: sonnet
+    repair: sonnet
+ingest:
+  adapters: [dbt]
+  embeddings:
+    backend: deterministic
+    model: deterministic
+    dimensions: 8
+  workUnits:
+    stepBudget: 40
+    maxConcurrency: 1
+    failureMode: continue
+agent:
+  run_research:
+    enabled: false
+    max_iterations: 20
+    default_toolset: [sl_query, wiki_search, sl_read_source]
+memory:
+  auto_commit: false
+scan:
+  enrichment: { mode: none }
+  relationships:
+    enabled: false              # disabled — no warehouse to relate against
+    llmProposals: false
+```
+
+Notes / gotchas learned the hard way:
+
+- `source_dir` **must be absolute** and **must not be the same as
+  `--project-dir`** (the dbt adapter copies the dir into
+  `.ktx/cache/local-ingest/` and refuses to recursively copy a parent into
+  itself). Hence the `work//dbt/` sub-structure.
+- `llm.provider.backend: none` (the `dev init` default) makes `ktx ingest`
+  on the dbt adapter fail with `"requires llm.provider.backend: anthropic,
+  vertex, gateway, or claude-code"`. The dbt adapter is LLM-driven.
+- `llm.models.default` is **required** whenever `provider.backend != none`.
+- `claude-code` backend reuses the local Claude Code OAuth session, so no
+  `ANTHROPIC_API_KEY` env var is needed.
+- `--yes` and `--no-input` are mutually exclusive on `ktx ingest`.
+
+---
+
+## 4. Orchestrator
+
+`/Users/klo-dev/work/spider2-ktx/orchestrator.py` — the one moving part.
+
+Per-instance flow:
+
+1. `make_workspace(id)` — copy `Spider2/spider2-dbt/examples//` into
+   `work//dbt/`.
+2. `ktx dev init >` and write the ktx.yaml above.
+3. `ktx ingest dbt_project --plain --yes` — runs the LLM-driven dbt
+   adapter; output lands in `work//wiki/global/*.md`.
+4. Spawn `claude --print --permission-mode bypassPermissions ...` with
+   - cwd = `work//dbt` (the agent works inside the dbt project)
+   - `--add-dir work/` (so the agent can read the wiki)
+   - `--allowedTools Bash,Edit,Read,Write,Glob,Grep,WebFetch,TodoWrite`
+   - `--system-prompt` from `SYSTEM_PROMPT` (see `agent_prompt.md`)
+   - the prompt is the Spider2 instruction.
+5. Stream the agent's JSONL events to `logs//claude-stream.jsonl`,
+   capture the final `result` message as a summary string.
+6. `collect_result()` — copy the largest `*.duckdb` in `work//dbt/`
+   into `results//.duckdb` and add an entry
+   `{instance_id, answer_type: "file", answer_or_path: ".duckdb"}`
+   to `results/results_metadata.jsonl`. Metadata is re-written after every
+   instance, so partial runs are recoverable.
+
+CLI:
+
+```bash
+cd /Users/klo-dev/work/spider2-ktx
+source .venv/bin/activate
+
+# One specific instance
+python orchestrator.py -n playbook001 --force --budget 3 --timeout 1500
+
+# All 68
+python orchestrator.py --budget 3 --timeout 1500 --evaluate
+
+# Skip ingest (when workspace already has wiki) — speeds re-runs
+python orchestrator.py -n provider001 --skip-ingest
+```
+
+Flags:
+
+| Flag | Default | Meaning |
+|------|---------|---------|
+| `-n, --instance` | none | Repeatable; restrict to listed instance ids |
+| `-l, --limit` | none | First N from spider2-dbt.jsonl |
+| `--model` | sonnet | Claude Code model alias |
+| `--budget` | 4.0 | `--max-budget-usd` per instance |
+| `--timeout` | 1800 | Wall-clock seconds per instance |
+| `--force` | off | Wipe and recreate workspace |
+| `--skip-ingest` | off | Reuse existing wiki |
+| `--evaluate` | off | Run `evaluate.py` at the end |
+
+Scoring:
+
+```bash
+cd /Users/klo-dev/work/spider2-ktx/Spider2/spider2-dbt/evaluation_suite
+python evaluate.py \
+  --result_dir /Users/klo-dev/work/spider2-ktx/results \
+  --gold_dir ./gold
+```
+
+The official evaluator prints `score = passes / total`, and one line per
+passing instance id.
+
+---
+
+## 5. What ktx currently provides to the agent
+
+`ktx ingest dbt_project --plain --yes` on a Spider2 example emits only
+**wiki pages** under `work//wiki/global/*.md`. There are **no
+semantic-layer entities** — `ktx sl list` returns `items: []`.
+
+Example, for `playbook001`:
+
+```
+work/playbook001/wiki/global/
+├── acme-dbt-project.md          # project overview: profile, sources, models
+└── cpa-roas-definitions.md      # exact CPA & ROAS formulas, grain, columns
+```
+
+For `asset001`:
+
+```
+work/asset001/wiki/global/
+├── dbt-asset-project-overview.md
+├── bar-quotes.md
+└── book-value.md
+```
+
+The wiki pages **do** carry high-signal information for these tasks — they
+pre-digest the dbt project into prose with formulas, grain, columns, and
+unverified-vs-verified annotations. That's what made `playbook001` score
+1.0: the wiki said `CPA = total_spend / attribution_points`, `ROAS =
+attribution_revenue / total_spend`, grain `(date_month, utm_source)`, and
+the agent transcribed that into `cpa_and_roas.sql` directly.
+
+The wiki itself flags the missing piece:
+
+> "Run `ktx scan` on the DuckDB connection to populate the warehouse schema
+> and enable SL source creation for these tables."
+
+Which brings us to:
+
+---
+
+## 6. The DuckDB connector gap
+
+`ktx` ships connectors for: `postgres / postgresql / mysql / snowflake /
+bigquery / sqlite / sqlserver / clickhouse`. **There is no DuckDB scan
+connector**. References:
+
+- `packages/cli/src/connection.test.ts:494` — `driver: duckdb` is asserted
+  to be **unknown** by `createKtxCliScanConnector`.
+- `packages/context/src/sl/local-query.ts:59` — `DUCKDB: 'duckdb'` is a SQL
+  *dialect* constant for query generation, not a connector.
+- `packages/context/src/mcp/local-project-ports.ts:32` — same: dialect
+  hint, not a connector.
+
+Consequence: with the current setup we can't add a warehouse connection
+that introspects each example's `.duckdb`. The dbt adapter falls back to
+wiki-only output, which is why `semantic-layer/` stays empty.
+
+### The plan you're about to act on
+
+Add `packages/connector-duckdb/` modeled on `packages/connector-sqlite/`:
+
+| File | Source to copy from | Adapt |
+|------|---------------------|-------|
+| `package.json` | `connector-sqlite/package.json` | dep `better-sqlite3` → `duckdb` (or `@duckdb/node-api`) |
+| `src/dialect.ts` | `connector-sqlite/src/dialect.ts` | Quote with `"`; map types: `BIGINT → number`, `VARCHAR → string`, `TIMESTAMP → time`, etc. |
+| `src/connector.ts` | `connector-sqlite/src/connector.ts` | Replace `Database` with the DuckDB equivalent. Use `information_schema` instead of `sqlite_master`/`PRAGMA table_info`. For FKs DuckDB also has `information_schema.referential_constraints` + `key_column_usage`. Estimated row counts → `SELECT estimated_size FROM duckdb_tables()`. |
+| `src/index.ts` | `connector-sqlite/src/index.ts` | Re-export, plus `isKtxDuckDbConnectionConfig` |
+| `src/connector.test.ts` + `dialect.test.ts` | sqlite equivalents | Mirror tests; the sqlite ones are a good template for what to cover |
+
+Then wire it up:
+
+1. `packages/cli/src/local-scan-connectors.ts` — add a branch for
+   `driver === 'duckdb'`, mirroring the sqlite branch.
+2. `packages/context/src/project/driver-schemas.ts` — extend
+   `KTX_WAREHOUSE_DRIVERS` with `duckdb`. Connection config takes the same
+   shape as sqlite (`path` or `url`).
+3. Add to `pnpm-workspace.yaml` if it isn't auto-discovered.
+4. `pnpm install && pnpm --filter @ktx/connector-duckdb run build && pnpm
+   --filter @ktx/cli run build`.
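On the orchestrator side, the sqlite-like connection shape from step 2 could be produced per workspace roughly as follows. This is a sketch under assumptions: the connector does not exist yet, so the `driver`/`path` keys are the shape this document proposes rather than a confirmed schema, and `largest_duckdb` and `warehouse_connection` are hypothetical helpers. Picking the largest `*.duckdb` file borrows the heuristic that `collect_result()` is described as using.

```python
import tempfile
from pathlib import Path


def largest_duckdb(dbt_dir: Path) -> Path:
    """Return the largest *.duckdb file directly inside dbt_dir."""
    candidates = sorted(dbt_dir.glob("*.duckdb"), key=lambda p: p.stat().st_size)
    if not candidates:
        raise FileNotFoundError(f"no .duckdb file in {dbt_dir}")
    return candidates[-1]


def warehouse_connection(dbt_dir: Path) -> dict:
    """Build a connection entry in the assumed sqlite-like shape (driver + absolute path)."""
    return {"driver": "duckdb", "path": str(largest_duckdb(dbt_dir).resolve())}


# Demo on a throwaway directory standing in for a workspace's dbt/ folder.
with tempfile.TemporaryDirectory() as tmp:
    dbt_dir = Path(tmp)
    (dbt_dir / "playbook.duckdb").write_bytes(b"\0" * 2048)  # stand-in warehouse file
    (dbt_dir / "scratch.duckdb").write_bytes(b"\0" * 16)     # smaller stray file
    print(warehouse_connection(dbt_dir))
```

The path is resolved to an absolute one because the `source_dir` gotcha in §3 suggests ktx expects absolute paths; whether the new connector enforces that too is not known.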
+
+Smoke test on `playbook001`:
+
+```bash
+cd /Users/klo-dev/work/spider2-ktx/work/playbook001
+# edit ktx.yaml — add a duckdb connection pointing at the warehouse:
+#   connections:
+#     warehouse:
+#       driver: duckdb
+#       path: /Users/klo-dev/work/spider2-ktx/work/playbook001/dbt/playbook.duckdb
+node /Users/klo-dev/conductor/workspaces/ktx/santiago/packages/cli/dist/bin.js \
+  connection test warehouse
+node ... scan warehouse                       # populates raw-sources/
+node ... ingest dbt_project --plain --yes     # should now write semantic-layer/*.yaml
+node ... sl list --json
+```
+
+After that, update `orchestrator.write_ktx_yaml()` to also emit a
+`warehouse` connection per instance, pointing at
+`work//dbt/.duckdb`. The `` differs per instance (e.g.
+`playbook.duckdb`, `asset.duckdb`); the orchestrator already has
+`discover_duckdb_name()` for that.
+
+---
+
+## 7. Results so far (5-instance pilot)
+
+Final score: **1 / 5 = 20%** on the official `evaluate.py`.
+
+| Instance      | Agent finished | Time (s) | Cost (USD) | Turns | Tool calls                                  | Eval |
+|---------------|----------------|----------|------------|-------|---------------------------------------------|------|
+| playbook001   | OK             | 82       | $0.28      | 30    | Bash 12, Read 5, Write 1, Edit 1            | ✅ 1.0 |
+| provider001   | OK             | 289      | $0.57      | 44    | Bash 16, Read 7, Edit 1, Write 2            | ❌ 0 |
+| asana001      | OK             | 181      | $0.54      | 44    | Bash 24, Read 1, Write 2, Edit 1            | ❌ 0 |
+| shopify001    | OK             | 133      | $0.50      | 41    | Bash 13, Read 15, Write 2                   | ❌ 0 |
+| asset001      | OK             | 189      | $0.42      | 44    | Bash 14, Read 14, Write 2                   | ❌ 0 |
+
+Total spend on the pilot: ≈ $2.30. Mean: ~175 s, ~$0.46, ~41 turns.
+
+**All five agent runs finished cleanly** — `dbt build` green, every target
+table materialised in the DuckDB. The four failures are *value-level*
+mismatches: column orderings, tie-breaks, NULL handling, or
+precision/rounding diverging from gold. That's exactly the failure mode
+that richer ktx context (real column dtypes, sample values, primary keys,
+SL measures) should address.
+
+For reference, GPT-4o reported ~10% and o1-preview ~17%, so a 20% on n=5 is
+roughly in band but the sample is far too small to claim a delta.
+
+### Why playbook001 passed
+
+The wiki page `cpa-roas-definitions.md` pre-derived:
+
+```
+CPA  = total_spend / attribution_points       (column: cost_per_acquisition)
+ROAS = attribution_revenue / total_spend      (column: return_on_advertising_spend)
+Grain: (date_month, utm_source)
+```
+
+The agent read this page (via `KTX_PROJECT_DIR=.. ktx wiki list --json`
+then plain `Read` on `../wiki/global/cpa-roas-definitions.md`), wrote the
+missing `models/cpa_and_roas.sql` directly from it, and `dbt build`
+produced the correct table.
+
+### Why the others failed (best guesses, not investigated deeply)
+
+- `provider001`: gold checks `provider` table columns
+  `[0,1,2,5,6,7,9,10,11,12,13]` and `specialty_mapping` columns `[0,1]`.
+  All 7 tables are produced with the right schema; the tie-break logic for
+  "most specific specialty" diverges from gold.
+- `asana001`: 95 models materialised, 55 tests passed; the gold compares
+  `asana__team [0..9]` and `asana__user [0,1,2]` and our values differ on
+  one or more aggregations (open vs completed task counts, avg close
+  time).
+- `shopify001` and `asset001`: similar pattern — structure right, values
+  off.
+
+---
+
+## 8. Hypotheses for the next agent
+
+In rough order of expected impact:
+
+1. **DuckDB connector** (above) so `ktx scan` and `ktx ingest` together
+   emit `semantic-layer//.yaml` with real columns, types,
+   primary keys, sample values, and (if enabled) relationship proposals.
+   Expose those to the sub-agent via either:
+   - `ktx sl read ` calls from Bash, or
+   - the `ktx mcp stdio` server attached via `claude --mcp-config`.
+
+2. **Verification step in the system prompt** — currently the agent
+   declares success on `dbt build` green. Add: "Before declaring success,
+   for every target table run `SELECT * FROM ORDER BY 1 LIMIT 5` and
+   sanity-check column count, types, no NaN/NULL in not-null cols, row
+   count > 0; also compare the produced column names with the column list
+   in the schema.yml / wiki / sl source." Cheap fix; should turn some of
+   the value-mismatch fails into passes (or into productive iteration).
+
+3. **dbt run-stage tests** — Spider2 examples often ship `tests/`; have
+   the agent run `dbt test` after `dbt build` and treat any new test
+   failures as a signal to revise. Some examples actually have
+   gold-verifying tests in the project itself.
+
+4. **Try Opus for hard cases** — the orchestrator passes `--model
+   sonnet`; flipping to `opus` on the retries of failed instances may
+   recover some of the value-mismatch tasks. Cost goes up ~5×.
+
+5. **`ktx scan` query-history off** — currently `query-history` is
+   `skipped` because the dbt adapter doesn't expose history. Once a
+   warehouse connection exists, leave it skipped (DuckDB has no useful
+   history for these one-shot DBs).
+
+6. **Parallelism** — `claude` is rate-limit-sensitive but the orchestrator
+   is fully sequential. Two or three workers via `concurrent.futures`
+   would cut wall-clock to roughly 1–2 h for the full 68.
+
+---
+
+## 9. Resuming after the DuckDB connector lands
+
+Concrete steps for the next agent:
+
+1. **Confirm the connector is wired up**:
+   ```bash
+   cd /Users/klo-dev/conductor/workspaces/ktx/santiago
+   pnpm run build
+   pnpm run ktx -- dev schema | jq '.properties.connections.additionalProperties.oneOf[].properties.driver.const' | sort -u
+   # should include "duckdb"
+   ```
+
+2. **Update the orchestrator's ktx.yaml template** in
+   `/Users/klo-dev/work/spider2-ktx/orchestrator.py` (`write_ktx_yaml`).
+   Pseudocode:
+
+   ```python
+   db_name = discover_duckdb_name(ws / "dbt")   # e.g. "playbook.duckdb"
+   ...
+   connections:
+     dbt_project:
+       driver: dbt
+       source_dir: {ws}/dbt
+     warehouse:
+       driver: duckdb
+       path: {ws}/dbt/{db_name}
+   ```
+
+   Also re-enable scan relationship discovery if it gives useful output:
+   ```yaml
+   scan:
+     enrichment: { mode: deterministic }
+     relationships:
+       enabled: true
+       llmProposals: true
+   ```
+
+3. **Verify on a known-passing instance first** (`playbook001`) to make
+   sure the dbt+warehouse combo still emits the same wiki pages it did
+   before, plus new SL YAML, and the score stays at 1.0:
+
+   ```bash
+   cd /Users/klo-dev/work/spider2-ktx
+   source .venv/bin/activate
+   python orchestrator.py -n playbook001 --force --budget 3 --timeout 1500
+   cd Spider2/spider2-dbt/evaluation_suite
+   python evaluate.py --result_dir ../../../results --gold_dir ./gold
+   ```
+
+4. **Optionally** improve the system prompt in `orchestrator.SYSTEM_PROMPT`
+   to instruct the agent to use SL tools:
+   - `ktx sl list --json`
+   - `ktx sl read `
+   - `ktx sl query --connection-id warehouse --measure ...`
+
+5. **Re-run a small batch** with diverse failures (`provider001`,
+   `asana001`, `shopify001`, `asset001`) to see whether SL access lifts
+   those scores from 0 → 1. If it moves the needle, run the full 68:
+
+   ```bash
+   python orchestrator.py --budget 3 --timeout 1500 --evaluate
+   ```
+
+   Sequential 68 × ~3 min ≈ 3.4 h, ~$31 at the pilot's mean of ~$0.46 each.
+
+6. **Write the result back** — append a section to this doc with the new
+   score and a one-line note per failing instance, so we accumulate
+   evidence over iterations rather than losing it.
+
+---
+
+## 10. Misc references
+
+- KTX MCP tools (see `packages/context/src/mcp/context-tools.ts`):
+  `connection_list`, `wiki_search`, `wiki_read`, `sl_read_source`,
+  `sl_query`, `entity_details`, `dictionary_search`, `discover_data`,
+  `sql_execution`, `memory_ingest`, `memory_ingest_status`.
+  `sql_execution` will work for DuckDB once the connector exists; today
+  it has no transport for it.
+- The sqlite connector at `packages/connector-sqlite/src/connector.ts`
+  is the closest template for DuckDB.
+- `packages/context/src/ingest/adapters/dbt/` is the dbt adapter that
+  generates the wiki pages — `parse.ts` reads `dbt_project.yml`,
+  `schema.yml`, models; `chunk.ts` breaks them into work units;
+  `dbt.adapter.ts` orchestrates.
+- Evaluator code is at
+  `Spider2/spider2-dbt/evaluation_suite/{evaluate.py, eval_utils.py}`.
+  `duckdb_match` is the only function that matters here.
+- Spider2 paper: https://arxiv.org/abs/2411.07763
+
+---
+
+## 11. Quick sanity checks for a fresh agent
+
+```bash
+# Toolchain
+which node pnpm uv claude
+source /Users/klo-dev/work/spider2-ktx/.venv/bin/activate && python -c "import dbt, duckdb, anthropic"
+
+# KTX CLI build still works
+cd /Users/klo-dev/conductor/workspaces/ktx/santiago
+pnpm run ktx -- --help
+
+# Orchestrator runnable
+cd /Users/klo-dev/work/spider2-ktx
+source .venv/bin/activate
+python orchestrator.py -h
+
+# The pilot results still score 1 / 5 (only playbook001 passes)
+cd Spider2/spider2-dbt/evaluation_suite
+python evaluate.py --result_dir ../../../results --gold_dir ./gold
+# expects: 0.2 1 5 (current state)
+```
+
+If any of those fail before you do anything else, the environment has
+drifted — fix that before adding the connector.
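The toolchain part of these checks could also be run from Python, for instance at the top of the orchestrator, so that a drifted environment fails fast. A minimal sketch, assuming the four tool names from the `which` line above are the full requirement; `REQUIRED_TOOLS` and `missing_tools` are hypothetical names, not part of `orchestrator.py`.

```python
import shutil

# Tool names taken from the `which node pnpm uv claude` check above.
REQUIRED_TOOLS = ("node", "pnpm", "uv", "claude")


def missing_tools(tools=REQUIRED_TOOLS) -> list[str]:
    """Return the tools that cannot be found on PATH, in the order given."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    print("missing:", ", ".join(missing) if missing else "none")
```

This only confirms that the executables are on `PATH`; it does not check versions or that the Claude Code OAuth session is still valid.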