docs: add spider2 dbt benchmark handoff

2026-06-10 08:05:14 +02:00 · 2026-05-18 14:12:59 +02:00 · 2026-05-18 14:12:59 +02:00 · e07262b2b9
commit e07262b2b9
parent cd319861c1
1 changed files with 544 additions and 0 deletions
--- a/docs/spider2-dbt-benchmark.md
+++ b/docs/spider2-dbt-benchmark.md
@@ -0,0 +1,544 @@
+# Spider2-DBT × KTX benchmarking — handoff
+
+This document is the state of the Spider2-DBT benchmarking experiment as of
+2026-05-18. It is written so that a fresh agent can pick up the work,
+particularly after adding a DuckDB scan connector to KTX.
+
+---
+
+## 1. What we are benchmarking
+
+[Spider2-SQL](https://spider2-sql.github.io/) is an ICLR 2025 oral benchmark
+for "real-world enterprise text-to-SQL workflows". It has three tracks:
+
+- **Spider2.0-Snow** — 547 examples, Snowflake.
+- **Spider2.0-Lite** — 547 examples, BigQuery / Snowflake / SQLite.
+- **Spider2.0-DBT** — 68 examples, **DuckDB-backed dbt projects**.
+
+We are participating in the **DBT track**. Public baselines, as reported for
+Spider2-SQL (the table does not break out DBT-track numbers):
+
+| Method          | Spider2-SQL |
+|-----------------|-------------|
+| GPT-4o          | ~10%        |
+| o1-preview      | ~17%        |
+| Top published   | ~30–40%     |
+
+Repo: `https://github.com/xlang-ai/Spider2`; the DBT track is under
+`spider2-dbt/`.
+
+### Task format
+
+Each instance is a self-contained dbt project (`dbt_project.yml`,
+`profiles.yml`, `models/`, sometimes `seeds/`, `macros/`, `dbt_packages/`)
+plus a `.duckdb` file pre-loaded with raw source tables. The instruction is
+a single underspecified natural-language sentence, e.g.:
+
+> "Complete the project of this database to show the metrics of each traffic
+> source, I believe every touchpoint in the conversion path is equally
+> important, please choose the most suitable attribution method."
+
+The agent must edit/add models and run `dbt build` until the warehouse
+contains the required tables. **All 68 instances are evaluated with
+`duckdb_match`**: the official evaluator diffs specific columns of specific
+tables in the agent's DuckDB against a gold DuckDB. A pass is a row-set match
+on `condition_cols` for each `condition_tab`.
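
A minimal sketch of what a row-set match on `condition_cols` could look like, written against plain Python tuples rather than DuckDB. The official logic lives in `eval_utils.py` and may differ (for example in numeric tolerance or in how duplicates are treated); `project` and `rows_match` are hypothetical names used only for illustration.

```python
from collections import Counter
from typing import Sequence

Row = Sequence[object]


def project(rows: Sequence[Row], cols: Sequence[int]) -> Counter:
    """Keep only the columns at the given positions and count each projected row."""
    return Counter(tuple(row[c] for c in cols) for row in rows)


def rows_match(pred: Sequence[Row], gold: Sequence[Row], condition_cols: Sequence[int]) -> bool:
    """True when both tables hold the same multiset of rows on condition_cols.

    Assumption: row order is ignored and duplicates count; the real evaluator
    may apply different rules.
    """
    return project(pred, condition_cols) == project(gold, condition_cols)


gold = [("2024-01", "google", 10.0, "x"), ("2024-01", "bing", 4.0, "y")]
pred = [("2024-01", "bing", 4.0, "other"), ("2024-01", "google", 10.0, "other")]

print(rows_match(pred, gold, [0, 1, 2]))  # row order and the unchecked column 3 are ignored
print(rows_match(pred, gold, [0, 1, 3]))  # column 3 differs, so this comparison fails
```

Under these assumptions a run can pass even when unchecked columns differ, which is why only the listed column positions matter when debugging a failing instance.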
+
+`spider2-dbt.jsonl` (instructions) and `evaluation_suite/gold/spider2_eval.jsonl`
+(evaluator config + gold DuckDBs) are both clone-time artifacts.
+
+---
+
+## 2. On-disk layout
+
+Everything benchmark-related lives outside this repo at
+`/Users/klo-dev/work/spider2-ktx/`:
+
+```
+/Users/klo-dev/work/spider2-ktx/
+├── .venv/                              # Python 3.11 (uv-managed)
+├── Spider2/                            # clone of github.com/xlang-ai/Spider2
+│   └── spider2-dbt/
+│       ├── examples/                   # 69 dirs, 68 are in spider2-dbt.jsonl
+│       │   ├── playbook001/dbt_project.yml ...
+│       │   └── ...
+│       ├── examples/spider2-dbt.jsonl  # 68 instance instructions
+│       ├── evaluation_suite/
+│       │   ├── evaluate.py             # official scorer
+│       │   ├── eval_utils.py           # duckdb_match, table_match, ...
+│       │   └── gold/                   # gold .duckdb per instance
+│       └── setup.py                    # unpacks DBT_start_db.zip + dbt_gold.zip
+├── orchestrator.py                     # main runner (see §4)
+├── agent_prompt.md                     # system prompt written by orchestrator
+├── work/                               # per-instance workspaces (ktx + dbt)
+│   └── <instance_id>/
+│       ├── ktx.yaml                    # generated by orchestrator
+│       ├── .ktx/                       # ktx state (sqlite, git, cache)
+│       ├── wiki/global/*.md            # OUTPUT of `ktx ingest dbt_project`
+│       ├── semantic-layer/             # empty today (no DuckDB connector)
+│       └── dbt/                        # copy of Spider2/spider2-dbt/examples/<id>
+│           ├── dbt_project.yml
+│           ├── profiles.yml
+│           ├── models/...
+│           └── <name>.duckdb
+├── results/                            # submission folder
+│   ├── results_metadata.jsonl
+│   └── <instance_id>/<name>.duckdb
+└── logs/<instance_id>/
+    ├── ktx-init.log
+    ├── ktx-ingest.log
+    ├── claude.log                      # stderr from the sub-agent
+    └── claude-stream.jsonl             # full structured trace
+```
+
+The two source-data zips (~1 GB) were pulled with `gdown` from the Drive
+IDs in `Spider2/spider2-dbt/setup.py` and then `setup.py` was run to unpack
+them in place. No need to re-do that step.
+
+---
+
+## 3. Current ktx.yaml per instance
+
+Generated by `orchestrator.write_ktx_yaml()`. Same template for every
+workspace, with the absolute `source_dir` path swapped in:
+
+```yaml
+connections:
+  dbt_project:
+    driver: dbt
+    source_dir: /Users/klo-dev/work/spider2-ktx/work/<id>/dbt
+storage:
+  state: sqlite
+  search: sqlite-fts5
+  git:
+    auto_commit: false
+    author: ktx <ktx@example.com>
+llm:
+  provider:
+    backend: claude-code      # uses local Claude Code OAuth — no API key
+  models:
+    default: sonnet
+    triage: haiku
+    candidateExtraction: sonnet
+    curator: sonnet
+    reconcile: sonnet
+    repair: sonnet
+ingest:
+  adapters: [dbt]
+  embeddings:
+    backend: deterministic
+    model: deterministic
+    dimensions: 8
+  workUnits:
+    stepBudget: 40
+    maxConcurrency: 1
+    failureMode: continue
+agent:
+  run_research:
+    enabled: false
+    max_iterations: 20
+    default_toolset: [sl_query, wiki_search, sl_read_source]
+memory:
+  auto_commit: false
+scan:
+  enrichment: { mode: none }
+  relationships:
+    enabled: false        # disabled — no warehouse to relate against
+    llmProposals: false
+```
+
+Notes / gotchas learned the hard way:
+
+- `source_dir` **must be absolute** and **must not be the same as
+  `--project-dir`** (the dbt adapter copies the dir into
+  `.ktx/cache/local-ingest/` and refuses to recursively copy a parent into
+  itself). Hence the `work/<id>/dbt/` sub-structure.
+- `llm.provider.backend: none` (the `dev init` default) makes `ktx ingest`
+  on the dbt adapter fail with `"requires llm.provider.backend: anthropic,
+  vertex, gateway, or claude-code"`. The dbt adapter is LLM-driven.
+- `llm.models.default` is **required** whenever `provider.backend != none`.
+- `claude-code` backend reuses the local Claude Code OAuth session, so no
+  `ANTHROPIC_API_KEY` env var is needed.
+- `--yes` and `--no-input` are mutually exclusive on `ktx ingest`.
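
The gotchas above can be turned into a small pre-flight check. This is a sketch only: `check_ktx_config` is a hypothetical helper, not part of ktx or the orchestrator, and it covers just the constraints listed in the notes, operating on an already-parsed config dict.

```python
from pathlib import PurePosixPath
from typing import List


def check_ktx_config(config: dict, project_dir: str) -> List[str]:
    """Return human-readable problems; an empty list means no known gotcha was hit."""
    problems = []
    llm = config.get("llm", {})
    backend = llm.get("provider", {}).get("backend", "none")

    for name, conn in config.get("connections", {}).items():
        if conn.get("driver") != "dbt":
            continue
        source_dir = PurePosixPath(conn.get("source_dir", ""))
        if not source_dir.is_absolute():
            problems.append(f"{name}: source_dir must be absolute")
        elif source_dir == PurePosixPath(project_dir):
            problems.append(f"{name}: source_dir must differ from the project dir")
        if backend == "none":
            problems.append(f"{name}: the dbt adapter needs an LLM backend")

    if backend != "none" and "default" not in llm.get("models", {}):
        problems.append("llm.models.default is required when a backend is set")
    return problems


config = {
    "connections": {"dbt_project": {"driver": "dbt", "source_dir": "/tmp/work/playbook001/dbt"}},
    "llm": {"provider": {"backend": "claude-code"}, "models": {"default": "sonnet"}},
}
print(check_ktx_config(config, "/tmp/work/playbook001"))
```

Running a check like this before `ktx ingest` would surface a misconfigured workspace before any LLM budget is spent.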
+
+---
+
+## 4. Orchestrator
+
+`/Users/klo-dev/work/spider2-ktx/orchestrator.py` — the one moving part.
+
+Per-instance flow:
+
+1. `make_workspace(id)` — copy `Spider2/spider2-dbt/examples/<id>/` into
+   `work/<id>/dbt/`.
+2. `ktx dev init <work/<id>>` and write the ktx.yaml above.
+3. `ktx ingest dbt_project --plain --yes` — runs the LLM-driven dbt
+   adapter; output lands in `work/<id>/wiki/global/*.md`.
+4. Spawn `claude --print --permission-mode bypassPermissions ...` with
+   - cwd = `work/<id>/dbt` (the agent works inside the dbt project)
+   - `--add-dir work/<id>` (so the agent can read the wiki)
+   - `--allowedTools Bash,Edit,Read,Write,Glob,Grep,WebFetch,TodoWrite`
+   - `--system-prompt` from `SYSTEM_PROMPT` (see `agent_prompt.md`)
+   - the prompt is the Spider2 instruction.
+5. Stream the agent's JSONL events to `logs/<id>/claude-stream.jsonl`,
+   capture the final `result` message as a summary string.
+6. `collect_result()` — copy the largest `*.duckdb` in `work/<id>/dbt/`
+   into `results/<id>/<name>.duckdb` and add an entry
+   `{instance_id, answer_type: "file", answer_or_path: "<name>.duckdb"}`
+   to `results/results_metadata.jsonl`. Metadata is re-written after every
+   instance, so partial runs are recoverable.
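
Step 6 can be sketched as follows. This is an illustrative re-implementation based only on the description above, not the orchestrator's actual `collect_result()`; the signature and the in-memory `metadata` dict are assumptions.

```python
import json
import shutil
from pathlib import Path
from typing import Dict, Optional


def collect_result(instance_id: str, dbt_dir: Path, results_dir: Path,
                   metadata: Dict[str, dict]) -> Optional[Path]:
    """Copy the largest *.duckdb from dbt_dir into results_dir/<instance_id>/
    and rewrite results_metadata.jsonl from the accumulated metadata."""
    candidates = sorted(dbt_dir.glob("*.duckdb"), key=lambda p: p.stat().st_size)
    if not candidates:
        return None  # nothing to submit for this instance

    src = candidates[-1]  # largest file by size
    dest = results_dir / instance_id / src.name
    dest.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(src, dest)

    metadata[instance_id] = {
        "instance_id": instance_id,
        "answer_type": "file",
        "answer_or_path": src.name,
    }
    # Rewrite the whole file each time so a partial run still leaves valid metadata.
    lines = [json.dumps(entry) for entry in metadata.values()]
    (results_dir / "results_metadata.jsonl").write_text("\n".join(lines) + "\n")
    return dest
```

Rewriting the full metadata file after each instance keeps the sketch simple; if instances were ever run in parallel, that write would need a lock.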
+
+CLI:
+
+```bash
+cd /Users/klo-dev/work/spider2-ktx
+source .venv/bin/activate
+
+# One specific instance
+python orchestrator.py -n playbook001 --force --budget 3 --timeout 1500
+
+# All 68
+python orchestrator.py --budget 3 --timeout 1500 --evaluate
+
+# Skip ingest (when workspace already has wiki) — speeds re-runs
+python orchestrator.py -n provider001 --skip-ingest
+```
+
+Flags:
+
+| Flag | Default | Meaning |
+|------|---------|---------|
+| `-n, --instance` | none | Repeatable; restrict to listed instance ids |
+| `-l, --limit`    | none | First N from spider2-dbt.jsonl |
+| `--model`        | sonnet | Claude Code model alias |
+| `--budget`       | 4.0  | `--max-budget-usd` per instance |
+| `--timeout`      | 1800 | Wall-clock seconds per instance |
+| `--force`        | off  | Wipe and recreate workspace |
+| `--skip-ingest`  | off  | Reuse existing wiki |
+| `--evaluate`     | off  | Run `evaluate.py` at the end |
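
For reference, the flag table maps onto `argparse` roughly as below. This is a sketch reconstructed from the table alone; the real parser in `orchestrator.py` may declare the flags differently, and `build_parser` is a hypothetical name.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="orchestrator.py")
    parser.add_argument("-n", "--instance", action="append", default=None,
                        help="Repeatable; restrict to listed instance ids")
    parser.add_argument("-l", "--limit", type=int, default=None,
                        help="First N from spider2-dbt.jsonl")
    parser.add_argument("--model", default="sonnet", help="Claude Code model alias")
    parser.add_argument("--budget", type=float, default=4.0,
                        help="--max-budget-usd per instance")
    parser.add_argument("--timeout", type=int, default=1800,
                        help="Wall-clock seconds per instance")
    parser.add_argument("--force", action="store_true", help="Wipe and recreate workspace")
    parser.add_argument("--skip-ingest", action="store_true", help="Reuse existing wiki")
    parser.add_argument("--evaluate", action="store_true", help="Run evaluate.py at the end")
    return parser


args = build_parser().parse_args(
    ["-n", "playbook001", "--force", "--budget", "3", "--timeout", "1500"]
)
print(args.instance, args.budget, args.timeout, args.force, args.skip_ingest)
```

Using `action="append"` is one way to make `-n` repeatable, matching the "Repeatable" note in the table.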
+
+Scoring:
+
+```bash
+cd /Users/klo-dev/work/spider2-ktx/Spider2/spider2-dbt/evaluation_suite
+python evaluate.py \
+    --result_dir /Users/klo-dev/work/spider2-ktx/results \
+    --gold_dir ./gold
+```
+
+The official evaluator prints `score = passes / total`, and one line per
+passing instance id.
+
+---
+
+## 5. What ktx currently provides to the agent
+
+`ktx ingest dbt_project --plain --yes` on a Spider2 example emits only
+**wiki pages** under `work/<id>/wiki/global/*.md`. There are **no
+semantic-layer entities** — `ktx sl list` returns `items: []`.
+
+Example, for `playbook001`:
+
+```
+work/playbook001/wiki/global/
+├── acme-dbt-project.md         # project overview: profile, sources, models
+└── cpa-roas-definitions.md     # exact CPA & ROAS formulas, grain, columns
+```
+
+For `asset001`:
+
+```
+work/asset001/wiki/global/
+├── dbt-asset-project-overview.md
+├── bar-quotes.md
+└── book-value.md
+```
+
+The wiki pages **do** carry high-signal information for these tasks — they
+pre-digest the dbt project into prose with formulas, grain, columns, and
+unverified-vs-verified annotations. That's what made `playbook001` score
+1.0: the wiki said `CPA = total_spend / attribution_points`, `ROAS =
+attribution_revenue / total_spend`, grain `(date_month, utm_source)`, and
+the agent transcribed that into `cpa_and_roas.sql` directly.
+
+The wiki itself flags the missing piece:
+
+> "Run `ktx scan` on the DuckDB connection to populate the warehouse schema
+> and enable SL source creation for these tables."
+
+Which brings us to:
+
+---
+
+## 6. The DuckDB connector gap
+
+`ktx` ships connectors for: `postgres / postgresql / mysql / snowflake /
+bigquery / sqlite / sqlserver / clickhouse`. **There is no DuckDB scan
+connector**. References:
+
+- `packages/cli/src/connection.test.ts:494` — `driver: duckdb` is asserted
+  to be **unknown** by `createKtxCliScanConnector`.
+- `packages/context/src/sl/local-query.ts:59` — `DUCKDB: 'duckdb'` is a SQL
+  *dialect* constant for query generation, not a connector.
+- `packages/context/src/mcp/local-project-ports.ts:32` — same: dialect
+  hint, not a connector.
+
+Consequence: with the current setup we can't add a warehouse connection
+that introspects each example's `.duckdb`. The dbt adapter falls back to
+wiki-only output, which is why `semantic-layer/` stays empty.
+
+### The plan you're about to act on
+
+Add `packages/connector-duckdb/` modeled on `packages/connector-sqlite/`:
+
+| File | Source to copy from | Adapt |
+|------|---------------------|-------|
+| `package.json` | `connector-sqlite/package.json` | dep `better-sqlite3` → `duckdb` (or `@duckdb/node-api`) |
+| `src/dialect.ts` | `connector-sqlite/src/dialect.ts` | Quote with `"`; map types: `BIGINT → number`, `VARCHAR → string`, `TIMESTAMP → time`, etc. |
+| `src/connector.ts` | `connector-sqlite/src/connector.ts` | Replace `Database` with the DuckDB equivalent. Use `information_schema` instead of `sqlite_master`/`PRAGMA table_info`. For FKs DuckDB also has `information_schema.referential_constraints` + `key_column_usage`. Estimated row counts → `SELECT estimated_size FROM duckdb_tables()`. |
+| `src/index.ts` | `connector-sqlite/src/index.ts` | Re-export, plus `isKtxDuckDbConnectionConfig` |
+| `src/connector.test.ts` + `dialect.test.ts` | sqlite equivalents | Mirror tests; the sqlite ones are a good template for what to cover |
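
The type mapping in the `src/dialect.ts` row can be illustrated as follows. The real file would be TypeScript; Python is used here only to show the logic. Only the `BIGINT`, `VARCHAR`, and `TIMESTAMP` entries come from the plan above; the other entries, the `kind_for` name, and the fallback to `string` are assumptions.

```python
import re

DUCKDB_TYPE_TO_KIND = {
    # From the plan above.
    "BIGINT": "number",
    "VARCHAR": "string",
    "TIMESTAMP": "time",
    # Assumptions: plausible extensions, not confirmed against ktx.
    "INTEGER": "number",
    "DOUBLE": "number",
    "DECIMAL": "number",
    "DATE": "time",
}


def kind_for(duckdb_type: str, default: str = "string") -> str:
    """Map a DuckDB column type such as 'DECIMAL(18,2)' to a coarse kind.

    Type parameters in trailing parentheses are stripped before lookup;
    unknown types fall back to `default` (an assumption).
    """
    base = re.sub(r"\(.*\)$", "", duckdb_type.strip()).upper()
    return DUCKDB_TYPE_TO_KIND.get(base, default)


for column_type in ["BIGINT", "decimal(18,2)", "TIMESTAMP", "VARCHAR", "BLOB"]:
    print(column_type, "->", kind_for(column_type))
```

Stripping the parameter list matters because `information_schema.columns` reports parametrised types such as `DECIMAL(18,2)` in full.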
+
+Then wire it up:
+
+1. `packages/cli/src/local-scan-connectors.ts` — add a branch for
+   `driver === 'duckdb'`, mirroring the sqlite branch.
+2. `packages/context/src/project/driver-schemas.ts` — extend
+   `KTX_WAREHOUSE_DRIVERS` with `duckdb`. Connection config takes the same
+   shape as sqlite (`path` or `url`).
+3. Add to `pnpm-workspace.yaml` if it isn't auto-discovered.
+4. `pnpm install && pnpm --filter @ktx/connector-duckdb run build && pnpm
+   --filter @ktx/cli run build`.
+
+Smoke test on `playbook001`:
+
+```bash
+cd /Users/klo-dev/work/spider2-ktx/work/playbook001
+# edit ktx.yaml — add a duckdb connection pointing at the warehouse:
+#   connections:
+#     warehouse:
+#       driver: duckdb
+#       path: /Users/klo-dev/work/spider2-ktx/work/playbook001/dbt/playbook.duckdb
+node /Users/klo-dev/conductor/workspaces/ktx/santiago/packages/cli/dist/bin.js \
+    connection test warehouse
+node ... scan warehouse           # populates raw-sources/
+node ... ingest dbt_project --plain --yes   # should now write semantic-layer/*.yaml
+node ... sl list --json
+```
+
+After that, update `orchestrator.write_ktx_yaml()` to also emit a
+`warehouse` connection per instance, pointing at
+`work/<id>/dbt/<name>.duckdb`. The `<name>` differs per instance (e.g.
+`playbook.duckdb`, `asset.duckdb`); the orchestrator already has
+`discover_duckdb_name()` for that.
+
+---
+
+## 7. Results so far (5-instance pilot)
+
+Final score: **1 / 5 = 20%** on the official `evaluate.py`.
+
+| Instance      | Agent finished | Time (s) | Cost (USD) | Turns | Tool calls                                  | Eval |
+|---------------|----------------|----------|------------|-------|---------------------------------------------|------|
+| playbook001   | OK             | 82       | $0.28      | 30    | Bash 12, Read 5, Write 1, Edit 1            | ✅ 1.0 |
+| provider001   | OK             | 289      | $0.57      | 44    | Bash 16, Read 7, Edit 1, Write 2            | ❌ 0 |
+| asana001      | OK             | 181      | $0.54      | 44    | Bash 24, Read 1, Write 2, Edit 1            | ❌ 0 |
+| shopify001    | OK             | 133      | $0.50      | 41    | Bash 13, Read 15, Write 2                    | ❌ 0 |
+| asset001      | OK             | 189      | $0.42      | 44    | Bash 14, Read 14, Write 2                    | ❌ 0 |
+
+Total spend on the pilot: ≈ $2.30. Mean: ~175 s, ~$0.46, ~40 turns.
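
The totals and means above follow directly from the table; a short check, using only the figures from the table:

```python
# (time in seconds, cost in USD, turns) per pilot instance, copied from the table.
pilot = {
    "playbook001": (82, 0.28, 30),
    "provider001": (289, 0.57, 44),
    "asana001": (181, 0.54, 44),
    "shopify001": (133, 0.50, 41),
    "asset001": (189, 0.42, 44),
}

n = len(pilot)
total_cost = sum(cost for _, cost, _ in pilot.values())
mean_time = sum(seconds for seconds, _, _ in pilot.values()) / n
mean_cost = total_cost / n
mean_turns = sum(turns for _, _, turns in pilot.values()) / n

print(f"total ${total_cost:.2f}, mean {mean_time:.0f} s, ${mean_cost:.2f}, {mean_turns:.1f} turns")
```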
+
+**All five agent runs finished cleanly** — `dbt build` green, every target
+table materialised in the DuckDB. The four failures appear to be
+*value-level* mismatches (column orderings, tie-breaks, NULL handling, or
+precision/rounding diverging from gold), although none has been investigated
+in depth. That is the failure mode that richer ktx context (real column
+dtypes, sample values, primary keys, SL measures) should address.
+
+For reference, GPT-4o reported ~10% and o1-preview ~17%, so 20% on n=5 is
+roughly in band, but the sample is far too small to claim a delta.
+
+### Why playbook001 passed
+
+The wiki page `cpa-roas-definitions.md` pre-derived:
+
+```
+CPA  = total_spend / attribution_points          (column: cost_per_acquisition)
+ROAS = attribution_revenue / total_spend         (column: return_on_advertising_spend)
+Grain: (date_month, utm_source)
+```
+
+The agent read this page (via `KTX_PROJECT_DIR=.. ktx wiki list --json`
+then plain `Read` on `../wiki/global/cpa-roas-definitions.md`), wrote the
+missing `models/cpa_and_roas.sql` directly from it, and `dbt build`
+produced the correct table.
+
+### Why the others failed (best guesses, not investigated deeply)
+
+- `provider001`: gold checks `provider` table columns
+  `[0,1,2,5,6,7,9,10,11,12,13]` and `specialty_mapping` columns `[0,1]`.
+  All 7 tables are produced with the right schema; the tie-break logic for
+  "most specific specialty" diverges from gold.
+- `asana001`: 95 models materialised, 55 tests passed; the gold compares
+  `asana__team [0..9]` and `asana__user [0,1,2]` and our values differ on
+  one or more aggregations (open vs completed task counts, avg close
+  time).
+- `shopify001` and `asset001`: similar pattern — structure right, values
+  off.
+
+---
+
+## 8. Hypotheses for the next agent
+
+In rough order of expected impact:
+
+1. **DuckDB connector** (above) so `ktx scan` and `ktx ingest` together
+   emit `semantic-layer/<conn>/<source>.yaml` with real columns, types,
+   primary keys, sample values, and (if enabled) relationship proposals.
+   Expose those to the sub-agent via either:
+   - `ktx sl read <source>` calls from Bash, or
+   - the `ktx mcp stdio` server attached via `claude --mcp-config`.
+
+2. **Verification step in the system prompt** — currently the agent
+   declares success on `dbt build` green. Add: "Before declaring success,
+   for every target table run `SELECT * FROM <t> ORDER BY 1 LIMIT 5` and
+   sanity-check column count, types, no NaN/NULL in not-null cols, row
+   count > 0; also compare the produced column names with the column list
+   in the schema.yml / wiki / sl source." Cheap fix; should turn some of
+   the value-mismatch fails into passes (or into productive iteration).
+
+3. **dbt run-stage tests** — Spider2 examples often ship `tests/`; have
+   the agent treat any new test failures as a signal to revise. Note that
+   `dbt build` already runs tests alongside models, so a follow-up
+   `dbt test` mainly serves to re-check after a fix. Some examples actually
+   have gold-verifying tests in the project itself.
+
+4. **Try Opus for hard cases** — the orchestrator passes `--model
+   sonnet`; flipping to `opus` on the retries of failed instances may
+   recover some of the value-mismatch tasks. Cost goes up ~5×.
+
+5. **`ktx scan` query-history off** — currently `query-history` is
+   `skipped` because the dbt adapter doesn't expose history. Once a
+   warehouse connection exists, leave it skipped (DuckDB has no useful
+   history for these one-shot DBs).
+
+6. **Parallelism** — `claude` is rate-limit-sensitive but the orchestrator
+   is fully sequential. Rate limits permitting, three workers via
+   `concurrent.futures` would cut wall-clock for the full 68 from ~3.5 h to
+   a little over 1 h; two workers would bring it under 2 h.
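
Hypothesis 6 could be prototyped along these lines. This is a sketch under assumptions: `run_all` is a hypothetical helper, and `run_instance` stands in for whatever per-instance function the orchestrator exposes. Threads are used because each instance mostly waits on subprocesses.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, Iterable


def run_all(instance_ids: Iterable[str],
            run_instance: Callable[[str], str],
            workers: int = 3) -> Dict[str, str]:
    """Run run_instance for every id on a small thread pool and record a status per id."""
    results: Dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(run_instance, iid): iid for iid in instance_ids}
        for future in as_completed(futures):
            iid = futures[future]
            try:
                results[iid] = future.result()
            except Exception as exc:  # one failing instance should not stop the batch
                results[iid] = f"error: {exc}"
    return results


def fake_run(instance_id: str) -> str:
    """Stand-in for the real per-instance flow, used only to exercise the pool."""
    if instance_id == "bad001":
        raise RuntimeError("boom")
    return "ok"


print(run_all(["a001", "bad001", "b001"], fake_run, workers=2))
```

Collecting statuses in the main thread, as here, avoids concurrent writes to shared state; any per-instance write to `results_metadata.jsonl` from worker threads would need a lock.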
+
+---
+
+## 9. Resuming after the DuckDB connector lands
+
+Concrete steps for the next agent:
+
+1. **Confirm the connector is wired up**:
+   ```bash
+   cd /Users/klo-dev/conductor/workspaces/ktx/santiago
+   pnpm run build
+   pnpm run ktx -- dev schema | jq '.properties.connections.additionalProperties.oneOf[].properties.driver.const' | sort -u
+   # should include "duckdb"
+   ```
+
+2. **Update the orchestrator's ktx.yaml template** in
+   `/Users/klo-dev/work/spider2-ktx/orchestrator.py` (`write_ktx_yaml`).
+   Pseudocode:
+
+   ```python
+   db_name = discover_duckdb_name(ws / "dbt")     # e.g. "playbook.duckdb"
+   # Connections fragment of the ktx.yaml template; the rest is unchanged.
+   connections_yaml = f"""\
+   connections:
+     dbt_project:
+       driver: dbt
+       source_dir: {ws}/dbt
+     warehouse:
+       driver: duckdb
+       path: {ws}/dbt/{db_name}
+   """
+   ```
+
+   Also re-enable scan relationship discovery if it gives useful output:
+   ```yaml
+   scan:
+     enrichment: { mode: deterministic }
+     relationships:
+       enabled: true
+       llmProposals: true
+   ```
+
+3. **Verify on a known-passing instance first** (`playbook001`) to make
+   sure the dbt+warehouse combo still emits the same wiki pages it did
+   before, plus new SL YAML, and the score stays at 1.0:
+
+   ```bash
+   cd /Users/klo-dev/work/spider2-ktx
+   source .venv/bin/activate
+   python orchestrator.py -n playbook001 --force --budget 3 --timeout 1500
+   cd Spider2/spider2-dbt/evaluation_suite
+   python evaluate.py --result_dir ../../../results --gold_dir ./gold
+   ```
+
+4. **Optionally** improve the system prompt in `orchestrator.SYSTEM_PROMPT`
+   to instruct the agent to use SL tools:
+   - `ktx sl list --json`
+   - `ktx sl read <source>`
+   - `ktx sl query --connection-id warehouse --measure ...`
+
+5. **Re-run a small batch** with diverse failures (`provider001`,
+   `asana001`, `shopify001`, `asset001`) to see whether SL access lifts
+   those scores from 0 → 1. If it moves the needle, run the full 68:
+
+   ```bash
+   python orchestrator.py --budget 3 --timeout 1500 --evaluate
+   ```
+
+   Sequential 68 × ~3 min ≈ 3.5 h, and roughly $30 if the pilot's mean of
+   ~$0.46 per instance holds.
+
+6. **Write the result back** — append a section to this doc with the new
+   score and a one-line note per failing instance, so we accumulate
+   evidence over iterations rather than losing it.
+
+---
+
+## 10. Misc references
+
+- KTX MCP tools (see `packages/context/src/mcp/context-tools.ts`):
+  `connection_list`, `wiki_search`, `wiki_read`, `sl_read_source`,
+  `sl_query`, `entity_details`, `dictionary_search`, `discover_data`,
+  `sql_execution`, `memory_ingest`, `memory_ingest_status`.
+  `sql_execution` will work for DuckDB once the connector exists; today
+  it has no transport for it.
+- The sqlite connector at `packages/connector-sqlite/src/connector.ts`
+  is the closest template for DuckDB.
+- `packages/context/src/ingest/adapters/dbt/` is the dbt adapter that
+  generates the wiki pages — `parse.ts` reads `dbt_project.yml`,
+  `schema.yml`, models; `chunk.ts` breaks them into work units;
+  `dbt.adapter.ts` orchestrates.
+- Evaluator code is at
+  `Spider2/spider2-dbt/evaluation_suite/{evaluate.py, eval_utils.py}`.
+  `duckdb_match` is the only function that matters here.
+- Spider2 paper: https://arxiv.org/abs/2411.07763
+
+---
+
+## 11. Quick sanity checks for a fresh agent
+
+```bash
+# Toolchain
+which node pnpm uv claude
+source /Users/klo-dev/work/spider2-ktx/.venv/bin/activate && python -c "import dbt, duckdb, anthropic"
+
+# KTX CLI build still works
+cd /Users/klo-dev/conductor/workspaces/ktx/santiago
+pnpm run ktx -- --help
+
+# Orchestrator runnable
+cd /Users/klo-dev/work/spider2-ktx
+source .venv/bin/activate
+python orchestrator.py -h
+
+# Previous results still score as before (playbook001 passes)
+cd Spider2/spider2-dbt/evaluation_suite
+python evaluate.py --result_dir ../../../results --gold_dir ./gold
+# expects: 0.2 1 5 (current state)
+```
+
+If any of those fail before you do anything else, the environment has
+drifted — fix that before adding the connector.