Spider2-DBT × KTX benchmarking — handoff
This document is the state of the Spider2-DBT benchmarking experiment as of 2026-05-18. It is written so that a fresh agent can pick up the work, particularly after adding a DuckDB scan connector to KTX.
1. What we are benchmarking
Spider2-SQL is an ICLR 2025 oral benchmark for "real-world enterprise text-to-SQL workflows". It has three tracks:
- Spider2.0-Snow — 547 examples, Snowflake.
- Spider2.0-Lite — 547 examples, BigQuery / Snowflake / SQLite.
- Spider2.0-DBT — 68 examples, DuckDB-backed dbt projects.
We are participating in the DBT track. Public baselines:
| Method | Spider2-SQL |
|---|---|
| GPT-4o | ~10% |
| o1-preview | ~17% |
| Top published | ~30–40% |
Repo: https://github.com/xlang-ai/Spider2; the DBT track is under
spider2-dbt/.
Task format
Each instance is a self-contained dbt project (dbt_project.yml,
profiles.yml, models/, sometimes seeds/, macros/, dbt_packages/)
plus a .duckdb file pre-loaded with raw source tables. The instruction is
a single underspecified natural-language sentence, e.g.:
"Complete the project of this database to show the metrics of each traffic source, I believe every touchpoint in the conversion path is equally important, please choose the most suitable attribution method."
The agent must edit/add models and run dbt build until the warehouse
contains the required tables. All 68 instances are evaluated with
duckdb_match: the official evaluator diffs specific columns of specific
tables in the agent's DuckDB against a gold DuckDB. A pass is row-set match
on condition_cols for each condition_tab.
spider2-dbt.jsonl (instructions) and evaluation_suite/gold/spider2_eval.jsonl
(evaluator config + gold DuckDBs) are both clone-time artifacts.
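To make the pass criterion concrete, here is a minimal sketch of a row-set comparison restricted to selected column indices. It is not the official duckdb_match from eval_utils.py: the name rows_match, the in-memory row representation, the multiset treatment of duplicate rows, and the float rounding are all assumptions made for illustration.

```python
from collections import Counter
from typing import Any, Iterable, Sequence


def _normalise(value: Any, ndigits: int = 6) -> Any:
    """Round floats so tiny precision differences do not break equality (assumed tolerance)."""
    if isinstance(value, float):
        return round(value, ndigits)
    return value


def rows_match(
    pred_rows: Iterable[Sequence[Any]],
    gold_rows: Iterable[Sequence[Any]],
    condition_cols: Sequence[int],
) -> bool:
    """Compare two tables as multisets of rows, looking only at condition_cols."""

    def project(rows: Iterable[Sequence[Any]]) -> Counter:
        return Counter(
            tuple(_normalise(row[i]) for i in condition_cols) for row in rows
        )

    return project(pred_rows) == project(gold_rows)


gold = [("2024-01", "google", 10.0), ("2024-01", "bing", 4.5)]
# Different row order and an extra trailing column: both are ignored here.
pred = [("2024-01", "bing", 4.5, "extra"), ("2024-01", "google", 10.0, "extra")]

print(rows_match(pred, gold, condition_cols=[0, 1, 2]))  # True
```

Under this reading, row order and unchecked columns do not matter, but a single differing value in a checked column fails the whole table. That matches the failure pattern reported in §7.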
2. On-disk layout
Everything benchmark-related lives outside this repo at
/Users/klo-dev/work/spider2-ktx/:
/Users/klo-dev/work/spider2-ktx/
├── .venv/                              # Python 3.11 (uv-managed)
├── Spider2/                            # cloned `git clone xlang-ai/Spider2`
│   └── spider2-dbt/
│       ├── examples/                   # 69 dirs, 68 are in spider2-dbt.jsonl
│       │   ├── playbook001/dbt_project.yml ...
│       │   └── ...
│       ├── examples/spider2-dbt.jsonl  # 68 instance instructions
│       ├── evaluation_suite/
│       │   ├── evaluate.py             # official scorer
│       │   ├── eval_utils.py           # duckdb_match, table_match, ...
│       │   └── gold/                   # gold .duckdb per instance
│       └── setup.py                    # unpacks DBT_start_db.zip + dbt_gold.zip
├── orchestrator.py                     # main runner (see §4)
├── agent_prompt.md                     # system prompt written by orchestrator
├── work/                               # per-instance workspaces (ktx + dbt)
│   └── <instance_id>/
│       ├── ktx.yaml                    # generated by orchestrator
│       ├── .ktx/                       # ktx state (sqlite, git, cache)
│       ├── wiki/global/*.md            # OUTPUT of `ktx ingest dbt_project`
│       ├── semantic-layer/             # empty today (no DuckDB connector)
│       └── dbt/                        # copy of Spider2/spider2-dbt/examples/<id>
│           ├── dbt_project.yml
│           ├── profiles.yml
│           ├── models/...
│           └── <name>.duckdb
├── results/                            # submission folder
│   ├── results_metadata.jsonl
│   └── <instance_id>/<name>.duckdb
└── logs/<instance_id>/
    ├── ktx-init.log
    ├── ktx-ingest.log
    ├── claude.log                      # stderr from the sub-agent
    └── claude-stream.jsonl             # full structured trace
The two source-data zips (~1 GB) were pulled with gdown from the Drive
IDs in Spider2/spider2-dbt/setup.py and then setup.py was run to unpack
them in place. No need to re-do that step.
3. Current ktx.yaml per instance
Generated by orchestrator.write_ktx_yaml(). Same template for every
workspace, with the source_dir absolute path swapped in:
connections:
dbt_project:
driver: dbt
source_dir: /Users/klo-dev/work/spider2-ktx/work/<id>/dbt
storage:
state: sqlite
search: sqlite-fts5
git:
auto_commit: false
author: ktx <ktx@example.com>
llm:
provider:
backend: claude-code # uses local Claude Code OAuth — no API key
models:
default: sonnet
triage: haiku
candidateExtraction: sonnet
curator: sonnet
reconcile: sonnet
repair: sonnet
ingest:
adapters: [dbt]
embeddings:
backend: deterministic
model: deterministic
dimensions: 8
workUnits:
stepBudget: 40
maxConcurrency: 1
failureMode: continue
agent:
run_research:
enabled: false
max_iterations: 20
default_toolset: [sl_query, wiki_search, sl_read_source]
memory:
auto_commit: false
scan:
enrichment: { mode: none }
relationships:
enabled: false # disabled — no warehouse to relate against
llmProposals: false
Notes / gotchas learned the hard way:
- source_dir must be absolute and must not be the same as --project-dir (the dbt adapter copies the dir into .ktx/cache/local-ingest/ and refuses to recursively copy a parent into itself). Hence the work/<id>/dbt/ sub-structure.
- llm.provider.backend: none (the dev init default) makes ktx ingest on the dbt adapter fail with "requires llm.provider.backend: anthropic, vertex, gateway, or claude-code". The dbt adapter is LLM-driven.
- llm.models.default is required whenever provider.backend != none.
- claude-code backend reuses the local Claude Code OAuth session, so no ANTHROPIC_API_KEY env var is needed.
- --yes and --no-input are mutually exclusive on ktx ingest.
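The source_dir gotcha is easy to trip over when generating ktx.yaml programmatically, so a guard before writing the file may help. The sketch below is an assumption about how such a check could look; check_source_dir is a hypothetical helper, not part of ktx or of the existing orchestrator, and rejecting a source_dir that is a parent of the project dir is an extrapolation from the error described above.

```python
from pathlib import Path


def check_source_dir(source_dir: str, project_dir: str) -> Path:
    """Return the resolved source_dir, or raise if it violates the gotchas above."""
    src = Path(source_dir)
    if not src.is_absolute():
        raise ValueError(f"source_dir must be absolute: {source_dir}")
    src = src.resolve()
    proj = Path(project_dir).resolve()
    # The adapter copies source_dir into <project>/.ktx/cache/local-ingest/,
    # so source_dir must not be the project dir itself or (assumed) one of its parents.
    if src == proj or src in proj.parents:
        raise ValueError(f"source_dir must not contain the project dir: {src}")
    return src


# work/<id>/dbt inside work/<id> is fine; work/<id> itself is rejected.
print(check_source_dir("/tmp/ws/dbt", "/tmp/ws").name)  # dbt
```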
4. Orchestrator
/Users/klo-dev/work/spider2-ktx/orchestrator.py — the one moving part.
Per-instance flow:
- make_workspace(id): copy Spider2/spider2-dbt/examples/<id>/ into work/<id>/dbt/.
- ktx dev init <work/<id>> and write the ktx.yaml above.
- ktx ingest dbt_project --plain --yes: runs the LLM-driven dbt adapter; output lands in work/<id>/wiki/global/*.md.
- Spawn claude --print --permission-mode bypassPermissions ... with
  - cwd = work/<id>/dbt (the agent works inside the dbt project)
  - --add-dir work/<id> (so the agent can read the wiki)
  - --allowedTools Bash,Edit,Read,Write,Glob,Grep,WebFetch,TodoWrite
  - --system-prompt from SYSTEM_PROMPT (see agent_prompt.md)
  - the prompt is the Spider2 instruction.
- Stream the agent's JSONL events to logs/<id>/claude-stream.jsonl, capture the final result message as a summary string.
- collect_result(): copy the largest *.duckdb in work/<id>/dbt/ into results/<id>/<name>.duckdb and add an entry {instance_id, answer_type: "file", answer_or_path: "<name>.duckdb"} to results/results_metadata.jsonl. Metadata is re-written after every instance, so partial runs are recoverable.
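The last step of the flow can be sketched as below. This is an illustrative reconstruction from the description above, not the code in orchestrator.py; the function signatures and the separate write_metadata helper are assumptions.

```python
import json
import shutil
from pathlib import Path


def collect_result(instance_id: str, work_root: Path, results_root: Path) -> dict:
    """Copy the largest *.duckdb from work/<id>/dbt/ into results/<id>/ and return its metadata entry."""
    dbt_dir = work_root / instance_id / "dbt"
    candidates = sorted(dbt_dir.glob("*.duckdb"), key=lambda p: p.stat().st_size)
    if not candidates:
        raise FileNotFoundError(f"no .duckdb file in {dbt_dir}")
    largest = candidates[-1]
    dest_dir = results_root / instance_id
    dest_dir.mkdir(parents=True, exist_ok=True)
    shutil.copy2(largest, dest_dir / largest.name)
    return {
        "instance_id": instance_id,
        "answer_type": "file",
        "answer_or_path": largest.name,
    }


def write_metadata(entries: list[dict], results_root: Path) -> None:
    """Rewrite results_metadata.jsonl in full, one JSON object per line."""
    results_root.mkdir(parents=True, exist_ok=True)
    path = results_root / "results_metadata.jsonl"
    path.write_text("".join(json.dumps(entry) + "\n" for entry in entries))
```

Rewriting the whole metadata file after each instance, rather than appending, keeps the file consistent with the set of completed instances if a run is interrupted and later resumed.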
CLI:
cd /Users/klo-dev/work/spider2-ktx
source .venv/bin/activate
# One specific instance
python orchestrator.py -n playbook001 --force --budget 3 --timeout 1500
# All 68
python orchestrator.py --budget 3 --timeout 1500 --evaluate
# Skip ingest (when workspace already has wiki) — speeds re-runs
python orchestrator.py -n provider001 --skip-ingest
Flags:
| Flag | Default | Meaning |
|---|---|---|
| -n, --instance | none | Repeatable; restrict to listed instance ids |
| -l, --limit | none | First N from spider2-dbt.jsonl |
| --model | sonnet | Claude Code model alias |
| --budget | 4.0 | --max-budget-usd per instance |
| --timeout | 1800 | Wall-clock seconds per instance |
| --force | off | Wipe and recreate workspace |
| --skip-ingest | off | Reuse existing wiki |
| --evaluate | off | Run evaluate.py at the end |
Scoring:
cd /Users/klo-dev/work/spider2-ktx/Spider2/spider2-dbt/evaluation_suite
python evaluate.py \
--result_dir /Users/klo-dev/work/spider2-ktx/results \
--gold_dir ./gold
The official evaluator prints score = passes / total, and one line per
passing instance id.
5. What ktx currently provides to the agent
ktx ingest dbt_project --plain --yes on a Spider2 example emits only
wiki pages under work/<id>/wiki/global/*.md. There are no
semantic-layer entities — ktx sl list returns items: [].
Example, for playbook001:
work/playbook001/wiki/global/
├── acme-dbt-project.md # project overview: profile, sources, models
└── cpa-roas-definitions.md # exact CPA & ROAS formulas, grain, columns
For asset001:
work/asset001/wiki/global/
├── dbt-asset-project-overview.md
├── bar-quotes.md
└── book-value.md
The wiki pages do carry high-signal information for these tasks — they
pre-digest the dbt project into prose with formulas, grain, columns, and
unverified-vs-verified annotations. That's what made playbook001 score
1.0: the wiki said CPA = total_spend / attribution_points, ROAS = attribution_revenue / total_spend, grain (date_month, utm_source), and
the agent transcribed that into cpa_and_roas.sql directly.
The wiki itself flags the missing piece:
"Run ktx scan on the DuckDB connection to populate the warehouse schema and enable SL source creation for these tables."
Which brings us to:
6. The DuckDB connector gap
ktx ships connectors for: postgres / postgresql / mysql / snowflake / bigquery / sqlite / sqlserver / clickhouse. There is no DuckDB scan
connector. References:
- packages/cli/src/connection.test.ts:494 asserts that driver: duckdb is unknown to createKtxCliScanConnector.
- packages/context/src/sl/local-query.ts:59 defines DUCKDB: 'duckdb', a SQL dialect constant for query generation, not a connector.
- packages/context/src/mcp/local-project-ports.ts:32 is the same: a dialect hint, not a connector.
Consequence: with the current setup we can't add a warehouse connection
that introspects each example's .duckdb. The dbt adapter falls back to
wiki-only output, which is why semantic-layer/ stays empty.
The plan you're about to act on
Add packages/connector-duckdb/ modeled on packages/connector-sqlite/:
| File | Source to copy from | Adapt |
|---|---|---|
| package.json | connector-sqlite/package.json | dep better-sqlite3 → duckdb (or @duckdb/node-api) |
| src/dialect.ts | connector-sqlite/src/dialect.ts | Quote with "; map types: BIGINT → number, VARCHAR → string, TIMESTAMP → time, etc. |
| src/connector.ts | connector-sqlite/src/connector.ts | Replace Database with the DuckDB equivalent. Use information_schema instead of sqlite_master/PRAGMA table_info. For FKs DuckDB also has information_schema.referential_constraints + key_column_usage. Estimated row counts → SELECT estimated_size FROM duckdb_tables(). |
| src/index.ts | connector-sqlite/src/index.ts | Re-export, plus isKtxDuckDbConnectionConfig |
| src/connector.test.ts + dialect.test.ts | sqlite equivalents | Mirror tests; the sqlite ones are a good template for what to cover |
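The dialect row above names two behaviours: double-quote identifier quoting and a coarse type mapping. The real connector would be TypeScript modelled on connector-sqlite; the Python sketch below only illustrates the logic. Only the BIGINT, VARCHAR and TIMESTAMP pairs come from the plan; every other entry, the fallback to string, and both function names are assumptions.

```python
import re

# DuckDB type name -> coarse column kind. Only the first three pairs come from
# the plan above; the remaining entries are assumptions for illustration.
_TYPE_MAP = {
    "BIGINT": "number",
    "VARCHAR": "string",
    "TIMESTAMP": "time",
    "INTEGER": "number",   # assumed
    "DOUBLE": "number",    # assumed
    "DECIMAL": "number",   # assumed
    "DATE": "time",        # assumed
    "BOOLEAN": "boolean",  # assumed
}


def map_duckdb_type(declared: str, default: str = "string") -> str:
    """Map a declared type such as 'DECIMAL(18,2)' to a coarse kind (unknown types fall back to default)."""
    base = re.sub(r"\(.*\)$", "", declared.strip()).upper()
    return _TYPE_MAP.get(base, default)


def quote_identifier(name: str) -> str:
    """Quote an identifier with double quotes, doubling any embedded double quote."""
    return '"' + name.replace('"', '""') + '"'


print(map_duckdb_type("decimal(18,2)"))   # number
print(quote_identifier("utm_source"))     # "utm_source"
```

Stripping the parenthesised precision before lookup keeps the table small; parameterised and nested types (LIST, STRUCT, TIMESTAMP WITH TIME ZONE) would need explicit handling in a real dialect.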
Then wire it up:
- packages/cli/src/local-scan-connectors.ts: add a branch for driver === 'duckdb', mirroring the sqlite branch.
- packages/context/src/project/driver-schemas.ts: extend KTX_WAREHOUSE_DRIVERS with duckdb. Connection config takes the same shape as sqlite (path or url).
- Add to pnpm-workspace.yaml if it isn't auto-discovered.
- pnpm install && pnpm --filter @ktx/connector-duckdb run build && pnpm --filter @ktx/cli run build.
Smoke test on playbook001:
cd /Users/klo-dev/work/spider2-ktx/work/playbook001
# edit ktx.yaml — add a duckdb connection pointing at the warehouse:
# connections:
# warehouse:
# driver: duckdb
# path: /Users/klo-dev/work/spider2-ktx/work/playbook001/dbt/playbook.duckdb
node /Users/klo-dev/conductor/workspaces/ktx/santiago/packages/cli/dist/bin.js \
connection test warehouse
node ... scan warehouse # populates raw-sources/
node ... ingest dbt_project --plain --yes # should now write semantic-layer/*.yaml
node ... sl list --json
After that, update orchestrator.write_ktx_yaml() to also emit a
warehouse connection per instance, pointing at
work/<id>/dbt/<name>.duckdb. The <name> differs per instance (e.g.
playbook.duckdb, asset.duckdb); the orchestrator already has
discover_duckdb_name() for that.
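A rough sketch of that template change follows. The discover_duckdb_name shown here is a hypothetical stand-in (it picks the largest *.duckdb file); the real helper in orchestrator.py may behave differently. render_connections is likewise an assumed name, and the warehouse keys mirror the smoke-test snippet above.

```python
from pathlib import Path


def discover_duckdb_name(dbt_dir: Path) -> str:
    """Hypothetical stand-in: return the name of the largest *.duckdb file in dbt_dir."""
    files = sorted(dbt_dir.glob("*.duckdb"), key=lambda p: p.stat().st_size)
    if not files:
        raise FileNotFoundError(f"no .duckdb file in {dbt_dir}")
    return files[-1].name


def render_connections(ws: Path) -> str:
    """Render the connections block with both the dbt project and the DuckDB warehouse."""
    dbt_dir = (ws / "dbt").resolve()  # absolute, as the dbt adapter requires
    db_name = discover_duckdb_name(dbt_dir)
    return (
        "connections:\n"
        "  dbt_project:\n"
        "    driver: dbt\n"
        f"    source_dir: {dbt_dir}\n"
        "  warehouse:\n"
        "    driver: duckdb\n"
        f"    path: {dbt_dir / db_name}\n"
    )
```

Calling render_connections(Path("/Users/klo-dev/work/spider2-ktx/work/playbook001")) would then emit a warehouse path ending in dbt/playbook.duckdb, provided that file exists in the workspace.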
7. Results so far (5-instance pilot)
Final score: 1 / 5 = 20% on the official evaluate.py.
| Instance | Agent finished | Time (s) | Cost (USD) | Turns | Tool calls | Eval |
|---|---|---|---|---|---|---|
| playbook001 | OK | 82 | $0.28 | 30 | Bash 12, Read 5, Write 1, Edit 1 | ✅ 1.0 |
| provider001 | OK | 289 | $0.57 | 44 | Bash 16, Read 7, Edit 1, Write 2 | ❌ 0 |
| asana001 | OK | 181 | $0.54 | 44 | Bash 24, Read 1, Write 2, Edit 1 | ❌ 0 |
| shopify001 | OK | 133 | $0.50 | 41 | Bash 13, Read 15, Write 2 | ❌ 0 |
| asset001 | OK | 189 | $0.42 | 44 | Bash 14, Read 14, Write 2 | ❌ 0 |
Total spend on the pilot: ≈ $2.30. Mean: ~175 s, ~$0.46, ~40 turns.
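The summary figures can be reproduced from the table with a few lines of Python; the per-instance numbers are copied from the rows above and nothing else is assumed.

```python
# (seconds, cost in USD, turns) per pilot instance, copied from the table above.
pilot = {
    "playbook001": (82, 0.28, 30),
    "provider001": (289, 0.57, 44),
    "asana001": (181, 0.54, 44),
    "shopify001": (133, 0.50, 41),
    "asset001": (189, 0.42, 44),
}

n = len(pilot)
total_cost = sum(cost for _, cost, _ in pilot.values())
mean_seconds = sum(sec for sec, _, _ in pilot.values()) / n
mean_cost = total_cost / n
mean_turns = sum(turns for _, _, turns in pilot.values()) / n

print(f"total ${total_cost:.2f}, mean {mean_seconds:.0f} s, ${mean_cost:.2f}, {mean_turns:.1f} turns")
# total $2.31, mean 175 s, $0.46, 40.6 turns
```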
All five agent runs finished cleanly: dbt build was green and every target
table materialised in the DuckDB. The four failures appear to be value-level
mismatches against gold (column ordering, tie-breaks, NULL handling, or
precision/rounding); none has been investigated in depth. That is the failure
mode that richer ktx context (real column dtypes, sample values, primary keys,
SL measures) is expected to address.
For reference, GPT-4o reported ~10% and o1-preview ~17%, so a 20% on n=5 is roughly in band but the sample is far too small to claim a delta.
Why playbook001 passed
The wiki page cpa-roas-definitions.md pre-derived:
CPA = total_spend / attribution_points (column: cost_per_acquisition)
ROAS = attribution_revenue / total_spend (column: return_on_advertising_spend)
Grain: (date_month, utm_source)
The agent read this page (via KTX_PROJECT_DIR=.. ktx wiki list --json
then plain Read on ../wiki/global/cpa-roas-definitions.md), wrote the
missing models/cpa_and_roas.sql directly from it, and dbt build
produced the correct table.
Why the others failed (best guesses, not investigated deeply)
- provider001: gold checks provider table columns [0,1,2,5,6,7,9,10,11,12,13] and specialty_mapping columns [0,1]. All 7 tables are produced with the right schema; the tie-break logic for "most specific specialty" diverges from gold.
- asana001: 95 models materialised, 55 tests passed; the gold compares asana__team [0..9] and asana__user [0,1,2] and our values differ on one or more aggregations (open vs completed task counts, avg close time).
- shopify001 and asset001: similar pattern, structure right, values off.
8. Hypotheses for the next agent
In rough order of expected impact:
- DuckDB connector (above) so ktx scan and ktx ingest together emit semantic-layer/<conn>/<source>.yaml with real columns, types, primary keys, sample values, and (if enabled) relationship proposals. Expose those to the sub-agent via either:
  - ktx sl read <source> calls from Bash, or
  - the ktx mcp stdio server attached via claude --mcp-config.
- Verification step in the system prompt: currently the agent declares success on dbt build green. Add: "Before declaring success, for every target table run SELECT * FROM <t> ORDER BY 1 LIMIT 5 and sanity-check column count, types, no NaN/NULL in not-null cols, row count > 0; also compare the produced column names with the column list in the schema.yml / wiki / sl source." Cheap fix; should turn some of the value-mismatch fails into passes (or into productive iteration).
- dbt run-stage tests: Spider2 examples often ship tests/; have the agent run dbt test after dbt build and treat any new test failures as a signal to revise. Some examples actually have gold-verifying tests in the project itself.
- Try Opus for hard cases: the orchestrator passes --model sonnet; flipping to opus on the retries of failed instances may recover some of the value-mismatch tasks. Cost goes up ~5×.
- ktx scan query-history off: currently query-history is skipped because the dbt adapter doesn't expose history. Once a warehouse connection exists, leave it skipped (DuckDB has no useful history for these one-shot DBs).
- Parallelism: claude is rate-limit-sensitive but the orchestrator is fully sequential. Two or three workers via concurrent.futures would cut wall-clock to ~1h for the full 68.
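The parallelism idea could be sketched roughly as follows. run_parallel and the run_instance callable are hypothetical names, assuming the existing per-instance flow can be wrapped in a function that takes an instance id and returns a result dict. Threads are enough here because each worker would mostly wait on the claude and dbt subprocesses.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Iterable


def run_parallel(
    instance_ids: Iterable[str],
    run_instance: Callable[[str], dict],
    max_workers: int = 3,
) -> dict[str, dict]:
    """Run run_instance(id) for every id with a small worker pool; collect results by id."""
    results: dict[str, dict] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(run_instance, iid): iid for iid in instance_ids}
        # as_completed yields in the calling thread, so metadata rewrites can
        # safely happen inside this loop without extra locking.
        for future in as_completed(futures):
            iid = futures[future]
            try:
                results[iid] = future.result()
            except Exception as exc:  # keep going so one failure does not stop the batch
                results[iid] = {"instance_id": iid, "error": repr(exc)}
    return results
```

Keeping max_workers at two or three is a deliberate choice given the rate-limit sensitivity noted above; a larger pool would likely trade wall-clock gains for throttling.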
9. Resuming after the DuckDB connector lands
Concrete steps for the next agent:
- Confirm the connector is wired up:
  cd /Users/klo-dev/conductor/workspaces/ktx/santiago
  pnpm run build
  pnpm run ktx -- dev schema | jq '.properties.connections.additionalProperties.oneOf[].properties.driver.const' | sort -u
  # should include "duckdb"
- Update the orchestrator's ktx.yaml template in /Users/klo-dev/work/spider2-ktx/orchestrator.py (write_ktx_yaml). Pseudocode:
  db_name = discover_duckdb_name(ws / "dbt") # e.g. "playbook.duckdb"
  ...
  connections:
    dbt_project:
      driver: dbt
      source_dir: {ws}/dbt
    warehouse:
      driver: duckdb
      path: {ws}/dbt/{db_name}
  Also re-enable scan relationship discovery if it gives useful output:
  scan:
    enrichment: { mode: deterministic }
    relationships:
      enabled: true
      llmProposals: true
- Verify on a known-passing instance first (playbook001) to make sure the dbt+warehouse combo still emits the same wiki pages it did before, plus new SL YAML, and the score stays at 1.0:
  cd /Users/klo-dev/work/spider2-ktx
  source .venv/bin/activate
  python orchestrator.py -n playbook001 --force --budget 3 --timeout 1500
  cd Spider2/spider2-dbt/evaluation_suite
  python evaluate.py --result_dir ../../../results --gold_dir ./gold
- Optionally improve the system prompt in orchestrator.SYSTEM_PROMPT to instruct the agent to use SL tools:
  - ktx sl list --json
  - ktx sl read <source>
  - ktx sl query --connection-id warehouse --measure ...
- Re-run a small batch with diverse failures (provider001, asana001, shopify001, asset001) to see whether SL access lifts those scores from 0 → 1. If it moves the needle, run the full 68:
  python orchestrator.py --budget 3 --timeout 1500 --evaluate
  Sequential 68 × ~3 min ≈ 3.5 h. Cost: ~$25 at current rates; extrapolating the pilot mean of ~$0.46 per instance gives ≈ $31.
- Write the result back: append a section to this doc with the new score and a one-line note per failing instance, so we accumulate evidence over iterations rather than losing it.
10. Misc references
- KTX MCP tools (see packages/context/src/mcp/context-tools.ts): connection_list, wiki_search, wiki_read, sl_read_source, sl_query, entity_details, dictionary_search, discover_data, sql_execution, memory_ingest, memory_ingest_status. sql_execution will work for DuckDB once the connector exists; today it has no transport for it.
- The sqlite connector at packages/connector-sqlite/src/connector.ts is the closest template for DuckDB.
- packages/context/src/ingest/adapters/dbt/ is the dbt adapter that generates the wiki pages: parse.ts reads dbt_project.yml, schema.yml, models; chunk.ts breaks them into work units; dbt.adapter.ts orchestrates.
- Evaluator code is at Spider2/spider2-dbt/evaluation_suite/{evaluate.py, eval_utils.py}. duckdb_match is the only function that matters here.
- Spider2 paper: https://arxiv.org/abs/2411.07763
11. Quick sanity checks for a fresh agent
# Toolchain
which node pnpm uv claude
source /Users/klo-dev/work/spider2-ktx/.venv/bin/activate && python -c "import dbt, duckdb, anthropic"
# KTX CLI build still works
cd /Users/klo-dev/conductor/workspaces/ktx/santiago
pnpm run ktx -- --help
# Orchestrator runnable
cd /Users/klo-dev/work/spider2-ktx
source .venv/bin/activate
python orchestrator.py -h
# Previous results still reproduce the pilot score (playbook001 passes)
cd Spider2/spider2-dbt/evaluation_suite
python evaluate.py --result_dir ../../../results --gold_dir ./gold
# expects: 0.2 1 5 (score, passes, total in the current state)
If any of those fail before you do anything else, the environment has drifted — fix that before adding the connector.
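For use from inside the orchestrator rather than a shell, the toolchain check could be expressed as a small Python helper. missing_tools is a hypothetical name; the tool list is the one from the which command above.

```python
import shutil


def missing_tools(tools=("node", "pnpm", "uv", "claude")) -> list[str]:
    """Return the tools from the list that are not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    print("missing tools:", ", ".join(missing) if missing else "none")
```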