ktx/docs/spider2-dbt-benchmark.md
2026-05-18 14:12:59 +02:00


Spider2-DBT × KTX benchmarking — handoff

This document is the state of the Spider2-DBT benchmarking experiment as of 2026-05-18. It is written so that a fresh agent can pick up the work, particularly after adding a DuckDB scan connector to KTX.


1. What we are benchmarking

Spider2-SQL is an ICLR 2025 oral benchmark for "real-world enterprise text-to-SQL workflows". It has three tracks:

  • Spider2.0-Snow — 547 examples, Snowflake.
  • Spider2.0-Lite — 547 examples, BigQuery / Snowflake / SQLite.
  • Spider2.0-DBT — 68 examples, DuckDB-backed dbt projects.

We are participating in the DBT track. Public baselines:

| Method | Spider2-SQL |
| --- | --- |
| GPT-4o | ~10% |
| o1-preview | ~17% |
| Top published | ~30-40% |

Repo: https://github.com/xlang-ai/Spider2. The DBT track lives under spider2-dbt/.

Task format

Each instance is a self-contained dbt project (dbt_project.yml, profiles.yml, models/, sometimes seeds/, macros/, dbt_packages/) plus a .duckdb file pre-loaded with raw source tables. The instruction is a single underspecified natural-language sentence, e.g.:

"Complete the project of this database to show the metrics of each traffic source, I believe every touchpoint in the conversion path is equally important, please choose the most suitable attribution method."

The agent must edit/add models and run dbt build until the warehouse contains the required tables. All 68 instances are evaluated with duckdb_match: the official evaluator diffs specific columns of specific tables in the agent's DuckDB against a gold DuckDB. A pass is row-set match on condition_cols for each condition_tab.
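To make the pass criterion concrete, here is a simplified sketch of an order-insensitive comparison over selected columns. It is not the official duckdb_match from eval_utils.py: the name rows_match, the float tolerance, and the sort-then-compare strategy are assumptions for illustration, and rows are passed as plain tuples rather than read from the two DuckDB files.

```python
import math


def _sort_key(value):
    # None sorts last and strings sort after numbers, so sorted() never
    # has to compare None (or a str) against a number.
    return (value is None, isinstance(value, str), 0 if value is None else value)


def _values_equal(a, b, tol):
    if a is None or b is None:
        return a is None and b is None
    if isinstance(a, (int, float)) and isinstance(b, (int, float)):
        return math.isclose(a, b, abs_tol=tol)
    return a == b


def rows_match(pred_rows, gold_rows, cols, tol=1e-2):
    """True when the projected columns hold the same rows, ignoring row order.

    `cols` plays the role of condition_cols (column indexes to compare).
    The tolerance is an assumed value, not taken from the evaluator.
    """
    if len(pred_rows) != len(gold_rows):
        return False

    def project(rows):
        projected = [tuple(row[c] for c in cols) for row in rows]
        return sorted(projected, key=lambda t: tuple(_sort_key(v) for v in t))

    return all(
        _values_equal(a, b, tol)
        for p, g in zip(project(pred_rows), project(gold_rows))
        for a, b in zip(p, g)
    )
```

Columns outside `cols` are ignored entirely, which is why an instance can pass with extra or differently named columns as long as the checked positions agree.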

spider2-dbt.jsonl (instructions) and evaluation_suite/gold/spider2_eval.jsonl (evaluator config + gold DuckDBs) are both clone-time artifacts.


2. On-disk layout

Everything benchmark-related lives outside this repo at /Users/klo-dev/work/spider2-ktx/:

/Users/klo-dev/work/spider2-ktx/
├── .venv/                              # Python 3.11 (uv-managed)
├── Spider2/                            # cloned `git clone xlang-ai/Spider2`
│   └── spider2-dbt/
│       ├── examples/                   # 69 dirs, 68 are in spider2-dbt.jsonl
│       │   ├── playbook001/dbt_project.yml ...
│       │   └── ...
│       ├── examples/spider2-dbt.jsonl  # 68 instance instructions
│       ├── evaluation_suite/
│       │   ├── evaluate.py             # official scorer
│       │   ├── eval_utils.py           # duckdb_match, table_match, ...
│       │   └── gold/                   # gold .duckdb per instance
│       └── setup.py                    # unpacks DBT_start_db.zip + dbt_gold.zip
├── orchestrator.py                     # main runner (see §4)
├── agent_prompt.md                     # system prompt written by orchestrator
├── work/                               # per-instance workspaces (ktx + dbt)
│   └── <instance_id>/
│       ├── ktx.yaml                    # generated by orchestrator
│       ├── .ktx/                       # ktx state (sqlite, git, cache)
│       ├── wiki/global/*.md            # OUTPUT of `ktx ingest dbt_project`
│       ├── semantic-layer/             # empty today (no DuckDB connector)
│       └── dbt/                        # copy of Spider2/spider2-dbt/examples/<id>
│           ├── dbt_project.yml
│           ├── profiles.yml
│           ├── models/...
│           └── <name>.duckdb
├── results/                            # submission folder
│   ├── results_metadata.jsonl
│   └── <instance_id>/<name>.duckdb
└── logs/<instance_id>/
    ├── ktx-init.log
    ├── ktx-ingest.log
    ├── claude.log                      # stderr from the sub-agent
    └── claude-stream.jsonl             # full structured trace

The two source-data zips (~1 GB) were pulled with gdown from the Drive IDs in Spider2/spider2-dbt/setup.py and then setup.py was run to unpack them in place. No need to re-do that step.


3. Current ktx.yaml per instance

Generated by orchestrator.write_ktx_yaml(). Same template for every workspace, with the source_dir absolute path swapped in:

connections:
  dbt_project:
    driver: dbt
    source_dir: /Users/klo-dev/work/spider2-ktx/work/<id>/dbt
storage:
  state: sqlite
  search: sqlite-fts5
  git:
    auto_commit: false
    author: ktx <ktx@example.com>
llm:
  provider:
    backend: claude-code      # uses local Claude Code OAuth — no API key
  models:
    default: sonnet
    triage: haiku
    candidateExtraction: sonnet
    curator: sonnet
    reconcile: sonnet
    repair: sonnet
ingest:
  adapters: [dbt]
  embeddings:
    backend: deterministic
    model: deterministic
    dimensions: 8
  workUnits:
    stepBudget: 40
    maxConcurrency: 1
    failureMode: continue
agent:
  run_research:
    enabled: false
    max_iterations: 20
    default_toolset: [sl_query, wiki_search, sl_read_source]
memory:
  auto_commit: false
scan:
  enrichment: { mode: none }
  relationships:
    enabled: false        # disabled — no warehouse to relate against
    llmProposals: false

Notes / gotchas learned the hard way:

  • source_dir must be absolute and must not be the same as --project-dir (the dbt adapter copies the dir into .ktx/cache/local-ingest/ and refuses to recursively copy a parent into itself). Hence the work/<id>/dbt/ sub-structure.
  • llm.provider.backend: none (the dev init default) makes ktx ingest on the dbt adapter fail with "requires llm.provider.backend: anthropic, vertex, gateway, or claude-code". The dbt adapter is LLM-driven.
  • llm.models.default is required whenever provider.backend != none.
  • claude-code backend reuses the local Claude Code OAuth session, so no ANTHROPIC_API_KEY env var is needed.
  • --yes and --no-input are mutually exclusive on ktx ingest.
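The first gotcha can be checked before ktx is ever invoked. The following is a minimal sketch, assuming a hypothetical helper check_source_dir that is not part of the orchestrator or of ktx; it only encodes the two path rules stated above.

```python
from pathlib import Path


def check_source_dir(source_dir: str, project_dir: str) -> list:
    """Return a list of problems; an empty list means the layout looks safe."""
    problems = []
    src = Path(source_dir)
    if not src.is_absolute():
        problems.append("source_dir must be an absolute path")
        return problems

    src = src.resolve()
    proj = Path(project_dir).resolve()
    # The dbt adapter copies source_dir into <project>/.ktx/cache/local-ingest/,
    # so source_dir must be neither the project dir nor one of its parents.
    if src == proj or src in proj.parents:
        problems.append("source_dir must not be the project dir or a parent of it")
    return problems
```

With the layout from section 2, `check_source_dir(".../work/<id>/dbt", ".../work/<id>")` returns an empty list, because the dbt project sits below the ktx project directory.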

4. Orchestrator

/Users/klo-dev/work/spider2-ktx/orchestrator.py — the one moving part.

Per-instance flow:

  1. make_workspace(id) — copy Spider2/spider2-dbt/examples/<id>/ into work/<id>/dbt/.
  2. ktx dev init <work/<id>> and write the ktx.yaml above.
  3. ktx ingest dbt_project --plain --yes — runs the LLM-driven dbt adapter; output lands in work/<id>/wiki/global/*.md.
  4. Spawn claude --print --permission-mode bypassPermissions ... with
    • cwd = work/<id>/dbt (the agent works inside the dbt project)
    • --add-dir work/<id> (so the agent can read the wiki)
    • --allowedTools Bash,Edit,Read,Write,Glob,Grep,WebFetch,TodoWrite
    • --system-prompt from SYSTEM_PROMPT (see agent_prompt.md)
    • the prompt is the Spider2 instruction.
  5. Stream the agent's JSONL events to logs/<id>/claude-stream.jsonl, capture the final result message as a summary string.
  6. collect_result() — copy the largest *.duckdb in work/<id>/dbt/ into results/<id>/<name>.duckdb and add an entry {instance_id, answer_type: "file", answer_or_path: "<name>.duckdb"} to results/results_metadata.jsonl. Metadata is re-written after every instance, so partial runs are recoverable.
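Step 6 can be sketched as follows. This is an illustration of the described behaviour, not the orchestrator's actual code: the signature and the in-memory `metadata` dict are assumptions.

```python
import json
import shutil
from pathlib import Path
from typing import Optional


def collect_result(instance_id: str, dbt_dir: Path, results_dir: Path,
                   metadata: dict) -> Optional[Path]:
    """Copy the largest *.duckdb into results/<id>/ and rewrite the metadata file."""
    candidates = list(dbt_dir.glob("*.duckdb"))
    if not candidates:
        return None
    largest = max(candidates, key=lambda p: p.stat().st_size)

    dest_dir = results_dir / instance_id
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / largest.name
    shutil.copy2(largest, dest)

    metadata[instance_id] = {
        "instance_id": instance_id,
        "answer_type": "file",
        "answer_or_path": largest.name,
    }
    # Rewrite the whole file after every instance so a partial run
    # still leaves a complete, valid JSONL file behind.
    lines = [json.dumps(entry) for entry in metadata.values()]
    (results_dir / "results_metadata.jsonl").write_text("\n".join(lines) + "\n")
    return dest
```

Keying `metadata` by instance id means a re-run of the same instance replaces its entry instead of appending a duplicate line.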

CLI:

cd /Users/klo-dev/work/spider2-ktx
source .venv/bin/activate

# One specific instance
python orchestrator.py -n playbook001 --force --budget 3 --timeout 1500

# All 68
python orchestrator.py --budget 3 --timeout 1500 --evaluate

# Skip ingest (when workspace already has wiki) — speeds re-runs
python orchestrator.py -n provider001 --skip-ingest

Flags:

| Flag | Default | Meaning |
| --- | --- | --- |
| -n, --instance | none | Repeatable; restrict to listed instance ids |
| -l, --limit | none | First N from spider2-dbt.jsonl |
| --model | sonnet | Claude Code model alias |
| --budget | 4.0 | --max-budget-usd per instance |
| --timeout | 1800 | Wall-clock seconds per instance |
| --force | off | Wipe and recreate workspace |
| --skip-ingest | off | Reuse existing wiki |
| --evaluate | off | Run evaluate.py at the end |
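For orientation, the flag table corresponds to an argparse definition along these lines. This is a reconstruction from the table, assuming argparse is used; the real parser in orchestrator.py may differ in help text and types.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    p = argparse.ArgumentParser(prog="orchestrator.py")
    # action="append" makes -n repeatable: -n a -n b -> ["a", "b"].
    p.add_argument("-n", "--instance", action="append", default=None,
                   help="restrict to the listed instance ids")
    p.add_argument("-l", "--limit", type=int, default=None,
                   help="first N instances from spider2-dbt.jsonl")
    p.add_argument("--model", default="sonnet", help="Claude Code model alias")
    p.add_argument("--budget", type=float, default=4.0,
                   help="passed as --max-budget-usd per instance")
    p.add_argument("--timeout", type=int, default=1800,
                   help="wall-clock seconds per instance")
    p.add_argument("--force", action="store_true", help="wipe and recreate workspace")
    p.add_argument("--skip-ingest", action="store_true", help="reuse existing wiki")
    p.add_argument("--evaluate", action="store_true", help="run evaluate.py at the end")
    return p
```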

Scoring:

cd /Users/klo-dev/work/spider2-ktx/Spider2/spider2-dbt/evaluation_suite
python evaluate.py \
    --result_dir /Users/klo-dev/work/spider2-ktx/results \
    --gold_dir ./gold

The official evaluator prints score = passes / total, and one line per passing instance id.


5. What ktx currently provides to the agent

ktx ingest dbt_project --plain --yes on a Spider2 example emits only wiki pages under work/<id>/wiki/global/*.md. There are no semantic-layer entities: ktx sl list returns items: [].

Example, for playbook001:

work/playbook001/wiki/global/
├── acme-dbt-project.md         # project overview: profile, sources, models
└── cpa-roas-definitions.md     # exact CPA & ROAS formulas, grain, columns

For asset001:

work/asset001/wiki/global/
├── dbt-asset-project-overview.md
├── bar-quotes.md
└── book-value.md

The wiki pages do carry high-signal information for these tasks — they pre-digest the dbt project into prose with formulas, grain, columns, and unverified-vs-verified annotations. That's what made playbook001 score 1.0: the wiki said CPA = total_spend / attribution_points, ROAS = attribution_revenue / total_spend, grain (date_month, utm_source), and the agent transcribed that into cpa_and_roas.sql directly.

The wiki itself flags the missing piece:

"Run ktx scan on the DuckDB connection to populate the warehouse schema and enable SL source creation for these tables."

Which brings us to:


6. The DuckDB connector gap

ktx ships connectors for: postgres / postgresql / mysql / snowflake / bigquery / sqlite / sqlserver / clickhouse. There is no DuckDB scan connector. References:

  • packages/cli/src/connection.test.ts:494, where driver: duckdb is asserted to be unknown by createKtxCliScanConnector.
  • packages/context/src/sl/local-query.ts:59, where DUCKDB: 'duckdb' is a SQL dialect constant for query generation, not a connector.
  • packages/context/src/mcp/local-project-ports.ts:32 — same: dialect hint, not a connector.

Consequence: with the current setup we can't add a warehouse connection that introspects each example's .duckdb. The dbt adapter falls back to wiki-only output, which is why semantic-layer/ stays empty.

The plan you're about to act on

Add packages/connector-duckdb/ modeled on packages/connector-sqlite/:

| File | Source to copy from | Adapt |
| --- | --- | --- |
| package.json | connector-sqlite/package.json | dep better-sqlite3 → duckdb (or @duckdb/node-api) |
| src/dialect.ts | connector-sqlite/src/dialect.ts | Quote with "; map types: BIGINT → number, VARCHAR → string, TIMESTAMP → time, etc. |
| src/connector.ts | connector-sqlite/src/connector.ts | Replace Database with the DuckDB equivalent. Use information_schema instead of sqlite_master/PRAGMA table_info. For FKs DuckDB also has information_schema.referential_constraints + key_column_usage. Estimated row counts → SELECT estimated_size FROM duckdb_tables(). |
| src/index.ts | connector-sqlite/src/index.ts | Re-export, plus isKtxDuckDbConnectionConfig |
| src/connector.test.ts + dialect.test.ts | sqlite equivalents | Mirror tests; the sqlite ones are a good template for what to cover |
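The dialect row can be prototyped before touching the TypeScript package. The sketch below is a Python illustration only, not the connector code: the three mappings BIGINT → number, VARCHAR → string, and TIMESTAMP → time come from the table, while the remaining type names, the "boolean" category, and the string fallback are assumptions to be checked against what connector-sqlite's dialect actually emits.

```python
import re

# Assumed groupings of DuckDB type names; extend as needed.
NUMBER_TYPES = {"TINYINT", "SMALLINT", "INTEGER", "BIGINT", "HUGEINT",
                "FLOAT", "DOUBLE", "DECIMAL"}
TIME_TYPES = {"DATE", "TIME", "TIMESTAMP", "TIMESTAMP WITH TIME ZONE"}


def map_duckdb_type(native_type: str) -> str:
    """Map a DuckDB column type such as 'DECIMAL(18,3)' to a coarse category."""
    # Drop precision/length arguments: DECIMAL(18,3) -> DECIMAL.
    base = re.sub(r"\(.*?\)", "", native_type).strip().upper()
    if base in NUMBER_TYPES:
        return "number"
    if base in TIME_TYPES:
        return "time"
    if base == "BOOLEAN":
        return "boolean"  # assumed category, not named in the table above
    return "string"       # assumed fallback for VARCHAR and anything unknown


def quote_identifier(name: str) -> str:
    """Quote with double quotes, doubling any embedded quote character."""
    return '"' + name.replace('"', '""') + '"'
```

The same cases (parameterised types, mixed case, embedded quotes) are reasonable candidates for dialect.test.ts.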

Then wire it up:

  1. packages/cli/src/local-scan-connectors.ts — add a branch for driver === 'duckdb', mirroring the sqlite branch.
  2. packages/context/src/project/driver-schemas.ts — extend KTX_WAREHOUSE_DRIVERS with duckdb. Connection config takes the same shape as sqlite (path or url).
  3. Add to pnpm-workspace.yaml if it isn't auto-discovered.
  4. pnpm install && pnpm --filter @ktx/connector-duckdb run build && pnpm --filter @ktx/cli run build.

Smoke test on playbook001:

cd /Users/klo-dev/work/spider2-ktx/work/playbook001
# edit ktx.yaml — add a duckdb connection pointing at the warehouse:
#   connections:
#     warehouse:
#       driver: duckdb
#       path: /Users/klo-dev/work/spider2-ktx/work/playbook001/dbt/playbook.duckdb
node /Users/klo-dev/conductor/workspaces/ktx/santiago/packages/cli/dist/bin.js \
    connection test warehouse
node ... scan warehouse           # populates raw-sources/
node ... ingest dbt_project --plain --yes   # should now write semantic-layer/*.yaml
node ... sl list --json

After that, update orchestrator.write_ktx_yaml() to also emit a warehouse connection per instance, pointing at work/<id>/dbt/<name>.duckdb. The <name> differs per instance (e.g. playbook.duckdb, asset.duckdb); the orchestrator already has discover_duckdb_name() for that.
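A minimal sketch of that orchestrator change, assuming write_ktx_yaml builds the file from a string template. The body of discover_duckdb_name shown here is a hypothetical stand-in (the real helper already exists in the orchestrator and may choose the file differently), and render_connections is an illustrative name.

```python
from pathlib import Path


def discover_duckdb_name(dbt_dir: Path) -> str:
    # Hypothetical stand-in: pick the largest *.duckdb file in the dbt project.
    files = list(dbt_dir.glob("*.duckdb"))
    if not files:
        raise FileNotFoundError(f"no .duckdb file in {dbt_dir}")
    return max(files, key=lambda p: p.stat().st_size).name


def render_connections(ws: Path) -> str:
    """Render the connections: block with both the dbt and the warehouse entry."""
    dbt_dir = ws / "dbt"
    db_name = discover_duckdb_name(dbt_dir)
    return (
        "connections:\n"
        "  dbt_project:\n"
        "    driver: dbt\n"
        f"    source_dir: {dbt_dir}\n"
        "  warehouse:\n"
        "    driver: duckdb\n"
        f"    path: {dbt_dir / db_name}\n"
    )
```

Pass an absolute workspace path so that source_dir stays absolute, as required by the gotchas in section 3.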


7. Results so far (5-instance pilot)

Final score: 1 / 5 = 20% on the official evaluate.py.

| Instance | Agent finished | Time (s) | Cost (USD) | Turns | Tool calls | Eval |
| --- | --- | --- | --- | --- | --- | --- |
| playbook001 | OK | 82 | $0.28 | 30 | Bash 12, Read 5, Write 1, Edit 1 | 1.0 |
| provider001 | OK | 289 | $0.57 | 44 | Bash 16, Read 7, Edit 1, Write 2 | 0 |
| asana001 | OK | 181 | $0.54 | 44 | Bash 24, Read 1, Write 2, Edit 1 | 0 |
| shopify001 | OK | 133 | $0.50 | 41 | Bash 13, Read 15, Write 2 | 0 |
| asset001 | OK | 189 | $0.42 | 44 | Bash 14, Read 14, Write 2 | 0 |

Total spend on the pilot: ≈ $2.30. Mean: ~175 s, ~$0.46, ~40 turns.
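The summary figures can be re-derived from the per-instance rows, which is handy when new rows are appended later. The numbers below are copied from the pilot table; only the variable names are introduced here.

```python
# (time_s, cost_usd, turns, eval_score) per instance, from the pilot table.
pilot = {
    "playbook001": (82, 0.28, 30, 1.0),
    "provider001": (289, 0.57, 44, 0.0),
    "asana001": (181, 0.54, 44, 0.0),
    "shopify001": (133, 0.50, 41, 0.0),
    "asset001": (189, 0.42, 44, 0.0),
}

n = len(pilot)
total_cost = sum(cost for _, cost, _, _ in pilot.values())
mean_time = sum(t for t, _, _, _ in pilot.values()) / n
mean_cost = total_cost / n
mean_turns = sum(turns for _, _, turns, _ in pilot.values()) / n
score = sum(s for _, _, _, s in pilot.values()) / n

print(f"total ${total_cost:.2f}")   # total $2.31
print(f"mean time {mean_time:.1f} s, mean cost ${mean_cost:.2f}, "
      f"mean turns {mean_turns:.1f}, score {score:.0%}")
```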

All five agent runs finished cleanly: dbt build green, every target table materialised in the DuckDB. The four failures are value-level mismatches: column orderings, tie-breaks, NULL handling, or precision/rounding diverging from gold. That's exactly the failure mode that richer ktx context (real column dtypes, sample values, primary keys, SL measures) should address.

For reference, GPT-4o reported ~10% and o1-preview ~17%, so a 20% on n=5 is roughly in band but the sample is far too small to claim a delta.

Why playbook001 passed

The wiki page cpa-roas-definitions.md pre-derived:

CPA  = total_spend / attribution_points          (column: cost_per_acquisition)
ROAS = attribution_revenue / total_spend         (column: return_on_advertising_spend)
Grain: (date_month, utm_source)

The agent read this page (via KTX_PROJECT_DIR=.. ktx wiki list --json then plain Read on ../wiki/global/cpa-roas-definitions.md), wrote the missing models/cpa_and_roas.sql directly from it, and dbt build produced the correct table.
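As a worked example of the two formulas, here is a small sketch. Returning None when a denominator is zero is an assumption made for the illustration; the wiki page and the gold table may handle that case differently.

```python
from typing import Optional


def safe_div(numerator: float, denominator: float) -> Optional[float]:
    # Assumed convention: an undefined ratio becomes None (NULL in SQL terms).
    return None if not denominator else numerator / denominator


def cpa_and_roas(total_spend: float, attribution_points: float,
                 attribution_revenue: float) -> dict:
    """Compute both metrics for one (date_month, utm_source) row."""
    return {
        "cost_per_acquisition": safe_div(total_spend, attribution_points),
        "return_on_advertising_spend": safe_div(attribution_revenue, total_spend),
    }


# 100 spent, 4 attribution points, 250 attributed revenue.
print(cpa_and_roas(100.0, 4.0, 250.0))
# {'cost_per_acquisition': 25.0, 'return_on_advertising_spend': 2.5}
```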

Why the others failed (best guesses, not investigated deeply)

  • provider001: gold checks provider table columns [0,1,2,5,6,7,9,10,11,12,13] and specialty_mapping columns [0,1]. All 7 tables are produced with the right schema; the tie-break logic for "most specific specialty" diverges from gold.
  • asana001: 95 models materialised, 55 tests passed; the gold compares asana__team [0..9] and asana__user [0,1,2] and our values differ on one or more aggregations (open vs completed task counts, avg close time).
  • shopify001 and asset001: similar pattern — structure right, values off.

8. Hypotheses for the next agent

In rough order of expected impact:

  1. DuckDB connector (above) so ktx scan and ktx ingest together emit semantic-layer/<conn>/<source>.yaml with real columns, types, primary keys, sample values, and (if enabled) relationship proposals. Expose those to the sub-agent via either:

    • ktx sl read <source> calls from Bash, or
    • the ktx mcp stdio server attached via claude --mcp-config.
  2. Verification step in the system prompt — currently the agent declares success on dbt build green. Add: "Before declaring success, for every target table run SELECT * FROM <t> ORDER BY 1 LIMIT 5 and sanity-check column count, types, no NaN/NULL in not-null cols, row count > 0; also compare the produced column names with the column list in the schema.yml / wiki / sl source." Cheap fix; should turn some of the value-mismatch fails into passes (or into productive iteration).

  3. dbt run-stage tests: Spider2 examples often ship tests/; have the agent run dbt test after dbt build and treat any new test failures as a signal to revise. Some examples ship gold-verifying tests in the project itself.

  4. Try Opus for hard cases — the orchestrator passes --model sonnet; flipping to opus on the retries of failed instances may recover some of the value-mismatch tasks. Cost goes up ~5×.

  5. ktx scan query-history off — currently query-history is skipped because the dbt adapter doesn't expose history. Once a warehouse connection exists, leave it skipped (DuckDB has no useful history for these one-shot DBs).

  6. Parallelism: claude is rate-limit-sensitive, but the orchestrator is fully sequential. Two or three workers via concurrent.futures would cut wall-clock for the full 68 from ~3.5 h to roughly 1-2 h.
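The parallelism idea in item 6 could look roughly like this. It is a sketch under assumptions: run_instance stands for whatever per-instance function the orchestrator exposes (hypothetical here), and threads are used because the heavy work happens in ktx and claude subprocesses rather than in Python.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_all(instance_ids, run_instance, max_workers=3):
    """Run instances concurrently; one failure does not stop the others."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(run_instance, iid): iid for iid in instance_ids}
        for future in as_completed(futures):
            iid = futures[future]
            try:
                results[iid] = future.result()
            except Exception as exc:  # keep going, record the error
                results[iid] = f"error: {exc}"
    return results
```

If workers also rewrite results_metadata.jsonl, that write would need a lock (or be moved into the main thread after each future completes) so concurrent rewrites do not clobber each other.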


9. Resuming after the DuckDB connector lands

Concrete steps for the next agent:

  1. Confirm the connector is wired up:

    cd /Users/klo-dev/conductor/workspaces/ktx/santiago
    pnpm run build
    pnpm run ktx -- dev schema | jq '.properties.connections.additionalProperties.oneOf[].properties.driver.const' | sort -u
    # should include "duckdb"
    
  2. Update the orchestrator's ktx.yaml template in /Users/klo-dev/work/spider2-ktx/orchestrator.py (write_ktx_yaml). Pseudocode:

    db_name = discover_duckdb_name(ws / "dbt")     # e.g. "playbook.duckdb"
    ...
    connections:
      dbt_project:
        driver: dbt
        source_dir: {ws}/dbt
      warehouse:
        driver: duckdb
        path: {ws}/dbt/{db_name}
    

    Also re-enable scan relationship discovery if it gives useful output:

    scan:
      enrichment: { mode: deterministic }
      relationships:
        enabled: true
        llmProposals: true
    
  3. Verify on a known-passing instance first (playbook001) to make sure the dbt+warehouse combo still emits the same wiki pages it did before, plus new SL YAML, and the score stays at 1.0:

    cd /Users/klo-dev/work/spider2-ktx
    source .venv/bin/activate
    python orchestrator.py -n playbook001 --force --budget 3 --timeout 1500
    cd Spider2/spider2-dbt/evaluation_suite
    python evaluate.py --result_dir ../../../results --gold_dir ./gold
    
  4. Optionally improve the system prompt in orchestrator.SYSTEM_PROMPT to instruct the agent to use SL tools:

    • ktx sl list --json
    • ktx sl read <source>
    • ktx sl query --connection-id warehouse --measure ...
  5. Re-run a small batch with diverse failures (provider001, asana001, shopify001, asset001) to see whether SL access lifts those scores from 0 → 1. If it moves the needle, run the full 68:

    python orchestrator.py --budget 3 --timeout 1500 --evaluate
    

    Sequential 68 × ~3 min ≈ 3.5 h; at the pilot mean of ~$0.46 per instance that is roughly $30.

  6. Write the result back — append a section to this doc with the new score and a one-line note per failing instance, so we accumulate evidence over iterations rather than losing it.
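For step 5, the re-run command for a batch of failing instances can be assembled from the flags in section 4. The helper name rerun_command is hypothetical; it only builds the argument list and leaves execution (for example via subprocess.run) to the caller.

```python
def rerun_command(failed_ids, budget=3.0, timeout=1500, force=True):
    """Build the orchestrator argv for re-running specific instances."""
    cmd = ["python", "orchestrator.py"]
    for iid in failed_ids:
        cmd += ["-n", iid]          # -n is repeatable
    if force:
        cmd.append("--force")       # recreate workspaces so ingest runs again
    cmd += ["--budget", str(budget), "--timeout", str(timeout), "--evaluate"]
    return cmd


print(" ".join(rerun_command(["provider001", "asana001"])))
# python orchestrator.py -n provider001 -n asana001 --force --budget 3.0 --timeout 1500 --evaluate
```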


10. Misc references

  • KTX MCP tools (see packages/context/src/mcp/context-tools.ts): connection_list, wiki_search, wiki_read, sl_read_source, sl_query, entity_details, dictionary_search, discover_data, sql_execution, memory_ingest, memory_ingest_status. sql_execution will work for DuckDB once the connector exists; today it has no transport for it.
  • The sqlite connector at packages/connector-sqlite/src/connector.ts is the closest template for DuckDB.
  • packages/context/src/ingest/adapters/dbt/ is the dbt adapter that generates the wiki pages — parse.ts reads dbt_project.yml, schema.yml, models; chunk.ts breaks them into work units; dbt.adapter.ts orchestrates.
  • Evaluator code is at Spider2/spider2-dbt/evaluation_suite/{evaluate.py, eval_utils.py}. duckdb_match is the only function that matters here.
  • Spider2 paper: https://arxiv.org/abs/2411.07763

11. Quick sanity checks for a fresh agent

# Toolchain
which node pnpm uv claude
source /Users/klo-dev/work/spider2-ktx/.venv/bin/activate && python -c "import dbt, duckdb, anthropic"

# KTX CLI build still works
cd /Users/klo-dev/conductor/workspaces/ktx/santiago
pnpm run ktx -- --help

# Orchestrator runnable
cd /Users/klo-dev/work/spider2-ktx
source .venv/bin/activate
python orchestrator.py -h

# Previous results still score as before (playbook001 passes; 1 of 5 overall)
cd Spider2/spider2-dbt/evaluation_suite
python evaluate.py --result_dir ../../../results --gold_dir ./gold
# expects: 0.2 1 5 (current state)

If any of those fail before you do anything else, the environment has drifted — fix that before adding the connector.