apunkt/ktx

mirror of https://github.com/Kaelio/ktx.git synced 2026-06-16 08:25:14 +02:00

Andrey Avtomonov e07262b2b9 docs: add spider2 dbt benchmark handoff

2026-05-18 14:12:59 +02:00


Spider2-DBT × KTX benchmarking — handoff

This document is the state of the Spider2-DBT benchmarking experiment as of 2026-05-18. It is written so that a fresh agent can pick up the work, particularly after adding a DuckDB scan connector to KTX.

1. What we are benchmarking

Spider2-SQL is an ICLR 2025 oral benchmark for "real-world enterprise text-to-SQL workflows". It has three tracks:

Spider2.0-Snow — 547 examples, Snowflake.
Spider2.0-Lite — 547 examples, BigQuery / Snowflake / SQLite.
Spider2.0-DBT — 68 examples, DuckDB-backed dbt projects.

We are participating in the DBT track. Public baselines:

Method	Spider2-SQL
GPT-4o	~10%
o1-preview	~17%
Top published	~30–40%

Repo: https://github.com/xlang-ai/Spider2, the DBT track is under spider2-dbt/.

Task format

Each instance is a self-contained dbt project (dbt_project.yml, profiles.yml, models/, sometimes seeds/, macros/, dbt_packages/) plus a .duckdb file pre-loaded with raw source tables. The instruction is a single underspecified natural-language sentence, e.g.:

"Complete the project of this database to show the metrics of each traffic source, I believe every touchpoint in the conversion path is equally important, please choose the most suitable attribution method."

The agent must edit or add models and run dbt build until the warehouse contains the required tables. All 68 instances are evaluated with duckdb_match: the official evaluator compares specific columns of specific tables in the agent's DuckDB against a gold DuckDB. An instance passes when, for each condition_tab, the rows match as a set on that table's condition_cols.
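To make the pass criterion concrete, here is a minimal sketch of a row-set comparison restricted to selected column positions. It only illustrates the idea described above and is not the code in eval_utils.py: the function names are hypothetical, rows are compared as a multiset, and details the official duckdb_match may handle differently (float tolerance, for example) are ignored.

```python
from collections import Counter
from typing import Sequence


def project(rows: Sequence[Sequence[object]], cols: Sequence[int]) -> Counter:
    """Keep only the checked column positions and count each projected row."""
    return Counter(tuple(row[c] for c in cols) for row in rows)


def rows_match(pred_rows, gold_rows, condition_cols) -> bool:
    """True when both tables hold the same rows on the checked columns, in any order."""
    return project(pred_rows, condition_cols) == project(gold_rows, condition_cols)


# Illustrative data only: row order differs, and column 2 differs between the tables.
gold = [("2024-01", "google", 10.0), ("2024-01", "bing", 4.0)]
pred = [("2024-01", "bing", 99.0), ("2024-01", "google", -1.0)]

print(rows_match(pred, gold, condition_cols=[0, 1]))  # True
print(rows_match(pred, gold, condition_cols=[0, 2]))  # False
```

The second call shows why a green dbt build is not enough: the table exists and has the right shape, but a checked column holds different values.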

spider2-dbt.jsonl (instructions) and evaluation_suite/gold/spider2_eval.jsonl (evaluator config + gold DuckDBs) are both clone-time artifacts.

2. On-disk layout

Everything benchmark-related lives outside this repo at /Users/klo-dev/work/spider2-ktx/:

/Users/klo-dev/work/spider2-ktx/
├── .venv/                              # Python 3.11 (uv-managed)
├── Spider2/                            # cloned `git clone xlang-ai/Spider2`
│   └── spider2-dbt/
│       ├── examples/                   # 69 dirs, 68 are in spider2-dbt.jsonl
│       │   ├── playbook001/dbt_project.yml ...
│       │   └── ...
│       ├── examples/spider2-dbt.jsonl  # 68 instance instructions
│       ├── evaluation_suite/
│       │   ├── evaluate.py             # official scorer
│       │   ├── eval_utils.py           # duckdb_match, table_match, ...
│       │   └── gold/                   # gold .duckdb per instance
│       └── setup.py                    # unpacks DBT_start_db.zip + dbt_gold.zip
├── orchestrator.py                     # main runner (see §4)
├── agent_prompt.md                     # system prompt written by orchestrator
├── work/                               # per-instance workspaces (ktx + dbt)
│   └── <instance_id>/
│       ├── ktx.yaml                    # generated by orchestrator
│       ├── .ktx/                       # ktx state (sqlite, git, cache)
│       ├── wiki/global/*.md            # OUTPUT of `ktx ingest dbt_project`
│       ├── semantic-layer/             # empty today (no DuckDB connector)
│       └── dbt/                        # copy of Spider2/spider2-dbt/examples/<id>
│           ├── dbt_project.yml
│           ├── profiles.yml
│           ├── models/...
│           └── <name>.duckdb
├── results/                            # submission folder
│   ├── results_metadata.jsonl
│   └── <instance_id>/<name>.duckdb
└── logs/<instance_id>/
    ├── ktx-init.log
    ├── ktx-ingest.log
    ├── claude.log                      # stderr from the sub-agent
    └── claude-stream.jsonl             # full structured trace

The two source-data zips (~1 GB) were pulled with gdown from the Drive IDs in Spider2/spider2-dbt/setup.py and then setup.py was run to unpack them in place. No need to re-do that step.

3. Current ktx.yaml per instance

Generated by orchestrator.write_ktx_yaml(). Same template for every workspace, with the source_dir absolute path swapped in:

connections:
  dbt_project:
    driver: dbt
    source_dir: /Users/klo-dev/work/spider2-ktx/work/<id>/dbt
storage:
  state: sqlite
  search: sqlite-fts5
  git:
    auto_commit: false
    author: ktx <ktx@example.com>
llm:
  provider:
    backend: claude-code      # uses local Claude Code OAuth — no API key
  models:
    default: sonnet
    triage: haiku
    candidateExtraction: sonnet
    curator: sonnet
    reconcile: sonnet
    repair: sonnet
ingest:
  adapters: [dbt]
  embeddings:
    backend: deterministic
    model: deterministic
    dimensions: 8
  workUnits:
    stepBudget: 40
    maxConcurrency: 1
    failureMode: continue
agent:
  run_research:
    enabled: false
    max_iterations: 20
    default_toolset: [sl_query, wiki_search, sl_read_source]
memory:
  auto_commit: false
scan:
  enrichment: { mode: none }
  relationships:
    enabled: false        # disabled — no warehouse to relate against
    llmProposals: false

Notes / gotchas learned the hard way:

source_dir must be absolute and must not be the same as --project-dir (the dbt adapter copies the dir into .ktx/cache/local-ingest/ and refuses to recursively copy a parent into itself). Hence the work/<id>/dbt/ sub-structure.
llm.provider.backend: none (the dev init default) makes ktx ingest on the dbt adapter fail with "requires llm.provider.backend: anthropic, vertex, gateway, or claude-code". The dbt adapter is LLM-driven.
llm.models.default is required whenever provider.backend != none.
claude-code backend reuses the local Claude Code OAuth session, so no ANTHROPIC_API_KEY env var is needed.
--yes and --no-input are mutually exclusive on ktx ingest.
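The first three gotchas can be checked before ktx is ever invoked. The sketch below is a hypothetical pre-flight helper, not part of ktx or the orchestrator; it assumes the config has already been loaded into a plain dict with the same keys as the YAML above.

```python
from pathlib import Path


def check_ktx_config(cfg: dict, project_dir: str) -> list[str]:
    """Return human-readable problems; an empty list means no known gotcha applies."""
    problems = []
    project = Path(project_dir).resolve()
    for name, conn in cfg.get("connections", {}).items():
        if conn.get("driver") != "dbt":
            continue
        src = Path(conn.get("source_dir", ""))
        if not src.is_absolute():
            problems.append(f"{name}: source_dir must be absolute")
        elif src.resolve() == project:
            problems.append(f"{name}: source_dir must differ from the project dir")
    llm = cfg.get("llm", {})
    backend = llm.get("provider", {}).get("backend", "none")
    if backend == "none":
        problems.append("llm.provider.backend is none; the dbt adapter needs an LLM backend")
    elif not llm.get("models", {}).get("default"):
        problems.append("llm.models.default is required when a backend is set")
    return problems


cfg = {
    "connections": {"dbt_project": {"driver": "dbt", "source_dir": "/tmp/ws/dbt"}},
    "llm": {"provider": {"backend": "claude-code"}, "models": {"default": "sonnet"}},
}
print(check_ktx_config(cfg, "/tmp/ws"))  # []
```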

4. Orchestrator

/Users/klo-dev/work/spider2-ktx/orchestrator.py — the one moving part.

Per-instance flow:

1. make_workspace(id): copy Spider2/spider2-dbt/examples/<id>/ into work/<id>/dbt/.
2. Run ktx dev init <work/<id>> and write the ktx.yaml above.
3. Run ktx ingest dbt_project --plain --yes, which runs the LLM-driven dbt adapter; output lands in work/<id>/wiki/global/*.md.
4. Spawn claude --print --permission-mode bypassPermissions ... with:
   - cwd = work/<id>/dbt (the agent works inside the dbt project)
   - --add-dir work/<id> (so the agent can read the wiki)
   - --allowedTools Bash,Edit,Read,Write,Glob,Grep,WebFetch,TodoWrite
   - --system-prompt from SYSTEM_PROMPT (see agent_prompt.md)
   - the prompt is the Spider2 instruction.
5. Stream the agent's JSONL events to logs/<id>/claude-stream.jsonl and capture the final result message as a summary string.
6. collect_result(): copy the largest *.duckdb in work/<id>/dbt/ into results/<id>/<name>.duckdb and add an entry {instance_id, answer_type: "file", answer_or_path: "<name>.duckdb"} to results/results_metadata.jsonl. Metadata is rewritten after every instance, so partial runs are recoverable.
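The result-collection step can be sketched as follows. This is a reconstruction from the description above, not the orchestrator's actual code: the signature and the `entries` dict that accumulates metadata across instances are assumptions.

```python
import json
import shutil
from pathlib import Path
from typing import Optional


def collect_result(instance_id: str, dbt_dir: Path, results_dir: Path, entries: dict) -> Optional[Path]:
    """Copy the largest .duckdb file into results/<id>/ and rewrite the metadata file."""
    candidates = sorted(dbt_dir.glob("*.duckdb"), key=lambda p: p.stat().st_size)
    if not candidates:
        return None  # the agent left no database behind
    src = candidates[-1]
    dest = results_dir / instance_id / src.name
    dest.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(src, dest)
    entries[instance_id] = {
        "instance_id": instance_id,
        "answer_type": "file",
        "answer_or_path": src.name,
    }
    # Rewrite the whole file each time so an interrupted run keeps earlier results.
    meta = results_dir / "results_metadata.jsonl"
    meta.write_text("".join(json.dumps(e) + "\n" for e in entries.values()))
    return dest
```

Keying `entries` by instance id means a re-run of the same instance overwrites its earlier line instead of appending a duplicate.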

CLI:

cd /Users/klo-dev/work/spider2-ktx
source .venv/bin/activate

# One specific instance
python orchestrator.py -n playbook001 --force --budget 3 --timeout 1500

# All 68
python orchestrator.py --budget 3 --timeout 1500 --evaluate

# Skip ingest (when workspace already has wiki) — speeds re-runs
python orchestrator.py -n provider001 --skip-ingest

Flags:

Flag	Default	Meaning
`-n, --instance`	none	Repeatable; restrict to listed instance ids
`-l, --limit`	none	First N from spider2-dbt.jsonl
`--model`	sonnet	Claude Code model alias
`--budget`	4.0	`--max-budget-usd` per instance
`--timeout`	1800	Wall-clock seconds per instance
`--force`	off	Wipe and recreate workspace
`--skip-ingest`	off	Reuse existing wiki
`--evaluate`	off	Run `evaluate.py` at the end

Scoring:

cd /Users/klo-dev/work/spider2-ktx/Spider2/spider2-dbt/evaluation_suite
python evaluate.py \
    --result_dir /Users/klo-dev/work/spider2-ktx/results \
    --gold_dir ./gold

The official evaluator prints score = passes / total, and one line per passing instance id.

5. What ktx currently provides to the agent

ktx ingest dbt_project --plain --yes on a Spider2 example emits only wiki pages under work/<id>/wiki/global/*.md. There are no semantic-layer entities — ktx sl list returns items: [].

Example, for playbook001:

work/playbook001/wiki/global/
├── acme-dbt-project.md         # project overview: profile, sources, models
└── cpa-roas-definitions.md     # exact CPA & ROAS formulas, grain, columns

For asset001:

work/asset001/wiki/global/
├── dbt-asset-project-overview.md
├── bar-quotes.md
└── book-value.md

The wiki pages do carry high-signal information for these tasks — they pre-digest the dbt project into prose with formulas, grain, columns, and unverified-vs-verified annotations. That's what made playbook001 score 1.0: the wiki said CPA = total_spend / attribution_points, ROAS = attribution_revenue / total_spend, grain (date_month, utm_source), and the agent transcribed that into cpa_and_roas.sql directly.

The wiki itself flags the missing piece:

"Run ktx scan on the DuckDB connection to populate the warehouse schema and enable SL source creation for these tables."

Which brings us to:

6. The DuckDB connector gap

ktx ships connectors for: postgres / postgresql / mysql / snowflake / bigquery / sqlite / sqlserver / clickhouse. There is no DuckDB scan connector. References:

packages/cli/src/connection.test.ts:494 — driver: duckdb is asserted to be unknown by createKtxCliScanConnector.
packages/context/src/sl/local-query.ts:59 — DUCKDB: 'duckdb' is a SQL dialect constant for query generation, not a connector.
packages/context/src/mcp/local-project-ports.ts:32 — same: dialect hint, not a connector.

Consequence: with the current setup we can't add a warehouse connection that introspects each example's .duckdb. The dbt adapter falls back to wiki-only output, which is why semantic-layer/ stays empty.

The plan you're about to act on

Add packages/connector-duckdb/ modeled on packages/connector-sqlite/:

File	Source to copy from	Adapt
`package.json`	`connector-sqlite/package.json`	dep `better-sqlite3` → `duckdb` (or `@duckdb/node-api`)
`src/dialect.ts`	`connector-sqlite/src/dialect.ts`	Quote with `"`; map types: `BIGINT → number`, `VARCHAR → string`, `TIMESTAMP → time`, etc.
`src/connector.ts`	`connector-sqlite/src/connector.ts`	Replace `Database` with the DuckDB equivalent. Use `information_schema` instead of `sqlite_master`/`PRAGMA table_info`. For FKs DuckDB also has `information_schema.referential_constraints` + `key_column_usage`. Estimated row counts → `SELECT estimated_size FROM duckdb_tables()`.
`src/index.ts`	`connector-sqlite/src/index.ts`	Re-export, plus `isKtxDuckDbConnectionConfig`
`src/connector.test.ts` + `dialect.test.ts`	sqlite equivalents	Mirror tests; the sqlite ones are a good template for what to cover
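The real dialect module would be TypeScript, but the type mapping and quoting rules are easy to prototype in the orchestrator's language first. The sketch below assumes a coarse set of logical types; only the BIGINT, VARCHAR, and TIMESTAMP rows come from the plan above, and the remaining entries, the `boolean` bucket, and the `string` fallback are assumptions to be checked against connector-sqlite/src/dialect.ts.

```python
import re

# Assumed mapping from DuckDB type names to coarse logical types.
_TYPE_MAP = {
    "number": {"TINYINT", "SMALLINT", "INTEGER", "BIGINT", "HUGEINT", "FLOAT", "DOUBLE", "DECIMAL"},
    "string": {"VARCHAR", "TEXT", "UUID"},
    "time": {"DATE", "TIME", "TIMESTAMP", "TIMESTAMP WITH TIME ZONE"},
    "boolean": {"BOOLEAN"},
}


def map_duckdb_type(declared: str) -> str:
    """Map a DuckDB type such as 'DECIMAL(18,3)' to a coarse logical type."""
    base = re.sub(r"\(.*\)", "", declared).strip().upper()
    for logical, names in _TYPE_MAP.items():
        if base in names:
            return logical
    return "string"  # assumed fallback for unknown or nested types


def quote_identifier(name: str) -> str:
    """Quote with double quotes, doubling any embedded quote."""
    return '"' + name.replace('"', '""') + '"'


print(map_duckdb_type("DECIMAL(18,3)"))  # number
print(quote_identifier('my"col'))        # "my""col"
```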

Then wire it up:

packages/cli/src/local-scan-connectors.ts — add a branch for driver === 'duckdb', mirroring the sqlite branch.
packages/context/src/project/driver-schemas.ts — extend KTX_WAREHOUSE_DRIVERS with duckdb. Connection config takes the same shape as sqlite (path or url).
Add to pnpm-workspace.yaml if it isn't auto-discovered.
pnpm install && pnpm --filter @ktx/connector-duckdb run build && pnpm --filter @ktx/cli run build.

Smoke test on playbook001:

cd /Users/klo-dev/work/spider2-ktx/work/playbook001
# edit ktx.yaml — add a duckdb connection pointing at the warehouse:
#   connections:
#     warehouse:
#       driver: duckdb
#       path: /Users/klo-dev/work/spider2-ktx/work/playbook001/dbt/playbook.duckdb
node /Users/klo-dev/conductor/workspaces/ktx/santiago/packages/cli/dist/bin.js \
    connection test warehouse
node ... scan warehouse           # populates raw-sources/
node ... ingest dbt_project --plain --yes   # should now write semantic-layer/*.yaml
node ... sl list --json

After that, update orchestrator.write_ktx_yaml() to also emit a warehouse connection per instance, pointing at work/<id>/dbt/<name>.duckdb. The <name> differs per instance (e.g. playbook.duckdb, asset.duckdb); the orchestrator already has discover_duckdb_name() for that.

7. Results so far (5-instance pilot)

Final score: 1 / 5 = 20% on the official evaluate.py.

Instance	Agent finished	Time (s)	Cost (USD)	Turns	Tool calls	Eval
playbook001	OK	82	$0.28	30	Bash 12, Read 5, Write 1, Edit 1	✅ 1.0
provider001	OK	289	$0.57	44	Bash 16, Read 7, Edit 1, Write 2	❌ 0
asana001	OK	181	$0.54	44	Bash 24, Read 1, Write 2, Edit 1	❌ 0
shopify001	OK	133	$0.50	41	Bash 13, Read 15, Write 2	❌ 0
asset001	OK	189	$0.42	44	Bash 14, Read 14, Write 2	❌ 0

Total spend on the pilot: ≈ $2.30. Mean: ~175 s, ~$0.46, ~40 turns.
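The totals and means can be rechecked from the table with a few lines of Python; the tuples below are copied from the pilot rows and nothing else is assumed.

```python
from statistics import mean

# (instance, seconds, cost in USD, turns) copied from the pilot table.
pilot = [
    ("playbook001", 82, 0.28, 30),
    ("provider001", 289, 0.57, 44),
    ("asana001", 181, 0.54, 44),
    ("shopify001", 133, 0.50, 41),
    ("asset001", 189, 0.42, 44),
]

total_cost = sum(cost for _, _, cost, _ in pilot)
print(f"total cost: ${total_cost:.2f}")                        # total cost: $2.31
print(f"mean time:  {mean(s for _, s, _, _ in pilot):.0f} s")  # mean time:  175 s
print(f"mean cost:  ${total_cost / len(pilot):.2f}")           # mean cost:  $0.46
print(f"mean turns: {mean(t for _, _, _, t in pilot):.1f}")    # mean turns: 40.6
```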

All five agent runs finished cleanly — dbt build green, every target table materialised in the DuckDB. The four failures are value-level mismatches: column orderings, tie-breaks, NULL handling, or precision/rounding diverging from gold. That's exactly the failure mode that richer ktx context (real column dtypes, sample values, primary keys, SL measures) should address.

For reference, GPT-4o reported ~10% and o1-preview ~17%, so a 20% on n=5 is roughly in band but the sample is far too small to claim a delta.

Why playbook001 passed

The wiki page cpa-roas-definitions.md pre-derived:

CPA  = total_spend / attribution_points          (column: cost_per_acquisition)
ROAS = attribution_revenue / total_spend         (column: return_on_advertising_spend)
Grain: (date_month, utm_source)

The agent read this page (via KTX_PROJECT_DIR=.. ktx wiki list --json then plain Read on ../wiki/global/cpa-roas-definitions.md), wrote the missing models/cpa_and_roas.sql directly from it, and dbt build produced the correct table.
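As a worked example of the two formulas, the sketch below applies them to one made-up (date_month, utm_source) row. The numbers are invented for illustration, and returning None for a zero denominator is an assumption; how the gold model treats that case was not checked.

```python
from typing import Optional


def cpa_and_roas(total_spend: float, attribution_points: float,
                 attribution_revenue: float) -> dict[str, Optional[float]]:
    """Apply the two formulas from the wiki page; None marks an undefined ratio."""
    return {
        "cost_per_acquisition": total_spend / attribution_points if attribution_points else None,
        "return_on_advertising_spend": attribution_revenue / total_spend if total_spend else None,
    }


# Made-up numbers for a single (date_month, utm_source) row.
print(cpa_and_roas(total_spend=200.0, attribution_points=8.0, attribution_revenue=500.0))
# {'cost_per_acquisition': 25.0, 'return_on_advertising_spend': 2.5}
```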

Why the others failed (best guesses, not investigated deeply)

provider001: gold checks provider table columns [0,1,2,5,6,7,9,10,11,12,13] and specialty_mapping columns [0,1]. All 7 tables are produced with the right schema; the tie-break logic for "most specific specialty" diverges from gold.
asana001: 95 models materialised, 55 tests passed; the gold compares asana__team [0..9] and asana__user [0,1,2] and our values differ on one or more aggregations (open vs completed task counts, avg close time).
shopify001 and asset001: similar pattern — structure right, values off.

8. Hypotheses for the next agent

In rough order of expected impact:

1. DuckDB connector (above), so that ktx scan and ktx ingest together emit semantic-layer/<conn>/<source>.yaml with real columns, types, primary keys, sample values, and (if enabled) relationship proposals. Expose those to the sub-agent via either:
   - ktx sl read <source> calls from Bash, or
   - the ktx mcp stdio server attached via claude --mcp-config.
2. Verification step in the system prompt. Currently the agent declares success as soon as dbt build is green. Add: "Before declaring success, for every target table run SELECT * FROM <t> ORDER BY 1 LIMIT 5 and sanity-check column count, types, no NaN/NULL in not-null cols, row count > 0; also compare the produced column names with the column list in the schema.yml / wiki / sl source." This is a cheap fix and should turn some of the value-mismatch fails into passes (or into productive iteration).
3. dbt run-stage tests. Spider2 examples often ship tests/; have the agent run dbt test after dbt build and treat any new test failures as a signal to revise (dbt build already executes tests, so this is mainly about making the agent read and act on the results). Some examples have gold-verifying tests in the project itself.
4. Try Opus for hard cases. The orchestrator passes --model sonnet; flipping to opus on retries of failed instances may recover some of the value-mismatch tasks. Cost goes up ~5×.
5. ktx scan query-history off. Query-history is currently skipped because the dbt adapter doesn't expose history. Once a warehouse connection exists, leave it skipped (DuckDB has no useful history for these one-shot DBs).
6. Parallelism. claude is rate-limit-sensitive, but the orchestrator is fully sequential. Two or three workers via concurrent.futures would cut wall-clock for the full 68 to roughly 1–2 h.
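The parallelism idea could be sketched with a small thread pool. `run_instance` here is a placeholder standing in for the real per-instance flow, and the worker count of 3 is only a starting point given the rate-limit sensitivity noted above.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_instance(instance_id: str) -> dict:
    """Placeholder for the real per-instance flow (workspace, ingest, agent, collect)."""
    return {"instance_id": instance_id, "ok": True}


def run_all(instance_ids: list[str], workers: int = 3) -> list[dict]:
    """Run instances on a thread pool; threads suffice because the work is subprocess-bound."""
    results = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(run_instance, i): i for i in instance_ids}
        for fut in as_completed(futures):
            try:
                results.append(fut.result())
            except Exception as exc:  # keep going if one instance crashes
                results.append({"instance_id": futures[fut], "ok": False, "error": str(exc)})
    return results


print(len(run_all(["playbook001", "provider001", "asana001"])))  # 3
```

Results are appended only in the main thread, so no lock is needed here; if the workers themselves rewrote results_metadata.jsonl, that write would need to be serialized.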

9. Resuming after the DuckDB connector lands

Concrete steps for the next agent:

Confirm the connector is wired up:

cd /Users/klo-dev/conductor/workspaces/ktx/santiago
pnpm run build
pnpm run ktx -- dev schema | jq '.properties.connections.additionalProperties.oneOf[].properties.driver.const' | sort -u
# should include "duckdb"

Update the orchestrator's ktx.yaml template in /Users/klo-dev/work/spider2-ktx/orchestrator.py (write_ktx_yaml). Pseudocode:

db_name = discover_duckdb_name(ws / "dbt")     # e.g. "playbook.duckdb"
...
connections:
  dbt_project:
    driver: dbt
    source_dir: {ws}/dbt
  warehouse:
    driver: duckdb
    path: {ws}/dbt/{db_name}
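The pseudocode above could be made concrete roughly as follows. The body of `discover_duckdb_name` is a hypothetical stand-in (the orchestrator's real helper may pick the file differently), and `render_connections` is an assumed name for the part of write_ktx_yaml that emits the connections section.

```python
from pathlib import Path


def discover_duckdb_name(dbt_dir: Path) -> str:
    """Hypothetical stand-in: return the file name of the largest .duckdb in the dbt dir."""
    files = sorted(dbt_dir.glob("*.duckdb"), key=lambda p: p.stat().st_size)
    if not files:
        raise FileNotFoundError(f"no .duckdb file in {dbt_dir}")
    return files[-1].name


def render_connections(ws: Path) -> str:
    """Render the connections section with both the dbt project and the warehouse."""
    db_name = discover_duckdb_name(ws / "dbt")
    return (
        "connections:\n"
        "  dbt_project:\n"
        "    driver: dbt\n"
        f"    source_dir: {ws}/dbt\n"
        "  warehouse:\n"
        "    driver: duckdb\n"
        f"    path: {ws}/dbt/{db_name}\n"
    )
```

Paths are written unquoted, which is fine for the workspace paths used here but would need YAML quoting if a path ever contained characters such as `:` or `#`.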

Also re-enable scan relationship discovery if it gives useful output:

scan:
  enrichment: { mode: deterministic }
  relationships:
    enabled: true
    llmProposals: true

Verify on a known-passing instance first (playbook001) to make sure the dbt+warehouse combo still emits the same wiki pages it did before, plus new SL YAML, and the score stays at 1.0:

cd /Users/klo-dev/work/spider2-ktx
source .venv/bin/activate
python orchestrator.py -n playbook001 --force --budget 3 --timeout 1500
cd Spider2/spider2-dbt/evaluation_suite
python evaluate.py --result_dir ../../../results --gold_dir ./gold

Optionally improve the system prompt in orchestrator.SYSTEM_PROMPT to instruct the agent to use SL tools:
- ktx sl list --json
- ktx sl read <source>
- ktx sl query --connection-id warehouse --measure ...
Re-run a small batch with diverse failures (provider001, asana001, shopify001, asset001) to see whether SL access lifts those scores from 0 → 1. If it moves the needle, run the full 68:
```
python orchestrator.py --budget 3 --timeout 1500 --evaluate
```
Run sequentially, 68 × ~3 min ≈ 3.4 h; at the pilot's mean cost of ~$0.46 per instance that is roughly $31.
Write the result back — append a section to this doc with the new score and a one-line note per failing instance, so we accumulate evidence over iterations rather than losing it.

10. Misc references

KTX MCP tools (see packages/context/src/mcp/context-tools.ts): connection_list, wiki_search, wiki_read, sl_read_source, sl_query, entity_details, dictionary_search, discover_data, sql_execution, memory_ingest, memory_ingest_status. sql_execution will work for DuckDB once the connector exists; today it has no transport for it.
The sqlite connector at packages/connector-sqlite/src/connector.ts is the closest template for DuckDB.
packages/context/src/ingest/adapters/dbt/ is the dbt adapter that generates the wiki pages — parse.ts reads dbt_project.yml, schema.yml, models; chunk.ts breaks them into work units; dbt.adapter.ts orchestrates.
Evaluator code is at Spider2/spider2-dbt/evaluation_suite/{evaluate.py, eval_utils.py}. duckdb_match is the only function that matters here.
Spider2 paper: https://arxiv.org/abs/2411.07763

11. Quick sanity checks for a fresh agent

# Toolchain
which node pnpm uv claude
source /Users/klo-dev/work/spider2-ktx/.venv/bin/activate && python -c "import dbt, duckdb, anthropic"

# KTX CLI build still works
cd /Users/klo-dev/conductor/workspaces/ktx/santiago
pnpm run ktx -- --help

# Orchestrator runnable
cd /Users/klo-dev/work/spider2-ktx
source .venv/bin/activate
python orchestrator.py -h

# The pilot results still score as before (playbook001 passes, 1 of 5 overall)
cd Spider2/spider2-dbt/evaluation_suite
python evaluate.py --result_dir ../../../results --gold_dir ./gold
# expects: 0.2 1 5 (current state)

If any of those fail before you do anything else, the environment has drifted — fix that before adding the connector.
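The toolchain part of these checks can also be run from Python, which is convenient if the orchestrator should refuse to start on a drifted environment. This is a sketch with hypothetical helper names; the tool and module lists mirror the shell commands above.

```python
import importlib.util
import shutil


def missing_tools(names=("node", "pnpm", "uv", "claude")) -> list[str]:
    """Return the executables from `names` that are not on PATH."""
    return [n for n in names if shutil.which(n) is None]


def missing_modules(names=("dbt", "duckdb", "anthropic")) -> list[str]:
    """Return the Python modules from `names` that cannot be located."""
    return [n for n in names if importlib.util.find_spec(n) is None]


if __name__ == "__main__":
    print("missing tools:  ", missing_tools() or "none")
    print("missing modules:", missing_modules() or "none")
```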
