# Spider2-DBT × KTX benchmarking — handoff
This document is the state of the Spider2-DBT benchmarking experiment as of
2026-05-18. It is written so that a fresh agent can pick up the work,
particularly after adding a DuckDB scan connector to KTX.
---
## 1. What we are benchmarking
[Spider2-SQL](https://spider2-sql.github.io/) is an ICLR 2025 oral benchmark
for "real-world enterprise text-to-SQL workflows". It has three tracks:
- **Spider2.0-Snow** — 547 examples, Snowflake.
- **Spider2.0-Lite** — 547 examples, BigQuery / Snowflake / SQLite.
- **Spider2.0-DBT** — 68 examples, **DuckDB-backed dbt projects**.
We are participating in the **DBT track**. Public baselines:
| Method | Spider2-SQL |
|-----------------|-------------|
| GPT-4o | ~10% |
| o1-preview | ~17% |
| Top published   | ~30–40%     |
Repo: `https://github.com/xlang-ai/Spider2`; the DBT track lives under
`spider2-dbt/`.
### Task format
Each instance is a self-contained dbt project (`dbt_project.yml`,
`profiles.yml`, `models/`, sometimes `seeds/`, `macros/`, `dbt_packages/`)
plus a `.duckdb` file pre-loaded with raw source tables. The instruction is
a single underspecified natural-language sentence, e.g.:
> "Complete the project of this database to show the metrics of each traffic
> source, I believe every touchpoint in the conversion path is equally
> important, please choose the most suitable attribution method."
The agent must edit/add models and run `dbt build` until the warehouse
contains the required tables. **All 68 instances are evaluated with
`duckdb_match`**: the official evaluator diffs specific columns of specific
tables in the agent's DuckDB against a gold DuckDB. A pass is row-set match
on `condition_cols` for each `condition_tab`.
`spider2-dbt.jsonl` (instructions) and `evaluation_suite/gold/spider2_eval.jsonl`
(evaluator config + gold DuckDBs) are both clone-time artifacts.
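To make the pass criterion concrete, here is a minimal sketch of a row-set match on selected columns. It is not the official `duckdb_match` from `eval_utils.py`, which may apply numeric tolerances or other rules not modelled here. The function name `rows_match` and the sample rows are hypothetical; only the idea of comparing `condition_cols` as an unordered collection of rows comes from the description above.

```python
from collections import Counter
from typing import Sequence

Row = Sequence[object]


def rows_match(pred: Sequence[Row], gold: Sequence[Row],
               condition_cols: Sequence[int]) -> bool:
    """Compare two tables on the selected column indices, ignoring row order."""
    def project(rows: Sequence[Row]) -> Counter:
        # Keep only the checked columns. Counter makes the comparison
        # order-insensitive while still respecting duplicate rows.
        return Counter(tuple(row[i] for i in condition_cols) for row in rows)

    return project(pred) == project(gold)


# Made-up rows: the predicted table has an extra column and a different order.
gold = [("2024-01", "google", 10.0), ("2024-01", "email", 4.0)]
pred_ok = [("2024-01", "email", 4.0, "extra"), ("2024-01", "google", 10.0, "extra")]
pred_bad = [("2024-01", "email", 4.5, "extra"), ("2024-01", "google", 10.0, "extra")]

print(rows_match(pred_ok, gold, [0, 1, 2]))   # True
print(rows_match(pred_bad, gold, [0, 1, 2]))  # False
```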
---
## 2. On-disk layout
Everything benchmark-related lives outside this repo at
`/Users/klo-dev/work/spider2-ktx/`:
```
/Users/klo-dev/work/spider2-ktx/
├── .venv/                              # Python 3.11 (uv-managed)
├── Spider2/                            # cloned `git clone xlang-ai/Spider2`
│   └── spider2-dbt/
│       ├── examples/                   # 69 dirs, 68 are in spider2-dbt.jsonl
│       │   ├── playbook001/dbt_project.yml ...
│       │   └── ...
│       ├── examples/spider2-dbt.jsonl  # 68 instance instructions
│       ├── evaluation_suite/
│       │   ├── evaluate.py             # official scorer
│       │   ├── eval_utils.py           # duckdb_match, table_match, ...
│       │   └── gold/                   # gold .duckdb per instance
│       └── setup.py                    # unpacks DBT_start_db.zip + dbt_gold.zip
├── orchestrator.py                     # main runner (see §4)
├── agent_prompt.md                     # system prompt written by orchestrator
├── work/                               # per-instance workspaces (ktx + dbt)
│   └── <instance_id>/
│       ├── ktx.yaml                    # generated by orchestrator
│       ├── .ktx/                       # ktx state (sqlite, git, cache)
│       ├── wiki/global/*.md            # OUTPUT of `ktx ingest dbt_project`
│       ├── semantic-layer/             # empty today (no DuckDB connector)
│       └── dbt/                        # copy of Spider2/spider2-dbt/examples/<id>
│           ├── dbt_project.yml
│           ├── profiles.yml
│           ├── models/...
│           └── <name>.duckdb
├── results/                            # submission folder
│   ├── results_metadata.jsonl
│   └── <instance_id>/<name>.duckdb
└── logs/<instance_id>/
    ├── ktx-init.log
    ├── ktx-ingest.log
    ├── claude.log                      # stderr from the sub-agent
    └── claude-stream.jsonl             # full structured trace
```
The two source-data zips (~1 GB) were pulled with `gdown` from the Drive
IDs in `Spider2/spider2-dbt/setup.py` and then `setup.py` was run to unpack
them in place. No need to re-do that step.
---
## 3. Current ktx.yaml per instance
Generated by `orchestrator.write_ktx_yaml()`. Same template for every
workspace, with the source_dir absolute path swapped in:
```yaml
connections:
  dbt_project:
    driver: dbt
    source_dir: /Users/klo-dev/work/spider2-ktx/work/<id>/dbt
storage:
  state: sqlite
  search: sqlite-fts5
  git:
    auto_commit: false
    author: ktx <ktx@example.com>
llm:
  provider:
    backend: claude-code   # uses local Claude Code OAuth; no API key
  models:
    default: sonnet
    triage: haiku
    candidateExtraction: sonnet
    curator: sonnet
    reconcile: sonnet
    repair: sonnet
ingest:
  adapters: [dbt]
  embeddings:
    backend: deterministic
    model: deterministic
    dimensions: 8
  workUnits:
    stepBudget: 40
    maxConcurrency: 1
    failureMode: continue
agent:
  run_research:
    enabled: false
    max_iterations: 20
    default_toolset: [sl_query, wiki_search, sl_read_source]
memory:
  auto_commit: false
scan:
  enrichment: { mode: none }
  relationships:
    enabled: false         # disabled: no warehouse to relate against
    llmProposals: false
```
Notes / gotchas learned the hard way:
- `source_dir` **must be absolute** and **must not be the same as
`--project-dir`** (the dbt adapter copies the dir into
`.ktx/cache/local-ingest/` and refuses to recursively copy a parent into
itself). Hence the `work/<id>/dbt/` sub-structure.
- `llm.provider.backend: none` (the `dev init` default) makes `ktx ingest`
on the dbt adapter fail with `"requires llm.provider.backend: anthropic,
vertex, gateway, or claude-code"`. The dbt adapter is LLM-driven.
- `llm.models.default` is **required** whenever `provider.backend != none`.
- `claude-code` backend reuses the local Claude Code OAuth session, so no
`ANTHROPIC_API_KEY` env var is needed.
- `--yes` and `--no-input` are mutually exclusive on `ktx ingest`.
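The first three gotchas can be checked mechanically before `ktx ingest` is invoked. The following is a rough sketch, assuming the config has already been loaded into a plain dict; `check_ktx_config` is a hypothetical helper, not part of ktx or of the existing orchestrator, and the `/tmp/ws/...` paths are placeholders.

```python
from pathlib import PurePosixPath


def check_ktx_config(cfg: dict, project_dir: str) -> list[str]:
    """Return a list of problems; an empty list means the checks passed."""
    problems = []

    source_dir = PurePosixPath(cfg["connections"]["dbt_project"]["source_dir"])
    if not source_dir.is_absolute():
        problems.append("source_dir must be absolute")
    elif source_dir == PurePosixPath(project_dir):
        problems.append("source_dir must differ from --project-dir")

    llm = cfg.get("llm", {})
    backend = llm.get("provider", {}).get("backend", "none")
    if backend == "none":
        # The dbt adapter is LLM-driven, so ingest fails without a backend.
        problems.append("dbt ingest needs an LLM backend (e.g. claude-code)")
    elif not llm.get("models", {}).get("default"):
        problems.append("llm.models.default is required when backend != none")

    return problems


cfg = {
    "connections": {
        "dbt_project": {"driver": "dbt", "source_dir": "/tmp/ws/playbook001/dbt"},
    },
    "llm": {"provider": {"backend": "claude-code"}, "models": {"default": "sonnet"}},
}
print(check_ktx_config(cfg, "/tmp/ws/playbook001"))  # []
```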
---
## 4. Orchestrator
`/Users/klo-dev/work/spider2-ktx/orchestrator.py` — the one moving part.
Per-instance flow:
1. `make_workspace(id)` — copy `Spider2/spider2-dbt/examples/<id>/` into
`work/<id>/dbt/`.
2. `ktx dev init <work/<id>>` and write the ktx.yaml above.
3. `ktx ingest dbt_project --plain --yes` — runs the LLM-driven dbt
adapter; output lands in `work/<id>/wiki/global/*.md`.
4. Spawn `claude --print --permission-mode bypassPermissions ...` with:
   - cwd = `work/<id>/dbt` (the agent works inside the dbt project)
   - `--add-dir work/<id>` (so the agent can read the wiki)
   - `--allowedTools Bash,Edit,Read,Write,Glob,Grep,WebFetch,TodoWrite`
   - `--system-prompt` from `SYSTEM_PROMPT` (see `agent_prompt.md`)
   - the Spider2 instruction as the prompt.
5. Stream the agent's JSONL events to `logs/<id>/claude-stream.jsonl`,
capture the final `result` message as a summary string.
6. `collect_result()` — copy the largest `*.duckdb` in `work/<id>/dbt/`
into `results/<id>/<name>.duckdb` and add an entry
`{instance_id, answer_type: "file", answer_or_path: "<name>.duckdb"}`
to `results/results_metadata.jsonl`. Metadata is re-written after every
instance, so partial runs are recoverable.
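Step 6 is simple enough to sketch in full. The code below is an illustration of the behaviour described above, not a copy of `orchestrator.py`: the real `collect_result()` may have a different signature, and `write_metadata` is a hypothetical name for the "rewrite after every instance" step.

```python
import json
import shutil
from pathlib import Path


def collect_result(instance_id: str, dbt_dir: Path, results_dir: Path) -> dict:
    """Copy the largest *.duckdb from dbt_dir into results_dir/<id>/ and
    return the metadata entry for it."""
    candidates = sorted(dbt_dir.glob("*.duckdb"), key=lambda p: p.stat().st_size)
    if not candidates:
        raise FileNotFoundError(f"no .duckdb file in {dbt_dir}")
    largest = candidates[-1]

    target_dir = results_dir / instance_id
    target_dir.mkdir(parents=True, exist_ok=True)
    shutil.copy2(largest, target_dir / largest.name)

    return {
        "instance_id": instance_id,
        "answer_type": "file",
        "answer_or_path": largest.name,
    }


def write_metadata(entries: list[dict], results_dir: Path) -> None:
    """Rewrite results_metadata.jsonl in full, one JSON object per line."""
    results_dir.mkdir(parents=True, exist_ok=True)
    path = results_dir / "results_metadata.jsonl"
    path.write_text("".join(json.dumps(entry) + "\n" for entry in entries))
```

Rewriting the whole file on each call, rather than appending, keeps the metadata consistent with whatever has been collected so far if a run is interrupted.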
CLI:
```bash
cd /Users/klo-dev/work/spider2-ktx
source .venv/bin/activate
# One specific instance
python orchestrator.py -n playbook001 --force --budget 3 --timeout 1500
# All 68
python orchestrator.py --budget 3 --timeout 1500 --evaluate
# Skip ingest (when workspace already has wiki) — speeds re-runs
python orchestrator.py -n provider001 --skip-ingest
```
Flags:
| Flag | Default | Meaning |
|------|---------|---------|
| `-n, --instance` | none | Repeatable; restrict to listed instance ids |
| `-l, --limit` | none | First N from spider2-dbt.jsonl |
| `--model` | sonnet | Claude Code model alias |
| `--budget` | 4.0 | `--max-budget-usd` per instance |
| `--timeout` | 1800 | Wall-clock seconds per instance |
| `--force` | off | Wipe and recreate workspace |
| `--skip-ingest` | off | Reuse existing wiki |
| `--evaluate` | off | Run `evaluate.py` at the end |
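For orientation, the flag table corresponds to an `argparse` setup along these lines. This is a reconstruction from the table alone, so the help strings and the `build_parser` name are assumptions; check `orchestrator.py` for the actual definitions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Parser mirroring the flag table (reconstructed, not the original code)."""
    p = argparse.ArgumentParser(prog="orchestrator.py")
    p.add_argument("-n", "--instance", action="append", default=None,
                   help="repeatable; restrict to the listed instance ids")
    p.add_argument("-l", "--limit", type=int, default=None,
                   help="first N instances from spider2-dbt.jsonl")
    p.add_argument("--model", default="sonnet", help="Claude Code model alias")
    p.add_argument("--budget", type=float, default=4.0,
                   help="passed as --max-budget-usd per instance")
    p.add_argument("--timeout", type=int, default=1800,
                   help="wall-clock seconds per instance")
    p.add_argument("--force", action="store_true", help="wipe and recreate workspace")
    p.add_argument("--skip-ingest", action="store_true", help="reuse existing wiki")
    p.add_argument("--evaluate", action="store_true", help="run evaluate.py at the end")
    return p


args = build_parser().parse_args(
    ["-n", "playbook001", "--force", "--budget", "3", "--timeout", "1500"]
)
print(args.instance, args.budget, args.timeout, args.force, args.skip_ingest)
# ['playbook001'] 3.0 1500 True False
```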
Scoring:
```bash
cd /Users/klo-dev/work/spider2-ktx/Spider2/spider2-dbt/evaluation_suite
python evaluate.py \
--result_dir /Users/klo-dev/work/spider2-ktx/results \
--gold_dir ./gold
```
The official evaluator prints `score = passes / total`, and one line per
passing instance id.
---
## 5. What ktx currently provides to the agent
`ktx ingest dbt_project --plain --yes` on a Spider2 example emits only
**wiki pages** under `work/<id>/wiki/global/*.md`. There are **no
semantic-layer entities** — `ktx sl list` returns `items: []`.
Example, for `playbook001`:
```
work/playbook001/wiki/global/
├── acme-dbt-project.md # project overview: profile, sources, models
└── cpa-roas-definitions.md # exact CPA & ROAS formulas, grain, columns
```
For `asset001`:
```
work/asset001/wiki/global/
├── dbt-asset-project-overview.md
├── bar-quotes.md
└── book-value.md
```
The wiki pages **do** carry high-signal information for these tasks — they
pre-digest the dbt project into prose with formulas, grain, columns, and
unverified-vs-verified annotations. That's what made `playbook001` score
1.0: the wiki said `CPA = total_spend / attribution_points`, `ROAS =
attribution_revenue / total_spend`, grain `(date_month, utm_source)`, and
the agent transcribed that into `cpa_and_roas.sql` directly.
The wiki itself flags the missing piece:
> "Run `ktx scan` on the DuckDB connection to populate the warehouse schema
> and enable SL source creation for these tables."
Which brings us to:
---
## 6. The DuckDB connector gap
`ktx` ships connectors for: `postgres / postgresql / mysql / snowflake /
bigquery / sqlite / sqlserver / clickhouse`. **There is no DuckDB scan
connector**. References:
- `packages/cli/src/connection.test.ts:494`: `driver: duckdb` is asserted
to be **unknown** by `createKtxCliScanConnector`.
- `packages/context/src/sl/local-query.ts:59`: `DUCKDB: 'duckdb'` is a SQL
*dialect* constant for query generation, not a connector.
- `packages/context/src/mcp/local-project-ports.ts:32`: same, a dialect
hint, not a connector.
Consequence: with the current setup we can't add a warehouse connection
that introspects each example's `.duckdb`. The dbt adapter falls back to
wiki-only output, which is why `semantic-layer/` stays empty.
### The plan you're about to act on
Add `packages/connector-duckdb/` modeled on `packages/connector-sqlite/`:
| File | Source to copy from | Adapt |
|------|---------------------|-------|
| `package.json` | `connector-sqlite/package.json` | dep `better-sqlite3` → `duckdb` (or `@duckdb/node-api`) |
| `src/dialect.ts` | `connector-sqlite/src/dialect.ts` | Quote with `"`; map types: `BIGINT → number`, `VARCHAR → string`, `TIMESTAMP → time`, etc. |
| `src/connector.ts` | `connector-sqlite/src/connector.ts` | Replace `Database` with the DuckDB equivalent. Use `information_schema` instead of `sqlite_master`/`PRAGMA table_info`. For FKs DuckDB also has `information_schema.referential_constraints` + `key_column_usage`. Estimated row counts → `SELECT estimated_size FROM duckdb_tables()`. |
| `src/index.ts` | `connector-sqlite/src/index.ts` | Re-export, plus `isKtxDuckDbConnectionConfig` |
| `src/connector.test.ts` + `dialect.test.ts` | sqlite equivalents | Mirror tests; the sqlite ones are a good template for what to cover |
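The `dialect.ts` row is small enough to prototype before porting. The sketch below is in Python only for consistency with the orchestrator; the connector itself would be TypeScript. Only the `BIGINT`, `VARCHAR`, and `TIMESTAMP` mappings and the double-quote rule come from the table above. Every other entry, the `boolean` category, the `string` fallback, and both function names are assumptions to be checked against what `connector-sqlite` actually emits.

```python
import re

DUCKDB_TO_KTX = {
    "BIGINT": "number",     # from the plan
    "VARCHAR": "string",    # from the plan
    "TIMESTAMP": "time",    # from the plan
    "INTEGER": "number",    # assumption
    "DOUBLE": "number",     # assumption
    "DECIMAL": "number",    # assumption
    "DATE": "time",         # assumption
    "BOOLEAN": "boolean",   # assumption
}


def map_duckdb_type(declared: str, default: str = "string") -> str:
    """Map a declared type such as 'DECIMAL(18,2)' to a coarse category."""
    # Strip a trailing parameter list so DECIMAL(18,2) and VARCHAR(255) match.
    base = re.sub(r"\(.*\)$", "", declared.strip()).upper()
    return DUCKDB_TO_KTX.get(base, default)


def quote_ident(name: str) -> str:
    """Double-quote an identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


print(map_duckdb_type("decimal(18,2)"))  # number
print(quote_ident('utm "source"'))       # "utm ""source"""
```

Types outside the table (for example `TIMESTAMP WITH TIME ZONE`) fall through to the default here; a real dialect would need explicit entries for them.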
Then wire it up:
1. `packages/cli/src/local-scan-connectors.ts` — add a branch for
`driver === 'duckdb'`, mirroring the sqlite branch.
2. `packages/context/src/project/driver-schemas.ts` — extend
`KTX_WAREHOUSE_DRIVERS` with `duckdb`. Connection config takes the same
shape as sqlite (`path` or `url`).
3. Add to `pnpm-workspace.yaml` if it isn't auto-discovered.
4. `pnpm install && pnpm --filter @ktx/connector-duckdb run build && pnpm
--filter @ktx/cli run build`.
Smoke test on `playbook001`:
```bash
cd /Users/klo-dev/work/spider2-ktx/work/playbook001
# edit ktx.yaml to add a duckdb connection pointing at the warehouse:
# connections:
#   warehouse:
#     driver: duckdb
#     path: /Users/klo-dev/work/spider2-ktx/work/playbook001/dbt/playbook.duckdb
node /Users/klo-dev/conductor/workspaces/ktx/santiago/packages/cli/dist/bin.js \
connection test warehouse
node ... scan warehouse # populates raw-sources/
node ... ingest dbt_project --plain --yes # should now write semantic-layer/*.yaml
node ... sl list --json
```
After that, update `orchestrator.write_ktx_yaml()` to also emit a
`warehouse` connection per instance, pointing at
`work/<id>/dbt/<name>.duckdb`. The `<name>` differs per instance (e.g.
`playbook.duckdb`, `asset.duckdb`); the orchestrator already has
`discover_duckdb_name()` for that.
---
## 7. Results so far (5-instance pilot)
Final score: **1 / 5 = 20%** on the official `evaluate.py`.
| Instance | Agent finished | Time (s) | Cost (USD) | Turns | Tool calls | Eval |
|---------------|----------------|----------|------------|-------|---------------------------------------------|------|
| playbook001 | OK | 82 | $0.28 | 30 | Bash 12, Read 5, Write 1, Edit 1 | ✅ 1.0 |
| provider001 | OK | 289 | $0.57 | 44 | Bash 16, Read 7, Edit 1, Write 2 | ❌ 0 |
| asana001 | OK | 181 | $0.54 | 44 | Bash 24, Read 1, Write 2, Edit 1 | ❌ 0 |
| shopify001 | OK | 133 | $0.50 | 41 | Bash 13, Read 15, Write 2 | ❌ 0 |
| asset001 | OK | 189 | $0.42 | 44 | Bash 14, Read 14, Write 2 | ❌ 0 |
Total spend on the pilot: ≈ $2.30. Mean: ~175 s, ~$0.46, ~40 turns.
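The summary figures can be re-derived from the table; the snippet below simply recomputes them from the five rows, so the doc's rounded values can be compared against the exact ones.

```python
# (time in seconds, cost in USD, turns) per instance, copied from the table.
pilot = {
    "playbook001": (82, 0.28, 30),
    "provider001": (289, 0.57, 44),
    "asana001": (181, 0.54, 44),
    "shopify001": (133, 0.50, 41),
    "asset001": (189, 0.42, 44),
}

n = len(pilot)
total_cost = sum(cost for _, cost, _ in pilot.values())
mean_time = sum(seconds for seconds, _, _ in pilot.values()) / n
mean_turns = sum(turns for _, _, turns in pilot.values()) / n

print(f"total ${total_cost:.2f}, mean {mean_time:.1f} s, "
      f"${total_cost / n:.2f}, {mean_turns:.1f} turns")
# total $2.31, mean 174.8 s, $0.46, 40.6 turns
```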
**All five agent runs finished cleanly** — `dbt build` green, every target
table materialised in the DuckDB. The four failures are *value-level*
mismatches: column orderings, tie-breaks, NULL handling, or
precision/rounding diverging from gold. That's exactly the failure mode
that richer ktx context (real column dtypes, sample values, primary keys,
SL measures) should address.
For reference, GPT-4o reported ~10% and o1-preview ~17%, so a 20% on n=5 is
roughly in band but the sample is far too small to claim a delta.
### Why playbook001 passed
The wiki page `cpa-roas-definitions.md` pre-derived:
```
CPA = total_spend / attribution_points (column: cost_per_acquisition)
ROAS = attribution_revenue / total_spend (column: return_on_advertising_spend)
Grain: (date_month, utm_source)
```
The agent read this page (via `KTX_PROJECT_DIR=.. ktx wiki list --json`
then plain `Read` on `../wiki/global/cpa-roas-definitions.md`), wrote the
missing `models/cpa_and_roas.sql` directly from it, and `dbt build`
produced the correct table.
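As a worked instance of the two formulas, here is a small sketch for a single `(date_month, utm_source)` row. The input numbers are made up, and returning `None` for a zero denominator is an assumption; the wiki formulas do not say how the gold model handles that case.

```python
from typing import Optional


def cpa_and_roas(total_spend: float, attribution_points: float,
                 attribution_revenue: float) -> dict[str, Optional[float]]:
    """Apply the wiki formulas to one (date_month, utm_source) row."""
    # None stands in for SQL NULL when a denominator is zero (assumption).
    cpa = total_spend / attribution_points if attribution_points else None
    roas = attribution_revenue / total_spend if total_spend else None
    return {"cost_per_acquisition": cpa, "return_on_advertising_spend": roas}


row = cpa_and_roas(total_spend=200.0, attribution_points=8.0,
                   attribution_revenue=500.0)
print(row)
# {'cost_per_acquisition': 25.0, 'return_on_advertising_spend': 2.5}
```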
### Why the others failed (best guesses, not investigated deeply)
- `provider001`: gold checks `provider` table columns
`[0,1,2,5,6,7,9,10,11,12,13]` and `specialty_mapping` columns `[0,1]`.
All 7 tables are produced with the right schema; the tie-break logic for
"most specific specialty" diverges from gold.
- `asana001`: 95 models materialised, 55 tests passed; the gold compares
`asana__team [0..9]` and `asana__user [0,1,2]` and our values differ on
one or more aggregations (open vs completed task counts, avg close
time).
- `shopify001` and `asset001`: similar pattern — structure right, values
off.
---
## 8. Hypotheses for the next agent
In rough order of expected impact:
1. **DuckDB connector** (above) so `ktx scan` and `ktx ingest` together
emit `semantic-layer/<conn>/<source>.yaml` with real columns, types,
primary keys, sample values, and (if enabled) relationship proposals.
Expose those to the sub-agent via either:
   - `ktx sl read <source>` calls from Bash, or
   - the `ktx mcp stdio` server attached via `claude --mcp-config`.
2. **Verification step in the system prompt** — currently the agent
declares success on `dbt build` green. Add: "Before declaring success,
for every target table run `SELECT * FROM <t> ORDER BY 1 LIMIT 5` and
sanity-check column count, types, no NaN/NULL in not-null cols, row
count > 0; also compare the produced column names with the column list
in the schema.yml / wiki / sl source." Cheap fix; should turn some of
the value-mismatch fails into passes (or into productive iteration).
3. **dbt run-stage tests** — Spider2 examples often ship `tests/`; have
the agent run `dbt test` after `dbt build` and treat any new test
failures as a signal to revise. Some examples actually have gold-
verifying tests in the project itself.
4. **Try Opus for hard cases** — the orchestrator passes `--model
sonnet`; flipping to `opus` on the retries of failed instances may
recover some of the value-mismatch tasks. Cost goes up ~5×.
5. **`ktx scan` query-history off** — currently `query-history` is
`skipped` because the dbt adapter doesn't expose history. Once a
warehouse connection exists, leave it skipped (DuckDB has no useful
history for these one-shot DBs).
6. **Parallelism** — `claude` is rate-limit-sensitive but the orchestrator
is fully sequential. Two or three workers via `concurrent.futures`
would cut wall-clock to ~1h for the full 68.
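For hypothesis 6, a thread pool is probably sufficient because each instance spends its time in `ktx` and `claude` subprocesses rather than in Python. A minimal sketch, assuming the existing per-instance flow can be wrapped in a single callable (`run_one` here is a hypothetical stand-in, as is `run_parallel`):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Iterable


def run_parallel(instance_ids: Iterable[str],
                 run_one: Callable[[str], dict],
                 max_workers: int = 3) -> dict[str, dict]:
    """Run run_one(id) for every id on a small thread pool.

    A failure in one instance is recorded and does not stop the others.
    """
    results: dict[str, dict] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(run_one, iid): iid for iid in instance_ids}
        for future in as_completed(futures):
            iid = futures[future]
            try:
                results[iid] = future.result()
            except Exception as exc:
                results[iid] = {"instance_id": iid, "error": str(exc)}
            # Results are gathered in this single thread, so this is also a
            # safe place to rewrite results_metadata.jsonl after each instance.
    return results
```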
---
## 9. Resuming after the DuckDB connector lands
Concrete steps for the next agent:
1. **Confirm the connector is wired up**:
```bash
cd /Users/klo-dev/conductor/workspaces/ktx/santiago
pnpm run build
pnpm run ktx -- dev schema | jq '.properties.connections.additionalProperties.oneOf[].properties.driver.const' | sort -u
# should include "duckdb"
```
2. **Update the orchestrator's ktx.yaml template** in
`/Users/klo-dev/work/spider2-ktx/orchestrator.py` (`write_ktx_yaml`).
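Sketch (a fragment for the body of `write_ktx_yaml`; `ws` is the absolute
workspace path and `discover_duckdb_name()` already exists in the orchestrator):
```python
db_name = discover_duckdb_name(ws / "dbt")  # e.g. "playbook.duckdb"
# Render the connections block; the rest of the template stays as in §3.
connections_yaml = f"""\
connections:
  dbt_project:
    driver: dbt
    source_dir: {ws}/dbt
  warehouse:
    driver: duckdb
    path: {ws}/dbt/{db_name}
"""
```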
Also re-enable scan relationship discovery if it gives useful output:
```yaml
scan:
  enrichment: { mode: deterministic }
  relationships:
    enabled: true
    llmProposals: true
```
3. **Verify on a known-passing instance first** (`playbook001`) to make
sure the dbt+warehouse combo still emits the same wiki pages it did
before, plus new SL YAML, and the score stays at 1.0:
```bash
cd /Users/klo-dev/work/spider2-ktx
source .venv/bin/activate
python orchestrator.py -n playbook001 --force --budget 3 --timeout 1500
cd Spider2/spider2-dbt/evaluation_suite
python evaluate.py --result_dir ../../../results --gold_dir ./gold
```
4. **Optionally** improve the system prompt in `orchestrator.SYSTEM_PROMPT`
to instruct the agent to use SL tools:
- `ktx sl list --json`
- `ktx sl read <source>`
- `ktx sl query --connection-id warehouse --measure ...`
5. **Re-run a small batch** with diverse failures (`provider001`,
`asana001`, `shopify001`, `asset001`) to see whether SL access lifts
those scores from 0 → 1. If it moves the needle, run the full 68:
```bash
python orchestrator.py --budget 3 --timeout 1500 --evaluate
```
Run sequentially, 68 × ~3 min ≈ 3.4 h; at the pilot mean of ~$0.46 per instance that is roughly $31.
6. **Write the result back** — append a section to this doc with the new
score and a one-line note per failing instance, so we accumulate
evidence over iterations rather than losing it.
---
## 10. Misc references
- KTX MCP tools (see `packages/context/src/mcp/context-tools.ts`):
`connection_list`, `wiki_search`, `wiki_read`, `sl_read_source`,
`sl_query`, `entity_details`, `dictionary_search`, `discover_data`,
`sql_execution`, `memory_ingest`, `memory_ingest_status`.
`sql_execution` will work for DuckDB once the connector exists; today
it has no transport for it.
- The sqlite connector at `packages/connector-sqlite/src/connector.ts`
is the closest template for DuckDB.
- `packages/context/src/ingest/adapters/dbt/` is the dbt adapter that
generates the wiki pages — `parse.ts` reads `dbt_project.yml`,
`schema.yml`, models; `chunk.ts` breaks them into work units;
`dbt.adapter.ts` orchestrates.
- Evaluator code is at
`Spider2/spider2-dbt/evaluation_suite/{evaluate.py, eval_utils.py}`.
`duckdb_match` is the only function that matters here.
- Spider2 paper: https://arxiv.org/abs/2411.07763
---
## 11. Quick sanity checks for a fresh agent
```bash
# Toolchain
which node pnpm uv claude
source /Users/klo-dev/work/spider2-ktx/.venv/bin/activate && python -c "import dbt, duckdb, anthropic"
# KTX CLI build still works
cd /Users/klo-dev/conductor/workspaces/ktx/santiago
pnpm run ktx -- --help
# Orchestrator runnable
cd /Users/klo-dev/work/spider2-ktx
source .venv/bin/activate
python orchestrator.py -h
# The pilot results still score 1/5 (only playbook001 passes)
cd Spider2/spider2-dbt/evaluation_suite
python evaluate.py --result_dir ../../../results --gold_dir ./gold
# expects: 0.2 1 5 (current state)
```
If any of those fail before you do anything else, the environment has
drifted — fix that before adding the connector.