# Spider2-DBT × KTX benchmarking — handoff

This document is the state of the Spider2-DBT benchmarking experiment as of
2026-05-18. It is written so that a fresh agent can pick up the work,
particularly after adding a DuckDB scan connector to KTX.

---

## 1. What we are benchmarking

[Spider2-SQL](https://spider2-sql.github.io/) is an ICLR 2025 oral benchmark
for "real-world enterprise text-to-SQL workflows". It has three tracks:

- **Spider2.0-Snow** — 547 examples, Snowflake.
- **Spider2.0-Lite** — 547 examples, BigQuery / Snowflake / SQLite.
- **Spider2.0-DBT** — 68 examples, **DuckDB-backed dbt projects**.

We are participating in the **DBT track**. Public baselines:

| Method          | Spider2-SQL |
|-----------------|-------------|
| GPT-4o          | ~10%        |
| o1-preview      | ~17%        |
| Top published   | ~30–40%     |

Repo: `https://github.com/xlang-ai/Spider2`, the DBT track is under
`spider2-dbt/`.

### Task format

Each instance is a self-contained dbt project (`dbt_project.yml`,
`profiles.yml`, `models/`, sometimes `seeds/`, `macros/`, `dbt_packages/`)
plus a `.duckdb` file pre-loaded with raw source tables. The instruction is
a single underspecified natural-language sentence, e.g.:

> "Complete the project of this database to show the metrics of each traffic
> source, I believe every touchpoint in the conversion path is equally
> important, please choose the most suitable attribution method."

The agent must edit/add models and run `dbt build` until the warehouse
contains the required tables. **All 68 instances are evaluated with
`duckdb_match`**: the official evaluator diffs specific columns of specific
tables in the agent's DuckDB against a gold DuckDB. A pass is row-set match
on `condition_cols` for each `condition_tab`.

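To make the pass criterion concrete, here is a minimal sketch of a row-set comparison restricted to `condition_cols`. It is not the official `duckdb_match` from `eval_utils.py`: the names `project` and `rows_match`, the float rounding, the multiset semantics, and the in-memory row lists are all assumptions for illustration.

```python
from collections import Counter


def project(rows, condition_cols, ndigits=6):
    """Keep only the compared columns; round floats so tiny noise is ignored."""
    projected = []
    for row in rows:
        key = tuple(
            round(row[i], ndigits) if isinstance(row[i], float) else row[i]
            for i in condition_cols
        )
        projected.append(key)
    return projected


def rows_match(pred_rows, gold_rows, condition_cols):
    """True when both tables hold the same multiset of rows on condition_cols."""
    return Counter(project(pred_rows, condition_cols)) == Counter(
        project(gold_rows, condition_cols)
    )


# Hypothetical rows: (date_month, utm_source, metric, extra column)
gold = [("2024-01", "google", 12.5, "x"), ("2024-01", "bing", 3.0, "y")]
pred = [("2024-01", "bing", 3.0, "ignored"), ("2024-01", "google", 12.5, "ignored")]

print(rows_match(pred, gold, condition_cols=[0, 1, 2]))     # True: row order and column 3 are ignored
print(rows_match(pred, gold, condition_cols=[0, 1, 2, 3]))  # False: column 3 differs
```

The sketch shows why a structurally correct table can still fail: any difference in a compared column, including rounding or NULL handling, changes the projected rows.
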
`spider2-dbt.jsonl` (instructions) and `evaluation_suite/gold/spider2_eval.jsonl`
(evaluator config + gold DuckDBs) are both clone-time artifacts.

---

## 2. On-disk layout

Everything benchmark-related lives outside this repo at
`/Users/klo-dev/work/spider2-ktx/`:

```
/Users/klo-dev/work/spider2-ktx/
├── .venv/                              # Python 3.11 (uv-managed)
├── Spider2/                            # cloned `git clone xlang-ai/Spider2`
│   └── spider2-dbt/
│       ├── examples/                   # 69 dirs, 68 are in spider2-dbt.jsonl
│       │   ├── playbook001/dbt_project.yml ...
│       │   └── ...
│       ├── examples/spider2-dbt.jsonl  # 68 instance instructions
│       ├── evaluation_suite/
│       │   ├── evaluate.py             # official scorer
│       │   ├── eval_utils.py           # duckdb_match, table_match, ...
│       │   └── gold/                   # gold .duckdb per instance
│       └── setup.py                    # unpacks DBT_start_db.zip + dbt_gold.zip
├── orchestrator.py                     # main runner (see §4)
├── agent_prompt.md                     # system prompt written by orchestrator
├── work/                               # per-instance workspaces (ktx + dbt)
│   └── <instance_id>/
│       ├── ktx.yaml                    # generated by orchestrator
│       ├── .ktx/                       # ktx state (sqlite, git, cache)
│       ├── wiki/global/*.md            # OUTPUT of `ktx ingest dbt_project`
│       ├── semantic-layer/             # empty today (no DuckDB connector)
│       └── dbt/                        # copy of Spider2/spider2-dbt/examples/<id>
│           ├── dbt_project.yml
│           ├── profiles.yml
│           ├── models/...
│           └── <name>.duckdb
├── results/                            # submission folder
│   ├── results_metadata.jsonl
│   └── <instance_id>/<name>.duckdb
└── logs/<instance_id>/
    ├── ktx-init.log
    ├── ktx-ingest.log
    ├── claude.log                      # stderr from the sub-agent
    └── claude-stream.jsonl             # full structured trace
```

The two source-data zips (~1 GB) were pulled with `gdown` from the Drive
IDs in `Spider2/spider2-dbt/setup.py` and then `setup.py` was run to unpack
them in place. No need to re-do that step.

---

## 3. Current ktx.yaml per instance

Generated by `orchestrator.write_ktx_yaml()`. Same template for every
workspace, with the source_dir absolute path swapped in:

```yaml
connections:
  dbt_project:
    driver: dbt
    source_dir: /Users/klo-dev/work/spider2-ktx/work/<id>/dbt
storage:
  state: sqlite
  search: sqlite-fts5
  git:
    auto_commit: false
    author: ktx <ktx@example.com>
llm:
  provider:
    backend: claude-code   # uses local Claude Code OAuth — no API key
  models:
    default: sonnet
    triage: haiku
    candidateExtraction: sonnet
    curator: sonnet
    reconcile: sonnet
    repair: sonnet
ingest:
  adapters: [dbt]
  embeddings:
    backend: deterministic
    model: deterministic
    dimensions: 8
  workUnits:
    stepBudget: 40
    maxConcurrency: 1
    failureMode: continue
agent:
  run_research:
    enabled: false
    max_iterations: 20
    default_toolset: [sl_query, wiki_search, sl_read_source]
memory:
  auto_commit: false
scan:
  enrichment: { mode: none }
  relationships:
    enabled: false         # disabled — no warehouse to relate against
    llmProposals: false
```

Notes / gotchas learned the hard way:

- `source_dir` **must be absolute** and **must not be the same as
  `--project-dir`** (the dbt adapter copies the dir into
  `.ktx/cache/local-ingest/` and refuses to recursively copy a parent into
  itself). Hence the `work/<id>/dbt/` sub-structure.
- `llm.provider.backend: none` (the `dev init` default) makes `ktx ingest`
  on the dbt adapter fail with `"requires llm.provider.backend: anthropic,
  vertex, gateway, or claude-code"`. The dbt adapter is LLM-driven.
- `llm.models.default` is **required** whenever `provider.backend != none`.
- `claude-code` backend reuses the local Claude Code OAuth session, so no
  `ANTHROPIC_API_KEY` env var is needed.
- `--yes` and `--no-input` are mutually exclusive on `ktx ingest`.

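The first three gotchas can be checked before `ktx ingest` is ever launched. The following is a minimal sketch, assuming the config has already been loaded into a plain dict with the nesting shown above; `check_ktx_config` is a hypothetical helper, not part of ktx or the orchestrator.

```python
from pathlib import Path


def check_ktx_config(config: dict, project_dir: str) -> list[str]:
    """Return human-readable problems; an empty list means the checks passed."""
    problems = []

    source_dir = Path(config["connections"]["dbt_project"]["source_dir"])
    if not source_dir.is_absolute():
        problems.append("source_dir must be an absolute path")
    elif source_dir == Path(project_dir):
        problems.append("source_dir must differ from --project-dir")

    llm = config.get("llm", {})
    backend = llm.get("provider", {}).get("backend", "none")
    if backend == "none":
        problems.append("the dbt adapter needs an LLM backend other than 'none'")
    elif not llm.get("models", {}).get("default"):
        problems.append("llm.models.default is required when a backend is set")
    return problems


# Hypothetical config mirroring the template above.
config = {
    "connections": {"dbt_project": {"driver": "dbt", "source_dir": "/w/playbook001/dbt"}},
    "llm": {"provider": {"backend": "claude-code"}, "models": {"default": "sonnet"}},
}
print(check_ktx_config(config, project_dir="/w/playbook001"))      # []
print(check_ktx_config(config, project_dir="/w/playbook001/dbt"))  # one problem reported
```

Failing fast here is cheaper than discovering the same problem from `ktx-ingest.log` after a workspace has been built.
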

---

## 4. Orchestrator

`/Users/klo-dev/work/spider2-ktx/orchestrator.py` — the one moving part.

Per-instance flow:

1. `make_workspace(id)` — copy `Spider2/spider2-dbt/examples/<id>/` into
   `work/<id>/dbt/`.
2. `ktx dev init <work/<id>>` and write the ktx.yaml above.
3. `ktx ingest dbt_project --plain --yes` — runs the LLM-driven dbt
   adapter; output lands in `work/<id>/wiki/global/*.md`.
4. Spawn `claude --print --permission-mode bypassPermissions ...` with
   - cwd = `work/<id>/dbt` (the agent works inside the dbt project)
   - `--add-dir work/<id>` (so the agent can read the wiki)
   - `--allowedTools Bash,Edit,Read,Write,Glob,Grep,WebFetch,TodoWrite`
   - `--system-prompt` from `SYSTEM_PROMPT` (see `agent_prompt.md`)
   - the prompt is the Spider2 instruction.
5. Stream the agent's JSONL events to `logs/<id>/claude-stream.jsonl`,
   capture the final `result` message as a summary string.
6. `collect_result()` — copy the largest `*.duckdb` in `work/<id>/dbt/`
   into `results/<id>/<name>.duckdb` and add an entry
   `{instance_id, answer_type: "file", answer_or_path: "<name>.duckdb"}`
   to `results/results_metadata.jsonl`. Metadata is re-written after every
   instance, so partial runs are recoverable.

CLI:

```bash
cd /Users/klo-dev/work/spider2-ktx
source .venv/bin/activate

# One specific instance
python orchestrator.py -n playbook001 --force --budget 3 --timeout 1500

# All 68
python orchestrator.py --budget 3 --timeout 1500 --evaluate

# Skip ingest (when workspace already has wiki) — speeds re-runs
python orchestrator.py -n provider001 --skip-ingest
```

Flags:

| Flag | Default | Meaning |
|------|---------|---------|
| `-n, --instance` | none | Repeatable; restrict to listed instance ids |
| `-l, --limit` | none | First N from spider2-dbt.jsonl |
| `--model` | sonnet | Claude Code model alias |
| `--budget` | 4.0 | `--max-budget-usd` per instance |
| `--timeout` | 1800 | Wall-clock seconds per instance |
| `--force` | off | Wipe and recreate workspace |
| `--skip-ingest` | off | Reuse existing wiki |
| `--evaluate` | off | Run `evaluate.py` at the end |

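For orientation, the flag table maps onto an `argparse` definition along these lines. This is a sketch reconstructed from the table only; the real parser in `orchestrator.py` may differ in types, help text, or structure, and `build_parser` is a hypothetical name.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="orchestrator.py")
    parser.add_argument("-n", "--instance", action="append", default=None,
                        help="Repeatable; restrict to listed instance ids")
    parser.add_argument("-l", "--limit", type=int, default=None,
                        help="First N from spider2-dbt.jsonl")
    parser.add_argument("--model", default="sonnet", help="Claude Code model alias")
    parser.add_argument("--budget", type=float, default=4.0,
                        help="Forwarded as --max-budget-usd per instance")
    parser.add_argument("--timeout", type=int, default=1800,
                        help="Wall-clock seconds per instance")
    parser.add_argument("--force", action="store_true", help="Wipe and recreate workspace")
    parser.add_argument("--skip-ingest", action="store_true", help="Reuse existing wiki")
    parser.add_argument("--evaluate", action="store_true", help="Run evaluate.py at the end")
    return parser


args = build_parser().parse_args(
    ["-n", "playbook001", "--force", "--budget", "3", "--timeout", "1500"]
)
print(args.instance, args.budget, args.timeout, args.force, args.skip_ingest)
# ['playbook001'] 3.0 1500 True False
```

`action="append"` is what makes `-n` repeatable: each occurrence adds one id to the list, and omitting it leaves `args.instance` as `None`, meaning "all instances".
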
Scoring:

```bash
cd /Users/klo-dev/work/spider2-ktx/Spider2/spider2-dbt/evaluation_suite
python evaluate.py \
  --result_dir /Users/klo-dev/work/spider2-ktx/results \
  --gold_dir ./gold
```

The official evaluator prints `score = passes / total`, and one line per
passing instance id.

---

## 5. What ktx currently provides to the agent

`ktx ingest dbt_project --plain --yes` on a Spider2 example emits only
**wiki pages** under `work/<id>/wiki/global/*.md`. There are **no
semantic-layer entities** — `ktx sl list` returns `items: []`.

Example, for `playbook001`:

```
work/playbook001/wiki/global/
├── acme-dbt-project.md         # project overview: profile, sources, models
└── cpa-roas-definitions.md     # exact CPA & ROAS formulas, grain, columns
```

For `asset001`:

```
work/asset001/wiki/global/
├── dbt-asset-project-overview.md
├── bar-quotes.md
└── book-value.md
```

The wiki pages **do** carry high-signal information for these tasks — they
pre-digest the dbt project into prose with formulas, grain, columns, and
unverified-vs-verified annotations. That's what made `playbook001` score
1.0: the wiki said `CPA = total_spend / attribution_points`, `ROAS =
attribution_revenue / total_spend`, grain `(date_month, utm_source)`, and
the agent transcribed that into `cpa_and_roas.sql` directly.

The wiki itself flags the missing piece:

> "Run `ktx scan` on the DuckDB connection to populate the warehouse schema
> and enable SL source creation for these tables."

Which brings us to:

---

## 6. The DuckDB connector gap

`ktx` ships connectors for: `postgres / postgresql / mysql / snowflake /
bigquery / sqlite / sqlserver / clickhouse`. **There is no DuckDB scan
connector**. References:

- `packages/cli/src/connection.test.ts:494` — `driver: duckdb` is asserted
  to be **unknown** by `createKtxCliScanConnector`.
- `packages/context/src/sl/local-query.ts:59` — `DUCKDB: 'duckdb'` is a SQL
  *dialect* constant for query generation, not a connector.
- `packages/context/src/mcp/local-project-ports.ts:32` — same: dialect
  hint, not a connector.

Consequence: with the current setup we can't add a warehouse connection
that introspects each example's `.duckdb`. The dbt adapter falls back to
wiki-only output, which is why `semantic-layer/` stays empty.

### The plan you're about to act on

Add `packages/connector-duckdb/` modeled on `packages/connector-sqlite/`:

| File | Source to copy from | Adapt |
|------|---------------------|-------|
| `package.json` | `connector-sqlite/package.json` | dep `better-sqlite3` → `duckdb` (or `@duckdb/node-api`) |
| `src/dialect.ts` | `connector-sqlite/src/dialect.ts` | Quote with `"`; map types: `BIGINT → number`, `VARCHAR → string`, `TIMESTAMP → time`, etc. |
| `src/connector.ts` | `connector-sqlite/src/connector.ts` | Replace `Database` with the DuckDB equivalent. Use `information_schema` instead of `sqlite_master`/`PRAGMA table_info`. For FKs DuckDB also has `information_schema.referential_constraints` + `key_column_usage`. Estimated row counts → `SELECT estimated_size FROM duckdb_tables()`. |
| `src/index.ts` | `connector-sqlite/src/index.ts` | Re-export, plus `isKtxDuckDbConnectionConfig` |
| `src/connector.test.ts` + `dialect.test.ts` | sqlite equivalents | Mirror tests; the sqlite ones are a good template for what to cover |

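The dialect row of the table can be prototyped before any TypeScript is written. The sketch below is in Python purely for brevity (the real `src/dialect.ts` would be TypeScript) and is not taken from the sqlite connector. The categories `number`, `string`, and `time` come from the table; the `boolean` category, the fallback to `string`, and the exact type lists are assumptions.

```python
import re

# DuckDB type families. The lists are illustrative, not exhaustive.
_NUMERIC = {"TINYINT", "SMALLINT", "INTEGER", "BIGINT", "HUGEINT",
            "UTINYINT", "USMALLINT", "UINTEGER", "UBIGINT",
            "FLOAT", "REAL", "DOUBLE", "DECIMAL", "NUMERIC"}
_TIME = {"DATE", "TIME", "TIMESTAMP", "TIMESTAMPTZ", "TIMESTAMP WITH TIME ZONE"}
_STRING = {"VARCHAR", "CHAR", "TEXT", "UUID"}


def map_duckdb_type(declared: str) -> str:
    """Map a declared DuckDB column type to a coarse semantic-layer category."""
    # Strip parameters such as DECIMAL(18,3) or VARCHAR(255) before the lookup.
    base = re.sub(r"\(.*\)", "", declared).strip().upper()
    if base in _NUMERIC:
        return "number"
    if base in _TIME:
        return "time"
    if base == "BOOLEAN":
        return "boolean"  # assumption: the plan does not name this category
    if base in _STRING:
        return "string"
    return "string"  # assumption: unknown types degrade to string


def quote_ident(name: str) -> str:
    """Quote an identifier with double quotes, doubling any embedded quote."""
    return '"' + name.replace('"', '""') + '"'


print(map_duckdb_type("BIGINT"), map_duckdb_type("DECIMAL(18,3)"), map_duckdb_type("timestamp"))
print(quote_ident('order "items"'))
```

Normalising the declared type first keeps the lookup tables small, since parameterised types such as `DECIMAL(18,3)` collapse onto their base name.
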
Then wire it up:

1. `packages/cli/src/local-scan-connectors.ts` — add a branch for
   `driver === 'duckdb'`, mirroring the sqlite branch.
2. `packages/context/src/project/driver-schemas.ts` — extend
   `KTX_WAREHOUSE_DRIVERS` with `duckdb`. Connection config takes the same
   shape as sqlite (`path` or `url`).
3. Add to `pnpm-workspace.yaml` if it isn't auto-discovered.
4. `pnpm install && pnpm --filter @ktx/connector-duckdb run build && pnpm
   --filter @ktx/cli run build`.

Smoke test on `playbook001`:

```bash
cd /Users/klo-dev/work/spider2-ktx/work/playbook001
# edit ktx.yaml — add a duckdb connection pointing at the warehouse:
#   connections:
#     warehouse:
#       driver: duckdb
#       path: /Users/klo-dev/work/spider2-ktx/work/playbook001/dbt/playbook.duckdb
node /Users/klo-dev/conductor/workspaces/ktx/santiago/packages/cli/dist/bin.js \
  connection test warehouse
node ... scan warehouse                       # populates raw-sources/
node ... ingest dbt_project --plain --yes     # should now write semantic-layer/*.yaml
node ... sl list --json
```

After that, update `orchestrator.write_ktx_yaml()` to also emit a
`warehouse` connection per instance, pointing at
`work/<id>/dbt/<name>.duckdb`. The `<name>` differs per instance (e.g.
`playbook.duckdb`, `asset.duckdb`); the orchestrator already has
`discover_duckdb_name()` for that.


---

## 7. Results so far (5-instance pilot)

Final score: **1 / 5 = 20%** on the official `evaluate.py`.

| Instance      | Agent finished | Time (s) | Cost (USD) | Turns | Tool calls                                  | Eval |
|---------------|----------------|----------|------------|-------|---------------------------------------------|------|
| playbook001   | OK             | 82       | $0.28      | 30    | Bash 12, Read 5, Write 1, Edit 1            | ✅ 1.0 |
| provider001   | OK             | 289      | $0.57      | 44    | Bash 16, Read 7, Edit 1, Write 2            | ❌ 0 |
| asana001      | OK             | 181      | $0.54      | 44    | Bash 24, Read 1, Write 2, Edit 1            | ❌ 0 |
| shopify001    | OK             | 133      | $0.50      | 41    | Bash 13, Read 15, Write 2                   | ❌ 0 |
| asset001      | OK             | 189      | $0.42      | 44    | Bash 14, Read 14, Write 2                   | ❌ 0 |

Total spend on the pilot: ≈ $2.30. Mean: ~175 s, ~$0.46, ~40 turns.

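The totals and means can be recomputed from the table, which also gives a naive extrapolation to all 68 instances. The figures are copied from the table above; the extrapolation assumes the pilot means hold for the full set and covers agent time and cost only, which is an assumption rather than a measured result.

```python
# (time_s, cost_usd, turns, passed) per instance, copied from the pilot table.
pilot = {
    "playbook001": (82, 0.28, 30, True),
    "provider001": (289, 0.57, 44, False),
    "asana001": (181, 0.54, 44, False),
    "shopify001": (133, 0.50, 41, False),
    "asset001": (189, 0.42, 44, False),
}

n = len(pilot)
passes = sum(1 for row in pilot.values() if row[3])
total_cost = sum(row[1] for row in pilot.values())
mean_time = sum(row[0] for row in pilot.values()) / n
mean_cost = total_cost / n
mean_turns = sum(row[2] for row in pilot.values()) / n

print(f"score = {passes}/{n} = {passes / n:.0%}")
print(f"total cost = ${total_cost:.2f}")
print(f"mean: {mean_time:.1f} s, ${mean_cost:.2f}, {mean_turns:.1f} turns")

# Naive projection to the full DBT track (68 instances, sequential).
full_hours = 68 * mean_time / 3600
full_cost = 68 * mean_cost
print(f"full run: about {full_hours:.1f} h and ${full_cost:.0f}")
```

The projection is only as good as the five-instance sample, so it is best read as an order-of-magnitude budget check.
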
**All five agent runs finished cleanly** — `dbt build` green, every target
table materialised in the DuckDB. The four failures are *value-level*
mismatches: column orderings, tie-breaks, NULL handling, or
precision/rounding diverging from gold. That's exactly the failure mode
that richer ktx context (real column dtypes, sample values, primary keys,
SL measures) should address.

For reference, GPT-4o reported ~10% and o1-preview ~17%, so a 20% on n=5 is
roughly in band but the sample is far too small to claim a delta.

### Why playbook001 passed

The wiki page `cpa-roas-definitions.md` pre-derived:

```
CPA  = total_spend / attribution_points        (column: cost_per_acquisition)
ROAS = attribution_revenue / total_spend       (column: return_on_advertising_spend)
Grain: (date_month, utm_source)
```

The agent read this page (via `KTX_PROJECT_DIR=.. ktx wiki list --json`
then plain `Read` on `../wiki/global/cpa-roas-definitions.md`), wrote the
missing `models/cpa_and_roas.sql` directly from it, and `dbt build`
produced the correct table.

### Why the others failed (best guesses, not investigated deeply)

- `provider001`: gold checks `provider` table columns
  `[0,1,2,5,6,7,9,10,11,12,13]` and `specialty_mapping` columns `[0,1]`.
  All 7 tables are produced with the right schema; the tie-break logic for
  "most specific specialty" diverges from gold.
- `asana001`: 95 models materialised, 55 tests passed; the gold compares
  `asana__team [0..9]` and `asana__user [0,1,2]` and our values differ on
  one or more aggregations (open vs completed task counts, avg close
  time).
- `shopify001` and `asset001`: similar pattern — structure right, values
  off.

---

## 8. Hypotheses for the next agent

In rough order of expected impact:

1. **DuckDB connector** (above) so `ktx scan` and `ktx ingest` together
   emit `semantic-layer/<conn>/<source>.yaml` with real columns, types,
   primary keys, sample values, and (if enabled) relationship proposals.
   Expose those to the sub-agent via either:
   - `ktx sl read <source>` calls from Bash, or
   - the `ktx mcp stdio` server attached via `claude --mcp-config`.

2. **Verification step in the system prompt** — currently the agent
   declares success on `dbt build` green. Add: "Before declaring success,
   for every target table run `SELECT * FROM <t> ORDER BY 1 LIMIT 5` and
   sanity-check column count, types, no NaN/NULL in not-null cols, row
   count > 0; also compare the produced column names with the column list
   in the schema.yml / wiki / sl source." Cheap fix; should turn some of
   the value-mismatch fails into passes (or into productive iteration).

3. **dbt run-stage tests** — Spider2 examples often ship `tests/`; have
   the agent run `dbt test` after `dbt build` and treat any new test
   failures as a signal to revise. Some examples actually have
   gold-verifying tests in the project itself.

4. **Try Opus for hard cases** — the orchestrator passes `--model
   sonnet`; flipping to `opus` on the retries of failed instances may
   recover some of the value-mismatch tasks. Cost goes up ~5×.

5. **`ktx scan` query-history off** — currently `query-history` is
   `skipped` because the dbt adapter doesn't expose history. Once a
   warehouse connection exists, leave it skipped (DuckDB has no useful
   history for these one-shot DBs).

6. **Parallelism** — `claude` is rate-limit-sensitive but the orchestrator
   is fully sequential. Two or three workers via `concurrent.futures`
   would cut wall-clock from ~3.5 h to roughly 1–2 h for the full 68.


---

## 9. Resuming after the DuckDB connector lands

Concrete steps for the next agent:

1. **Confirm the connector is wired up**:
   ```bash
   cd /Users/klo-dev/conductor/workspaces/ktx/santiago
   pnpm run build
   pnpm run ktx -- dev schema | jq '.properties.connections.additionalProperties.oneOf[].properties.driver.const' | sort -u
   # should include "duckdb"
   ```

2. **Update the orchestrator's ktx.yaml template** in
   `/Users/klo-dev/work/spider2-ktx/orchestrator.py` (`write_ktx_yaml`).
   Sketch (assumes `ws` is the workspace `Path`):

   ```python
   db_name = discover_duckdb_name(ws / "dbt")   # e.g. "playbook.duckdb"
   # ... the rest of the template stays as it is ...
   connections_yaml = f"""\
   connections:
     dbt_project:
       driver: dbt
       source_dir: {ws}/dbt
     warehouse:
       driver: duckdb
       path: {ws}/dbt/{db_name}
   """
   ```

   Also re-enable scan relationship discovery if it gives useful output:
   ```yaml
   scan:
     enrichment: { mode: deterministic }
     relationships:
       enabled: true
       llmProposals: true
   ```

3. **Verify on a known-passing instance first** (`playbook001`) to make
   sure the dbt+warehouse combo still emits the same wiki pages it did
   before, plus new SL YAML, and the score stays at 1.0:

   ```bash
   cd /Users/klo-dev/work/spider2-ktx
   source .venv/bin/activate
   python orchestrator.py -n playbook001 --force --budget 3 --timeout 1500
   cd Spider2/spider2-dbt/evaluation_suite
   python evaluate.py --result_dir ../../../results --gold_dir ./gold
   ```

4. **Optionally** improve the system prompt in `orchestrator.SYSTEM_PROMPT`
   to instruct the agent to use SL tools:
   - `ktx sl list --json`
   - `ktx sl read <source>`
   - `ktx sl query --connection-id warehouse --measure ...`

5. **Re-run a small batch** with diverse failures (`provider001`,
   `asana001`, `shopify001`, `asset001`) to see whether SL access lifts
   those scores from 0 → 1. If it moves the needle, run the full 68:

   ```bash
   python orchestrator.py --budget 3 --timeout 1500 --evaluate
   ```

   Sequential 68 × ~3 min ≈ 3.5 h, ≈ $31 at the pilot mean of ~$0.46 per
   instance.

6. **Write the result back** — append a section to this doc with the new
   score and a one-line note per failing instance, so we accumulate
   evidence over iterations rather than losing it.

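Step 6 is easy to automate. The sketch below renders such a section from a list of per-instance outcomes; the function name, the outcome dict keys (`instance_id`, `passed`, `note`), and the section layout are assumptions, not an existing orchestrator feature.

```python
def render_results_section(title: str, outcomes: list[dict]) -> str:
    """Build a Markdown section: overall score plus one line per failing instance."""
    passes = sum(1 for o in outcomes if o["passed"])
    lines = [f"## {title}", "", f"Score: **{passes} / {len(outcomes)}**", ""]
    for o in outcomes:
        if not o["passed"]:
            lines.append(f"- `{o['instance_id']}`: {o['note']}")
    return "\n".join(lines) + "\n"


# Hypothetical outcomes, shaped like the pilot notes in section 7.
outcomes = [
    {"instance_id": "playbook001", "passed": True, "note": ""},
    {"instance_id": "provider001", "passed": False, "note": "specialty tie-break diverges from gold"},
]
print(render_results_section("12. Results after DuckDB connector", outcomes))
```

Appending the returned string to this file after each run keeps the per-iteration evidence in one place.
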

---

## 10. Misc references

- KTX MCP tools (see `packages/context/src/mcp/context-tools.ts`):
  `connection_list`, `wiki_search`, `wiki_read`, `sl_read_source`,
  `sl_query`, `entity_details`, `dictionary_search`, `discover_data`,
  `sql_execution`, `memory_ingest`, `memory_ingest_status`.
  `sql_execution` will work for DuckDB once the connector exists; today
  it has no transport for it.
- The sqlite connector at `packages/connector-sqlite/src/connector.ts`
  is the closest template for DuckDB.
- `packages/context/src/ingest/adapters/dbt/` is the dbt adapter that
  generates the wiki pages — `parse.ts` reads `dbt_project.yml`,
  `schema.yml`, models; `chunk.ts` breaks them into work units;
  `dbt.adapter.ts` orchestrates.
- Evaluator code is at
  `Spider2/spider2-dbt/evaluation_suite/{evaluate.py, eval_utils.py}`.
  `duckdb_match` is the only function that matters here.
- Spider2 paper: https://arxiv.org/abs/2411.07763

---

## 11. Quick sanity checks for a fresh agent

```bash
# Toolchain
which node pnpm uv claude
source /Users/klo-dev/work/spider2-ktx/.venv/bin/activate && python -c "import dbt, duckdb, anthropic"

# KTX CLI build still works
cd /Users/klo-dev/conductor/workspaces/ktx/santiago
pnpm run ktx -- --help

# Orchestrator runnable
cd /Users/klo-dev/work/spider2-ktx
source .venv/bin/activate
python orchestrator.py -h

# A previous result still scores 1.0
cd Spider2/spider2-dbt/evaluation_suite
python evaluate.py --result_dir ../../../results --gold_dir ./gold
# expects: 0.2 1 5 (current state)
```

If any of those fail before you do anything else, the environment has
drifted — fix that before adding the connector.