Ship every paper-referenced experiment script

Reorganizes the repo so every section of the paper has a corresponding script.
Previously only the core recipe + control + evals were here.

New subdirs:
- tts/ — test-time sampling (§2.2, §3.3): scaling sweep, HE, MATH-500, AIME, 14B-recipe + TTS, 8B-raw-TTS control.
- experiments/ — every §3 finding as a runnable script:
    · self_consistency (§3.4)
    · recipe_x_tts_synergy (§3.5, novel)
    · mbpp_seeded_cross_arch (§3.9)
    · cross_domain_code_to_math (§3.10)
    · self_correction_math_{naive,fixed} (§3.10, the catastrophic-then-recovered case)
    · math500_seeded_mining (§3.10 distribution mismatch)
    · bcb_hard_eval (§3.10 distribution mismatch)
    · recursive_bootstrap (§3.10 plateau)
    · diversity_cued_mining (§3.10 low yield)
    · aime_scaling (TTS curve)
    · star_baseline_gsm8k (related-work baseline)
- evals/ — moved out of recipe/ (eval_raw, eval_plus, confirm)

Also adds: bootstrap_14b_4bit_harvest, curriculum_code, math_bootstrap to recipe/ for completeness.

REPRODUCE.md now maps each paper section / table / figure to its exact script and expected output.
2026-06-11 21:05:12 +02:00 · 2026-05-13 21:09:54 +05:00 · 826f934d2e
commit 826f934d2e
parent c867697f7c
27 changed files with 4467 additions and 134 deletions
--- a/REPRODUCE.md
+++ b/REPRODUCE.md
@@ -1,154 +1,151 @@
 # Reproduction Guide

-Maps every paper claim → exact command. There are **two replication paths**:
+Maps every paper claim → the script that produced it. Two replication paths:

-- **Fast path** — use `recipe/train_on_pairs.py` with the released `data/*.jsonl`. Skips the mining stage. Gets you the trained adapter and the headline number in ~30 min on an H100.
-- **Full path** — re-run the original research scripts (`bootstrap.py`, `multi_pair_14b.py`, `curriculum_math.py`) end-to-end including the self-mining step. This reproduces the recipe from scratch and verifies the mining is deterministic-ish (modulo sampling).
+- **Fast path** — use `recipe/train_on_pairs.py` with `data/*.jsonl`. Reproduces the trained adapter and headline number in ~30 min on H100. Recommended for paper verification.
+- **Full path** — re-run the original research scripts end-to-end including the self-mining stage. Use this if applying the recipe to a *new* base model.

-The fast path is what you want for paper verification. The full path is what you want if you're trying the recipe on a *new* base model.
+A note on script conventions: scripts under `recipe/`, `evals/`, and `controls/` are clean replication paths (argparse CLIs, no hardcoded paths). Scripts under `experiments/` and `tts/` are the original research code used to produce each finding. They work, but they write `--tag`-style outputs and sometimes assume `/workspace/` paths (set via the `HF_HOME` env var). Read each script's top-of-file docstring for the exact invocation.

 ---

 ## Environment

 Tested on:
-- **H100 80GB** (recommended for 14B runs) — Debian 12, CUDA 12.4, driver 570+
-- **RTX 6000 Ada 48GB** — sufficient for 7B and 3B runs
+- **H100 80GB** — Debian 12, CUDA 12.4, driver 570+ (required for vLLM 0.8.5)
+- **RTX 6000 Ada 48GB** — sufficient for ≤7B models

 ```bash
 pip install -r requirements.txt
 ```

-Exact stack used in the paper: `torch==2.6.0`, `transformers==4.51.3`, `vllm==0.8.5`, `peft==0.13.0`.
+Pinned stack: `torch==2.6.0`, `transformers==4.51.3`, `vllm==0.8.5`, `peft==0.13.0`.

 ---

-## FAST PATH — reproduce headline numbers from released pairs
+# Mapping: paper claim → script

-### Qwen2.5-7B-Base → 25 → 95–112/164 (3-seed range)
+## §2 Method
+
+| Paper § | Method | Script | Notes |
+|---|---|---|---|
+| §2.1 | Self-bootstrap pipeline (code) | `recipe/bootstrap.py` | Generation → solving → mining → train, end-to-end |
+| §2.1 | 4-bit harvest for large models | `recipe/bootstrap_14b_4bit_harvest.py` | NF4 quantization, harvest-only (no in-loop training) |
+| §2.1 | Aggressive multi-pair mining | `recipe/multi_pair_14b.py` | The 14B 80.5% pipeline |
+| §2.2 | Test-time sampling (oracle) | `tts/tts_scaling.py` | Pass@N for HE / HE+ / MATH-500 |
+| §2.3 | Auto-difficulty curriculum (math) | `recipe/curriculum_math.py` | The GSM8K 32→66 pipeline |
+| §2.3 | Auto-difficulty curriculum (code) | `recipe/curriculum_code.py` | Code variant |
+
+---
+
+## §3 Experiments
+
+### §3.2 Recipe alone — HumanEval and HumanEval+
+
+| Claim (paper Table 1) | Script + command |
+|---|---|
+| Qwen2.5-7B-Base: 25 → 112 (+87 best seed) | Fast path: `python recipe/train_on_pairs.py --model Qwen/Qwen2.5-7B --pairs data/pairs_7b_40.jsonl --seed 13 --lora-rank 16 --out adapter_7b_seed13` then `python evals/eval_raw.py --model Qwen/Qwen2.5-7B --adapter adapter_7b_seed13 --bench humaneval` |
+| Qwen2.5-14B-Base: 44 → 131 / 80% on HE, 122/164 on HE+ | `cat data/pairs_7b_40.jsonl data/pairs_14b_multi_new60.jsonl > /tmp/14b.jsonl; python recipe/train_on_pairs.py --model Qwen/Qwen2.5-14B --pairs /tmp/14b.jsonl --lora-rank 32 --out adapter_14b_multi; python evals/eval_plus.py --model Qwen/Qwen2.5-14B --adapter adapter_14b_multi` |
+| Multi-pair full path (re-mine + train) | `python recipe/multi_pair_14b.py --model Qwen/Qwen2.5-14B --warmup_pairs_path data/pairs_7b_40.jsonl --n_problems 200 --n_attempts 8 --max_pairs_per_problem 4 --lora_rank 32 --tag multi_rerun` |
+| Boundary table for all 9 models | `python evals/eval_raw.py --model <each>` for baseline; recipe + re-eval per model. Cost: ~3 hr H100. |
+
+### §3.3 Test-time sampling (TTS) alone
+
+| Claim | Script | Expected |
+|---|---|---|
+| Qwen3-4B best-of-8 HE oracle = 92.7% | `python tts/tts_humaneval.py --model Qwen/Qwen3-4B-Base --n 8 --temperature 0.7` | 152/164 |
+| Qwen3-8B best-of-8 HE oracle = 92.1% | `python tts/tts_humaneval.py --model Qwen/Qwen3-8B-Base --n 8 --temperature 0.7` | 151/164 |
+| Qwen3-4B best-of-8 MATH-500 = 79.4% | `python tts/tts_math500.py --model Qwen/Qwen3-4B-Base --n 8` | 397/500 |
+| Qwen3-8B best-of-8 MATH-500 = 81.0% | `python tts/tts_math500.py --model Qwen/Qwen3-8B-Base --n 8` | 405/500 |
+| AIME pass@k curve (k=1..64) | `python tts/tts_aime.py --model Qwen/Qwen3-8B-Base --n 32` | 25.6 / 38.9% best-of-32 |
+| Full TTS scaling sweep (Table 2) | `python tts/tts_scaling.py --model Qwen/Qwen3-4B-Base` |  |
+
+### §3.4 Self-consistency (deployable TTS, no oracle)

 ```bash
-# 1. Baseline (raw-completion eval)
-python recipe/eval_raw.py --model Qwen/Qwen2.5-7B --bench humaneval
-# Expected: 25/164
-
-# 2. Train on the released 40 pairs (try multiple seeds — small-data variance)
-for SEED in 7 13 42; do
-    python recipe/train_on_pairs.py \
-        --model Qwen/Qwen2.5-7B \
-        --pairs data/pairs_7b_40.jsonl \
-        --out adapter_7b_seed${SEED} \
-        --seed ${SEED} --lora-rank 16 --epochs 2 --lr 1e-4
-    python recipe/eval_raw.py \
-        --model Qwen/Qwen2.5-7B \
-        --adapter adapter_7b_seed${SEED} \
-        --bench humaneval
-done
-# Expected: seed 7 → 104/164, seed 13 → 112/164, seed 42 → 95/164
+python experiments/self_consistency.py \
+    --model Qwen/Qwen3-4B-Base \
+    --bench gsm8k --n 8
 ```
+Tests whether majority-vote selection, which needs no oracle access, matches oracle pass@N. See paper Table 3.

-### Qwen2.5-14B-Base → 132/164 (80.5%) and HumanEval+ 122/164 (74.4%)
-
-The 14B run uses 100 pairs total: the 40 warmup pairs + 60 new aggressive-mined pairs. Concatenate first, then train.
+### §3.5 Recipe × TTS synergy threshold (novel finding)

 ```bash
-cat data/pairs_7b_40.jsonl data/pairs_14b_multi_new60.jsonl > /tmp/pairs_14b_100.jsonl
-
-python recipe/train_on_pairs.py \
-    --model Qwen/Qwen2.5-14B \
-    --pairs /tmp/pairs_14b_100.jsonl \
-    --out adapter_14b_multi \
-    --lora-rank 32 --epochs 2 --lr 1e-4
-
-python recipe/eval_raw.py \
-    --model Qwen/Qwen2.5-14B \
+python experiments/recipe_x_tts_synergy.py \
+    --base-model Qwen/Qwen2.5-14B \
    --adapter adapter_14b_multi \
-    --bench humaneval
-# Expected: 132/164 (80.5%) in the multi-pair eval format
-
-python recipe/eval_plus.py \
-    --model Qwen/Qwen2.5-14B \
-    --adapter adapter_14b_multi
-# Expected: HumanEval+ 122/164 (74.4%)
+    --n 8
 ```
+Compares four settings: raw base | raw base + TTS | recipe-trained | recipe-trained + TTS. The novel finding: with enough mined pairs, recipe-trained + TTS beats raw + TTS (+12.8pp); with too few, it falls below raw + TTS (−4.9pp on Qwen2.5-3B with 36 pairs).

-### Qwen2.5-3B-Base → GSM8K 32 → 66
+### §3.6 Control: format alone does not explain the lift

 ```bash
-python recipe/train_on_pairs.py \
-    --model Qwen/Qwen2.5-3B \
-    --pairs data/pairs_math_13.jsonl \
-    --out adapter_3b_math \
-    --lora-rank 16 --epochs 2 --lr 1e-4
-
-# GSM8K eval — uses sympy as the verifier (no oracle math model needed).
-# eval_raw.py auto-detects GSM8K format and runs the right verifier.
-python recipe/eval_raw.py \
-    --model Qwen/Qwen2.5-3B \
-    --adapter adapter_3b_math \
-    --bench gsm8k
-# Expected: 66/100
-```
-
----
-
-## FULL PATH — re-mine from scratch
-
-These reproduce the *mining* step too. Each script does generation → solving → mining → training → eval as one pipeline. They write a `pairs.jsonl` and a `result.json` under `--tag`.
-
-### Self-bootstrap from scratch on Qwen2.5-7B
-
-```bash
-python recipe/bootstrap.py \
+python controls/mbpp_corrupt_control.py \
    --model Qwen/Qwen2.5-7B \
-    --iterations 20 \
-    --problems_per_iter 16 \
-    --train_every 10 \
-    --eval_every 10 \
-    --tag bs_7b_rerun
-# Writes: results/bs_7b_rerun/{pairs.jsonl,ckpt_iter*,eval_log.json,result.json}
-# Expected final eval: 25 → 95–112 (seed-dependent)
+    --tag mbpp_corrupt_control
 ```
+Expected: HumanEval stays at 25/164 (Δ = 0). Confirms the signal is in self-mined content, not pair-formatted training data.

-### Aggressive multi-pair mining on Qwen2.5-14B (the 80.5% headline)
+### §3.7 Multi-pair mining at 14B (the 80.5% headline)

 ```bash
 python recipe/multi_pair_14b.py \
    --model Qwen/Qwen2.5-14B \
    --warmup_pairs_path data/pairs_7b_40.jsonl \
-    --n_warmup_pairs 40 \
-    --n_problems 200 \
-    --n_attempts 8 \
-    --max_pairs_per_problem 4 \
-    --lora_rank 32 --epochs 2 --lr 1e-4 \
+    --n_problems 200 --n_attempts 8 \
+    --max_pairs_per_problem 4 --lora_rank 32 \
    --tag multi_rerun
-# Writes: results/multi_pair/multi_rerun/{pairs_new.jsonl,adapter/,result.json}
-# Expected: trained 130–134/164 (~80%)
 ```
+Expected: base 67/164 → trained 132/164 (multi-pair eval format) / 131/164 chat-template / 122/164 HE+.

-### GSM8K auto-difficulty curriculum on Qwen2.5-3B
+### §3.8 Math: auto-difficulty curriculum

 ```bash
 python recipe/curriculum_math.py \
    --model Qwen/Qwen2.5-3B \
    --iterations 16 \
    --tag curr_3b_rerun
-# Mines 10–15 curriculum-difficulty pairs, trains, evals.
-# Expected: GSM8K 32 → 60–70 (some seed variance)
 ```
+Expected: GSM8K 32/100 → 66/100. Compare to `recipe/math_bootstrap.py` (vanilla, no curriculum), which regresses.
+
+### §3.9 Cross-architecture and cross-generation
+
+| Model | Script | Expected |
+|---|---|---|
+| Llama-3.2-3B (own-mined 32) | `python experiments/mbpp_seeded_cross_arch.py --model meta-llama/Llama-3.2-3B` | HE 39→43 (+4) |
+| Qwen2.5-Coder-7B-Base | `python experiments/mbpp_seeded_cross_arch.py --model Qwen/Qwen2.5-Coder-7B` | HE 83→87 (+4), MBPP 122→124 (+2) |
+| Qwen3-4B-Base | Same script, Qwen3-4B-Base | HE 79→106 (+27), MBPP 135→148 (+13) |
+
+### §3.10 Failure modes and negative results
+
+Each negative finding has its own script. Run any of these to verify the documented failure.
+
+| Failure mode | Script | Expected |
+|---|---|---|
+| Saturation (Qwen3-8B/14B HE) | `python recipe/bootstrap.py --model Qwen/Qwen3-8B-Base --tag sat_check` | 132 → 118–133, no clean lift |
+| BCB-Hard distribution mismatch | `python experiments/bcb_hard_eval.py --model Qwen/Qwen3-8B-Base --adapter adapter_7b_seed13` | No transfer; HE-style pairs don't generalize to library code |
+| MATH-500 mining distribution mismatch | `python experiments/math500_seeded_mining.py --model Qwen/Qwen3-8B-Base` | 279/500 → 239/500 (−40, catastrophic) |
+| Self-correction over-correction (naive) | `python experiments/self_correction_math_naive.py --model Qwen/Qwen3-4B-Base` | 299/500 → 69/500 (Δ=−230!) |
+| Self-correction recovery (fixed) | `python experiments/self_correction_math_fixed.py --model Qwen/Qwen3-4B-Base` | Recovers to baseline + small lift via mixed positives |
+| Recursive bootstrap plateau | `python experiments/recursive_bootstrap.py --model Qwen/Qwen2.5-7B --iters 3` | iter1 gives most lift, iter2/3 plateau |
+| Cross-domain transfer (code→math) | `python experiments/cross_domain_code_to_math.py --code-adapter adapter_7b_seed13` | +2 marginal lift on GSM8K |
+| Diversity-cued mining low yield | `python experiments/diversity_cued_mining.py --model Qwen/Qwen2.5-7B` | Fewer well-formed pairs than vanilla mining |

 ---

-## Control experiment (Figure 2)
+## §3.11 Boundary conditions summary (Figure 6)

-Verifies the signal is in the *content* of self-mined pairs, not the format. Replaces the mined pairs with mechanically-corrupted external pairs (MBPP-style) that look identical structurally.
+The 9-model boundary chart is the synthesis of per-model recipe runs. To regenerate:

 ```bash
-python controls/mbpp_corrupt_control.py \
-    --model Qwen/Qwen2.5-7B \
-    --tag mbpp_corrupt_control
-# Expected: HumanEval stays at 25/164 (Δ ≈ 0, ± seed noise)
+for MODEL in Qwen/Qwen2.5-{3B,7B,14B,72B} Qwen/Qwen3-{1.7B,4B,8B,14B}-Base meta-llama/Llama-3.2-3B Qwen/Qwen2.5-Coder-7B allenai/OLMo-2-1124-7B; do
+    python evals/eval_raw.py --model "$MODEL" --bench humaneval  # baseline
+    python recipe/bootstrap.py --model "$MODEL" --tag "boundary_$(echo "$MODEL" | tr '/' '_')"
+done
 ```
+Run time: ~3 hours on a single H100, ~$8 cost.

 ---
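
To make the §3.4 comparison concrete, the following is a minimal sketch of majority-vote selection versus oracle best-of-N on toy data. It is not the logic of `experiments/self_consistency.py`: the function names (`majority_vote`, `oracle_hit`, `score`) and the sample answers are hypothetical, and answers are assumed to be already normalized to comparable strings.

```python
from collections import Counter


def majority_vote(answers):
    """Pick the most frequent answer; ties go to the answer seen first."""
    return Counter(answers).most_common(1)[0][0]


def oracle_hit(answers, gold):
    """Oracle best-of-N: counts as solved if any sample equals the gold answer."""
    return gold in answers


def score(samples, golds):
    """Return (majority-vote correct, oracle correct) over all problems."""
    sc = sum(majority_vote(a) == g for a, g in zip(samples, golds))
    oracle = sum(oracle_hit(a, g) for a, g in zip(samples, golds))
    return sc, oracle


# Toy data (hypothetical): four samples per problem.
samples = [
    ["12", "12", "15", "12"],  # majority is correct
    ["7", "9", "9", "3"],      # gold answer present but outvoted
    ["4", "5", "6", "8"],      # gold answer never sampled
]
golds = ["12", "7", "1"]

sc, oracle = score(samples, golds)
print(f"self-consistency: {sc}/3, oracle: {oracle}/3")
```

The second problem shows the gap the experiment measures: the oracle counts it as solved, while a deployable majority vote does not.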

@@ -161,42 +158,40 @@ for N in 10 21 40; do
        --model Qwen/Qwen2.5-7B \
        --pairs /tmp/pairs_$N.jsonl \
        --out adapter_n$N --epochs 2
-    python recipe/eval_raw.py \
+    python evals/eval_raw.py \
        --model Qwen/Qwen2.5-7B --adapter adapter_n$N --bench humaneval
 done
-# Expected: n=10 → ~51, n=21 → 86–95, n=40 → 95–112 (seed-dependent for small N)
 ```
+Expected: n=10 → ~51, n=21 → mean ~91, n=40 → mean ~105 (seed-dependent for small N).

 ---

-## Boundary conditions to verify (paper §3)
+## Related-work baseline

-| Claim | Hint | Expected |
-|-------|------|----------|
-| Qwen3-8B saturated on HE | Run multi_pair_14b.py with `--model Qwen/Qwen3-8B-Base` | Base 132, adapter ≈ 118–133 — no clean lift |
-| Qwen2.5-72B saturated | Same on 72B with 10 pairs | Base 83 → trained 73 (−10) |
-| MATH-500 distribution mismatch | Mining on simple problems + MATH-500 eval | Base 279/500 → trained 239/500 (−40) |
-| Self-correction over-correction | Train on wrong→fix triples only, no right→stays-right | Base 299/500 → trained 69/500 (−230) |
-| BCB-Hard distribution mismatch | Apply 7B 40-pair adapter, eval on BCB-Hard | No transfer |
+| Method | Script | Use |
+|---|---|---|
+| STaR / rejection-sampling FT on GSM8K | `experiments/star_baseline_gsm8k.py` | Comparison point for the curriculum result |

 ---

-## Notes on stochasticity
+## Notes on stochasticity and reproducibility

-- **vLLM sampling** is deterministic given a fixed seed, but vLLM 0.8.x occasionally changes pad/EOS handling between point releases. Pin to 0.8.5.
-- **LoRA training is seed-sensitive at small N.** The 7B 40-pair run spans 95–112/164 across seeds 7/13/42. The 14B 100-pair run is much tighter (130–134/164).
-- **Stop tokens matter.** Use `--stop "\nclass " --stop "\nif __name__"` for raw-completion eval. Wrong stop tokens cut output prematurely and produce artifactually low baselines. We saw this earlier in the project — see paper §2.
+- **vLLM sampling** is deterministic given a fixed seed, but vLLM 0.8.x can change pad/EOS handling between point releases. Pin to 0.8.5.
+- **LoRA training is seed-sensitive at small N.** 7B 40-pair: 95–112/164 across seeds 7/13/42. 14B 100-pair: 130–134/164 (tighter).
+- **Stop tokens matter.** Use `--stop "\nclass " --stop "\nif __name__"` for raw-completion eval. Wrong stop tokens cut output prematurely and produce artifactually low baselines. We hit this earlier in the project; paper §2 documents the fix.

 ---

 ## Cost reference (May 2026, RunPod)

 | Workflow | Hardware | Wall time | Cost |
-|----------|----------|-----------|------|
+|---|---|---|---|
 | 7B headline (fast path) | RTX 6000 Ada 48GB | ~30 min | ~$0.50 |
 | 14B 80.5% (fast path) | H100 80GB | ~30 min | ~$1.50 |
-| 14B 80.5% full path (mining + train) | H100 80GB | ~95 min | ~$3.50 |
-| GSM8K 32→66 | RTX 6000 Ada | ~30 min | ~$0.50 |
-| Full eval matrix (9 models) | H100 80GB | ~3 hrs | ~$8 |
+| 14B 80.5% full path | H100 80GB | ~95 min | ~$3.50 |
+| GSM8K 32→66 curriculum | RTX 6000 Ada | ~30 min | ~$0.50 |
+| TTS scaling sweep (one model) | H100 80GB | ~30 min | ~$1.50 |
+| Full 9-model boundary chart | H100 80GB | ~3 hrs | ~$8 |
+| Every negative result | mixed | ~5 hrs total | ~$15 |

-Total cost to verify all numbers in the paper via the fast path: **under $10**.
+Verifying all paper numbers via the fast path costs **under $10**. A full reproduction from scratch (including all negative results and the full TTS sweep) costs **~$50**, matching the paper's reported total spend.
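
The pass@k and best-of-N figures in §3.3 can be sanity-checked with the standard unbiased pass@k estimator, 1 − C(n−c, k)/C(n, k), where n is the number of samples drawn and c the number that pass. The sketch below is an illustration only: this guide does not say whether the scripts under `tts/` use this estimator or a plain any-of-N oracle, and the function name `pass_at_k` is hypothetical.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k for one problem: n samples drawn, c of them correct."""
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    # Fewer than k incorrect samples: every size-k subset contains a correct one.
    if n - c < k:
        return 1.0
    # Probability that a random size-k subset contains no correct sample.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts, n, k):
    """Average pass@k over problems, given per-problem correct counts."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# With k == n this reduces to the any-of-N oracle: 1.0 if c > 0, else 0.0.
print(pass_at_k(8, 0, 8), pass_at_k(8, 1, 8))
```

When k equals n the estimator is exactly the oracle best-of-N hit rate, so counts such as 152/164 correspond to the number of problems with at least one passing sample.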