Ship every paper-referenced experiment script

Reorganizes the repo so every section of the paper has a corresponding script. Previously only the core recipe + control + evals were here.

New subdirs:
- tts/: test-time sampling (§2.2, §3.3): scaling sweep, HE, MATH-500, AIME, 14B-recipe + TTS, 8B-raw-TTS control.
- experiments/: every §3 finding as a runnable script:
    · self_consistency (§3.4)
    · recipe_x_tts_synergy (§3.5, novel)
    · mbpp_seeded_cross_arch (§3.9)
    · cross_domain_code_to_math (§3.10)
    · self_correction_math_{naive,fixed} (§3.10, the catastrophic-then-recovered case)
    · math500_seeded_mining (§3.10 distribution mismatch)
    · bcb_hard_eval (§3.10 distribution mismatch)
    · recursive_bootstrap (§3.10 plateau)
    · diversity_cued_mining (§3.10 low yield)
    · aime_scaling (TTS curve)
    · star_baseline_gsm8k (related-work baseline)
- evals/: moved out of recipe/ (eval_raw, eval_plus, confirm)

Also adds: bootstrap_14b_4bit_harvest, curriculum_code, math_bootstrap to recipe/ for completeness.

REPRODUCE.md now maps each paper section / table / figure to its exact script and expected output.
2026-06-08 20:55:13 +02:00 · 2026-05-13 21:09:54 +05:00 · 826f934d2e
commit 826f934d2e
parent c867697f7c
27 changed files with 4467 additions and 134 deletions
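
Because this commit moves the eval harnesses out of `recipe/`, any saved command or notebook cell that still calls `recipe/eval_raw.py`, `recipe/eval_plus.py`, or `recipe/confirm.py` will no longer resolve. The following is a minimal sketch of how such invocations could be rewritten. It assumes only the three moves named in the commit message; `MOVED_TO_EVALS`, `migrate_script_path`, and `migrate_command` are hypothetical helper names and are not part of the repository.

```python
from pathlib import PurePosixPath

# Scripts the commit message lists as moved from recipe/ to evals/.
# This set is an assumption drawn from that message, not read from the repo.
MOVED_TO_EVALS = {"eval_raw.py", "eval_plus.py", "confirm.py"}


def migrate_script_path(path: str) -> str:
    """Return the post-commit location of a script path (unchanged if not moved)."""
    p = PurePosixPath(path)
    if p.parent == PurePosixPath("recipe") and p.name in MOVED_TO_EVALS:
        return str(PurePosixPath("evals") / p.name)
    return path


def migrate_command(command: str) -> str:
    """Rewrite each whitespace-separated token that names a moved script.

    Note: this joins tokens with single spaces, so original line breaks
    and repeated spaces are not preserved.
    """
    return " ".join(migrate_script_path(tok) for tok in command.split())


if __name__ == "__main__":
    old = "python recipe/eval_raw.py --model Qwen/Qwen2.5-7B --bench humaneval"
    print(migrate_command(old))
```

Scripts that stayed in `recipe/`, such as `train_on_pairs.py`, pass through unchanged, so the helper can be applied to a whole list of old commands without filtering first.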
--- a/README.md
+++ b/README.md
@@ -37,29 +37,59 @@ A control experiment — replacing the mined pairs with **identically-formatted

 ```
 tinyforge-zero/
-├── recipe/
-│   ├── train_on_pairs.py       # Fast-path: train LoRA on a released pairs.jsonl
-│   ├── bootstrap.py            # Full-path: self-bootstrap pipeline (mining + train, 7B / 3B)
-│   ├── multi_pair_14b.py       # Full-path: aggressive multi-pair variant → 80.5% on 14B
-│   ├── curriculum_math.py      # Full-path: auto-difficulty curriculum for GSM8K
-│   ├── eval_raw.py             # HumanEval / MBPP / GSM8K eval (vLLM, raw-completion)
-│   ├── eval_plus.py            # HumanEval+ contamination-resistant eval
-│   └── confirm.py              # Confirmation re-eval against base
-├── data/
-│   ├── pairs_7b_40.jsonl              # 40 self-mined pairs (Qwen2.5-7B-Base run)
-│   ├── pairs_14b_multi_new60.jsonl    # 60 aggressive-mined pairs for 14B (+ warmup 40 → 100 total)
-│   └── pairs_math_13.jsonl            # 13 curriculum-mined math pairs (Qwen2.5-3B-Base → GSM8K 32→66)
+├── recipe/                                  # Training pipelines
+│   ├── train_on_pairs.py                    # Fast-path: train LoRA on a released pairs.jsonl
+│   ├── bootstrap.py                         # Self-bootstrap pipeline (mining + train, 7B / 3B)
+│   ├── bootstrap_14b_4bit_harvest.py        # 4-bit harvest variant (when full-precision OOMs)
+│   ├── multi_pair_14b.py                    # Aggressive multi-pair variant → 80.5% on 14B
+│   ├── curriculum_math.py                   # Auto-difficulty curriculum for GSM8K (§2.3, §3.8)
+│   ├── curriculum_code.py                   # Auto-difficulty curriculum for code
+│   └── math_bootstrap.py                    # Vanilla math bootstrap (regressed; see §3.8)
+├── evals/                                   # Evaluation harnesses
+│   ├── eval_raw.py                          # HumanEval / MBPP / GSM8K (vLLM, raw-completion)
+│   ├── eval_plus.py                         # HumanEval+ contamination-resistant eval
+│   └── confirm.py                           # Confirmation re-eval against base
+├── tts/                                     # Test-time sampling (§2.2, §3.3)
+│   ├── tts_scaling.py                       # Pass@N scaling sweep (HE, HE+, MATH-500)
+│   ├── tts_humaneval.py                     # Best-of-N pass@1 on HE/HE+
+│   ├── tts_math500.py                       # Best-of-N pass@1 on MATH-500
+│   ├── tts_aime.py                          # Pass@k curve on AIME (k=1..64)
+│   ├── tts_qwen14b_recipe.py                # TTS on top of the 14B multi-pair adapter
+│   └── tts_qwen3_8b_raw_control.py          # Control: TTS on raw Qwen3-8B (recipe vs sampling)
+├── experiments/                             # Every paper experiment, one script each
+│   ├── self_consistency.py                  # §3.4 — deployable TTS via majority vote (no oracle)
+│   ├── recipe_x_tts_synergy.py              # §3.5 — recipe × TTS synergy threshold (novel finding)
+│   ├── cross_domain_code_to_math.py         # §3.10 — code-trained recipe on math (+2, marginal)
+│   ├── mbpp_seeded_cross_arch.py            # §3.9 — Llama/Coder cross-architecture self-mining
+│   ├── diversity_cued_mining.py             # §3.10 — diversity-cued mining (low yield)
+│   ├── recursive_bootstrap.py               # §3.10 — recursive iter1→iter2→iter3 (plateau)
+│   ├── self_correction_code.py              # §3.10 — code self-correction recipe
+│   ├── self_correction_math_naive.py        # §3.10 — naive (wrong→fix only): catastrophic regress
+│   ├── self_correction_math_fixed.py        # §3.10 — fixed (mixed positives): recovered
+│   ├── math500_seeded_mining.py             # §3.10 — distribution-mismatch demo (catastrophic)
+│   ├── aime_scaling.py                      # AIME pass@k = 1..64 sweep
+│   ├── bcb_hard_eval.py                     # §3.10 — BigCodeBench-Hard distribution mismatch
+│   └── star_baseline_gsm8k.py               # Related-work baseline (STaR / rejection sampling FT)
 ├── controls/
-│   └── mbpp_corrupt_control.py # The +0 negative-control experiment
+│   └── mbpp_corrupt_control.py              # §3.6 — the +0 negative-control experiment
+├── data/                                    # Released mined pairs (drove paper numbers)
+│   ├── pairs_7b_40.jsonl                    # 40 pairs for Qwen2.5-7B-Base
+│   ├── pairs_14b_multi_new60.jsonl          # 60 aggressive-mined pairs for 14B (+ warmup 40 = 100)
+│   └── pairs_math_13.jsonl                  # 13 curriculum-mined math pairs (3B GSM8K)
 ├── docs/
-│   ├── scaling_chart.png       # Recipe lift vs base capability (paper Fig 1)
-│   ├── fig1_headline.png       # Headline result chart
-│   └── fig6_boundary.png       # Boundary conditions across 9 models
-├── REPRODUCE.md                # Paper figure/table → exact command mapping
+│   ├── recipe_diagram.png                   # The 5-stage recipe diagram (rendered above)
+│   ├── scaling_chart.png                    # Recipe lift vs base capability (paper Fig 1)
+│   ├── fig1_headline.png                    # Headline result chart
+│   └── fig6_boundary.png                    # Boundary conditions across 9 models
+├── scripts/
+│   └── make_recipe_diagram.py               # Source for the rendered recipe diagram
+├── REPRODUCE.md                             # Paper claim → exact command mapping (all sections)
 ├── requirements.txt
 └── LICENSE
 ```

+A note on these scripts: `recipe/`, `evals/`, and `controls/` are the clean replication paths — these have argparse CLIs and produce the headline numbers. The scripts under `experiments/` and `tts/` are the **original research scripts** used to produce each figure / table in the paper. They work, but they're closer to "research code" than "production tooling" — argument names vary, some have hard-coded paths to `/workspace/`, and they were each run on RunPod with a specific GPU. Read the top-of-file docstring of any experiment script for what it does and how to invoke it.
+
 ---

 ## Quickstart
@@ -73,7 +103,7 @@ cd tinyforge-zero
 pip install -r requirements.txt

 # 3. Baseline the model (so you know the lift is real)
-python recipe/eval_raw.py \
+python evals/eval_raw.py \
    --model Qwen/Qwen2.5-7B \
    --bench humaneval

@@ -85,7 +115,7 @@ python recipe/train_on_pairs.py \
    --out adapter_7b --seed 13

 # 5. Evaluate the trained adapter
-python recipe/eval_raw.py \
+python evals/eval_raw.py \
    --model Qwen/Qwen2.5-7B \
    --adapter adapter_7b \
    --bench humaneval