tinyforge-zero/REPRODUCE.md
Rana Usman 6305ff0f91 Initial release: TinyForge-Zero recipe + mined pairs + reproduction guide
Companion artifact for the paper 'How Far Can an Open Base Model
Self-Improve? Recipes, Limits, and Test-Time Synergy'.

Contents:
- recipe/{train_on_pairs,bootstrap,multi_pair_14b,curriculum_math,eval_raw,eval_plus,confirm}.py
- data/pairs_{7b_40,14b_multi_new60,math_13}.jsonl (released mined pairs)
- controls/mbpp_corrupt_control.py (the +0 negative control)
- docs/{scaling_chart,fig1_headline,fig6_boundary}.png
- REPRODUCE.md (paper claim -> exact command mapping)
2026-05-13 20:43:52 +05:00


# Reproduction Guide
Maps every paper claim → exact command. There are **two replication paths**:
- **Fast path** — use `recipe/train_on_pairs.py` with the released `data/*.jsonl`. Skips the mining stage. Gets you the trained adapter and the headline number in ~30 min on an H100.
- **Full path** — re-run the original research scripts (`bootstrap.py`, `multi_pair_14b.py`, `curriculum_math.py`) end-to-end including the self-mining step. This reproduces the recipe from scratch and verifies the mining is deterministic-ish (modulo sampling).
The fast path is what you want for paper verification. The full path is what you want if you're trying the recipe on a *new* base model.
---
## Environment
Tested on:
- **H100 80GB** (recommended for 14B runs) — Debian 12, CUDA 12.4, driver 570+
- **RTX 6000 Ada 48GB** — sufficient for 7B and 3B runs
```bash
pip install -r requirements.txt
```
Exact stack used in the paper: `torch==2.6.0`, `transformers==4.51.3`, `vllm==0.8.5`, `peft==0.13.0`.
---
## FAST PATH — reproduce headline numbers from released pairs
### Qwen2.5-7B-Base → 25 → 95-112/164 (3-seed range)
```bash
# 1. Baseline (raw-completion eval)
python recipe/eval_raw.py --model Qwen/Qwen2.5-7B --bench humaneval
# Expected: 25/164
# 2. Train on the released 40 pairs (try multiple seeds — small-data variance)
for SEED in 7 13 42; do
  python recipe/train_on_pairs.py \
    --model Qwen/Qwen2.5-7B \
    --pairs data/pairs_7b_40.jsonl \
    --out adapter_7b_seed${SEED} \
    --seed ${SEED} --lora-rank 16 --epochs 2 --lr 1e-4
  python recipe/eval_raw.py \
    --model Qwen/Qwen2.5-7B \
    --adapter adapter_7b_seed${SEED} \
    --bench humaneval
done
# Expected: seed 7 → 104/164, seed 13 → 112/164, seed 42 → 95/164
```
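To put the three seed results on a common scale, the pass counts can be converted to pass rates. The following is a minimal sketch that uses only the numbers quoted in this guide; the names `passes_by_seed`, `pct`, and `summarize` are illustrative and are not part of the released scripts.

```python
# Pass counts quoted in this guide for the 7B 40-pair run (HumanEval, 164 problems).
TOTAL = 164
BASELINE = 25
passes_by_seed = {7: 104, 13: 112, 42: 95}


def pct(n, total):
    """Pass rate as a percentage, rounded to one decimal place."""
    return round(100 * n / total, 1)


def summarize(passes, total):
    """Return min/max/mean pass counts and the matching pass-rate percentages."""
    counts = list(passes.values())
    lo, hi = min(counts), max(counts)
    mean = sum(counts) / len(counts)
    return {
        "min": lo, "max": hi, "mean": round(mean, 1),
        "min_pct": pct(lo, total), "max_pct": pct(hi, total),
        "mean_pct": pct(mean, total),
    }


summary = summarize(passes_by_seed, TOTAL)
print(summary)
print(f"lift over baseline: +{summary['min'] - BASELINE} to +{summary['max'] - BASELINE} problems")
```

With the quoted counts, the range 95-112 corresponds to roughly 57.9% to 68.3% of the 164 problems.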
### Qwen2.5-14B-Base → 132/164 (80.5%) and HumanEval+ 122/164 (74.4%)
The 14B run uses 100 pairs total: the 40 warmup pairs + 60 new aggressive-mined pairs. Concatenate first, then train.
```bash
cat data/pairs_7b_40.jsonl data/pairs_14b_multi_new60.jsonl > /tmp/pairs_14b_100.jsonl
python recipe/train_on_pairs.py \
--model Qwen/Qwen2.5-14B \
--pairs /tmp/pairs_14b_100.jsonl \
--out adapter_14b_multi \
--lora-rank 32 --epochs 2 --lr 1e-4
python recipe/eval_raw.py \
--model Qwen/Qwen2.5-14B \
--adapter adapter_14b_multi \
--bench humaneval
# Expected: 132/164 (80.5%) in the multi-pair eval format
python recipe/eval_plus.py \
--model Qwen/Qwen2.5-14B \
--adapter adapter_14b_multi
# Expected: HumanEval+ 122/164 (74.4%)
```
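Because the 14B training file is built with `cat`, a missing trailing newline in the first file would silently fuse two records into one line. A quick sanity check before training might look like the sketch below. It assumes only that each non-empty line is one JSON value; the helper name `count_jsonl_records` is hypothetical, and nothing is assumed about the fields inside each pair.

```python
import json
from pathlib import Path


def count_jsonl_records(path):
    """Count non-empty lines in a JSONL file; raise ValueError on any invalid line."""
    count = 0
    with open(path, encoding="utf-8") as fh:
        for lineno, line in enumerate(fh, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            try:
                json.loads(line)
            except json.JSONDecodeError as exc:
                raise ValueError(f"{path}:{lineno}: invalid JSON ({exc.msg})") from exc
            count += 1
    return count


if __name__ == "__main__":
    merged = Path("/tmp/pairs_14b_100.jsonl")
    if merged.exists():
        n = count_jsonl_records(merged)
        print(f"{merged}: {n} records (this guide expects 100)")
    else:
        print(f"{merged} not found; run the cat command first")
```

A count other than 100, or a `ValueError`, suggests the concatenation went wrong rather than a problem with the recipe itself.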
### Qwen2.5-3B-Base → GSM8K 32 → 66
```bash
python recipe/train_on_pairs.py \
--model Qwen/Qwen2.5-3B \
--pairs data/pairs_math_13.jsonl \
--out adapter_3b_math \
--lora-rank 16 --epochs 2 --lr 1e-4
# GSM8K eval — uses sympy as the verifier (no oracle math model needed).
# eval_raw.py auto-detects GSM8K format and runs the right verifier.
python recipe/eval_raw.py \
--model Qwen/Qwen2.5-3B \
--adapter adapter_3b_math \
--bench gsm8k
# Expected: 66/100
```
---
## FULL PATH — re-mine from scratch
These reproduce the *mining* step too. Each script does generation → solving → mining → training → eval as one pipeline. They write a `pairs.jsonl` and a `result.json` under `--tag`.
### Self-bootstrap from scratch on Qwen2.5-7B
```bash
python recipe/bootstrap.py \
--model Qwen/Qwen2.5-7B \
--iterations 20 \
--problems_per_iter 16 \
--train_every 10 \
--eval_every 10 \
--tag bs_7b_rerun
# Writes: results/bs_7b_rerun/{pairs.jsonl,ckpt_iter*,eval_log.json,result.json}
# Expected final eval: 25 → 95-112 (seed-dependent)
```
### Aggressive multi-pair mining on Qwen2.5-14B (the 80.5% headline)
```bash
python recipe/multi_pair_14b.py \
--model Qwen/Qwen2.5-14B \
--warmup_pairs_path data/pairs_7b_40.jsonl \
--n_warmup_pairs 40 \
--n_problems 200 \
--n_attempts 8 \
--max_pairs_per_problem 4 \
--lora_rank 32 --epochs 2 --lr 1e-4 \
--tag multi_rerun
# Writes: results/multi_pair/multi_rerun/{pairs_new.jsonl,adapter/,result.json}
# Expected: trained 130-134/164 (~80%)
```
### GSM8K auto-difficulty curriculum on Qwen2.5-3B
```bash
python recipe/curriculum_math.py \
--model Qwen/Qwen2.5-3B \
--iterations 16 \
--tag curr_3b_rerun
# Mines 10-15 curriculum-difficulty pairs, trains, evals.
# Expected: GSM8K 32 → 60-70 (some seed variance)
```
---
## Control experiment (Figure 2)
Verifies the signal is in the *content* of self-mined pairs, not the format. Replaces the mined pairs with mechanically-corrupted external pairs (MBPP-style) that look identical structurally.
```bash
python controls/mbpp_corrupt_control.py \
--model Qwen/Qwen2.5-7B \
--tag mbpp_corrupt_control
# Expected: HumanEval stays at 25/164 (Δ ≈ 0, ± seed noise)
```
---
## Pair-count sweep (Figure 3)
```bash
for N in 10 21 40; do
  head -n $N data/pairs_7b_40.jsonl > /tmp/pairs_$N.jsonl
  python recipe/train_on_pairs.py \
    --model Qwen/Qwen2.5-7B \
    --pairs /tmp/pairs_$N.jsonl \
    --out adapter_n$N --epochs 2
  python recipe/eval_raw.py \
    --model Qwen/Qwen2.5-7B --adapter adapter_n$N --bench humaneval
done
# Expected: n=10 → ~51, n=21 → 86-95, n=40 → 95-112 (seed-dependent for small N)
```
---
## Boundary conditions to verify (paper §3)
| Claim | Hint | Expected |
|-------|------|----------|
| Qwen3-8B saturated on HE | Run multi_pair_14b.py with `--model Qwen/Qwen3-8B-Base` | Base 132, adapter ≈ 118-133; no clean lift |
| Qwen2.5-72B saturated | Same on 72B with 10 pairs | Base 83 → trained 73 (-10) |
| MATH-500 distribution mismatch | Mining on simple problems + MATH-500 eval | Base 279/500 → trained 239/500 (-40) |
| Self-correction over-correction | Train on wrong→fix triples only, no right→stays-right | Base 299/500 → trained 69/500 (-230) |
| BCB-Hard distribution mismatch | Apply 7B 40-pair adapter, eval on BCB-Hard | No transfer |
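
The regressions in the table are signed differences, trained minus base. A small sketch of that arithmetic, using only the base and trained counts listed above (the names `rows` and `signed_delta` are illustrative):

```python
# (claim, base score, trained score) taken from the boundary-condition table.
rows = [
    ("Qwen2.5-72B saturated", 83, 73),
    ("MATH-500 distribution mismatch", 279, 239),
    ("Self-correction over-correction", 299, 69),
]


def signed_delta(base, trained):
    """Change in solved problems after training; negative means a regression."""
    return trained - base


for claim, base, trained in rows:
    print(f"{claim}: {base} -> {trained} ({signed_delta(base, trained):+d})")
```

All three deltas are negative, which is what makes these rows boundary conditions rather than additional wins.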
---
## Notes on stochasticity
- **vLLM sampling** is deterministic given a fixed seed, but vLLM 0.8.x occasionally changes pad/EOS handling between point releases. Pin to 0.8.5.
- **LoRA training is seed-sensitive at small N.** The 7B 40-pair run spans 95-112/164 across seeds 7/13/42. The 14B 100-pair run is much tighter (130-134/164).
- **Stop tokens matter.** Use `--stop "\nclass " --stop "\nif __name__"` for raw-completion eval. Wrong stop tokens cut output prematurely and produce artificially low baselines; we hit this earlier in the project (see paper §2).
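
To illustrate why the choice of stop strings changes the score, here is a minimal sketch of stop-sequence truncation on a raw completion. It is not taken from `eval_raw.py`, whose handling of `--stop` is not shown in this guide. The function `truncate_at_stops`, the sample completion, and the deliberately poor `BAD_STOPS` list are all assumptions for illustration.

```python
def truncate_at_stops(completion, stops):
    """Cut a raw completion at the earliest occurrence of any stop string."""
    cut = len(completion)
    for stop in stops:
        idx = completion.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return completion[:cut]


# Stop strings quoted in this guide for raw-completion eval.
GOOD_STOPS = ["\nclass ", "\nif __name__"]
# Illustrative bad choice: blank lines also occur inside function bodies.
BAD_STOPS = ["\n\n"]

# Hypothetical model output: a function body followed by trailing script code.
completion = (
    "    total = 0\n"
    "    for x in xs:\n"
    "        total += x\n"
    "\n"
    "    return total\n"
    "\nif __name__ == '__main__':\n"
    "    print(add_all([1, 2]))\n"
)

print(repr(truncate_at_stops(completion, GOOD_STOPS)))
print(repr(truncate_at_stops(completion, BAD_STOPS)))
```

With the quoted stop strings the function body survives intact and only the trailing script code is dropped. With the blank-line stop, the body is cut before its `return`, so a correct solution would be scored as a failure.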
---
## Cost reference (May 2026, RunPod)
| Workflow | Hardware | Wall time | Cost |
|----------|----------|-----------|------|
| 7B headline (fast path) | RTX 6000 Ada 48GB | ~30 min | ~$0.50 |
| 14B 80.5% (fast path) | H100 80GB | ~30 min | ~$1.50 |
| 14B 80.5% full path (mining + train) | H100 80GB | ~95 min | ~$3.50 |
| GSM8K 32→66 | RTX 6000 Ada | ~30 min | ~$0.50 |
| Full eval matrix (9 models) | H100 80GB | ~3 hrs | ~$8 |
Total cost to verify all numbers in the paper via the fast path: **under $10**.