**Self-bootstrapping recipes for open base LLMs — no human-written training data.**
A 14B open base model reaches **80% on HumanEval** and **74.4% on HumanEval+** with only a Python interpreter as oracle and no human-curated training data, for under **$5** of consumer-GPU compute. This repo contains the recipes, mined pairs, evaluation scripts, and adapters from the paper.
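As a rough illustration of what "a Python interpreter as oracle" can mean in practice, the sketch below accepts a candidate solution only if it runs together with its test assertions and exits cleanly. The helper name `passes_oracle`, the timeout value, and the toy snippets are hypothetical and not taken from this repo's scripts; the actual recipe may score or sandbox candidates differently.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def passes_oracle(candidate_src: str, test_src: str, timeout_s: float = 10.0) -> bool:
    """Return True if candidate + tests run to completion with exit code 0.

    Hypothetical helper: the interpreter is the only judge, no human labels.
    """
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate.py"
        script.write_text(candidate_src + "\n\n" + test_src + "\n", encoding="utf-8")
        try:
            proc = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return False  # a hang counts as a failure
        return proc.returncode == 0


# Toy inputs (illustrative only).
GOOD = "def add(a, b):\n    return a + b"
BAD = "def add(a, b):\n    return a - b"
TESTS = "assert add(2, 3) == 5"
```

Calling `passes_oracle(GOOD, TESTS)` should return `True` and `passes_oracle(BAD, TESTS)` should return `False`. Model-generated code is untrusted, so anything beyond a toy run would likely want a container or similar isolation around the subprocess.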
📄 **Paper**: *How Far Can an Open Base Model Self-Improve? Recipes, Limits, and Test-Time Synergy* — arXiv link forthcoming
All numbers come from the `result.json` files in the paper data that accompanies this repo. The same adapter, scored under the multi-pair run's eval format, reads **132/164 (80.5%)**; both figures round to 80%.
A control experiment — replacing the mined pairs with **identically-formatted but randomly-corrupted external pairs** — yields **exactly +0**. The signal is in the self-mined content, not the training-data format.
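One possible way to build a format-preserving control set is sketched below. It assumes each pair is a dict with `prompt` and `completion` keys (a hypothetical schema; the released files may use different field names) and it "corrupts" the data by rotating completions so that no prompt keeps its own. The corruption procedure used in the paper is not described in this README, so this is an illustration of the idea rather than a reproduction of it.

```python
import random


def make_control_pairs(pairs: list[dict], seed: int = 13) -> list[dict]:
    """Keep every record's keys and layout, but misalign prompts and completions.

    Assumes completions are distinct; identical completions could survive a rotation.
    """
    if len(pairs) < 2:
        raise ValueError("need at least two pairs to misalign them")
    rng = random.Random(seed)
    n = len(pairs)
    # A non-zero rotation guarantees that no record keeps its own completion.
    offset = rng.randrange(1, n)
    return [
        {**pair, "completion": pairs[(i + offset) % n]["completion"]}
        for i, pair in enumerate(pairs)
    ]
```

Training on the output of such a function with otherwise unchanged hyperparameters isolates the effect of the format from the effect of the content.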
---
## What's in this repo
```
tinyforge-zero/
├── recipe/
│ ├── train_on_pairs.py # Fast-path: train LoRA on a released pairs.jsonl
```

```bash
# 2. Install (Python 3.10+, CUDA 12.1+, GPU with ≥40GB VRAM recommended)
pip install -r requirements.txt
# 3. Baseline the model (so you know the lift is real)
python recipe/eval_raw.py \
--model Qwen/Qwen2.5-7B \
--bench humaneval
# 4. Train on the released 40 mined pairs (~10 min on H100)
python recipe/train_on_pairs.py \
--model Qwen/Qwen2.5-7B \
--pairs data/pairs_7b_40.jsonl \
--epochs 2 --lr 1e-4 --lora-rank 16 \
--out adapter_7b --seed 13
# 5. Evaluate the trained adapter
python recipe/eval_raw.py \
--model Qwen/Qwen2.5-7B \
--adapter adapter_7b \
--bench humaneval
```
Expected outcome: HumanEval moves from ~25/164 to **~95–112/164** (seed-dependent).
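For readers who prefer percentages, the counts above convert as follows. This is plain arithmetic over HumanEval's 164 problems; the helper name `pass_rate` is ours and is not part of the repo.

```python
def pass_rate(passed: int, total: int = 164) -> float:
    """Percentage of problems solved, rounded to one decimal place."""
    if not 0 <= passed <= total:
        raise ValueError("passed must be between 0 and total")
    return round(100.0 * passed / total, 1)


baseline = pass_rate(25)    # 15.2
low_end = pass_rate(95)     # 57.9
high_end = pass_rate(112)   # 68.3
headline = pass_rate(132)   # 80.5 (the 14B multi-pair run)
```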
For the **14B → 80.5%** run, use `recipe/multi_pair_14b.py` with both `data/pairs_7b_40.jsonl` (warmup) and `data/pairs_14b_multi_new60.jsonl`. See [REPRODUCE.md](REPRODUCE.md) for the exact command and expected hardware.
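Before launching a run, it can help to confirm that the pair files downloaded intact. The sketch below assumes only that each file holds one JSON object per line, as the `.jsonl` extension suggests; it makes no assumption about field names, and `load_jsonl` is a hypothetical helper rather than something shipped in `recipe/`.

```python
import json
from pathlib import Path


def load_jsonl(path):
    """Read one JSON object per line, skipping blank lines."""
    records = []
    with Path(path).open(encoding="utf-8") as handle:
        for line_no, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError as exc:
                raise ValueError(f"{path}:{line_no}: invalid JSON ({exc.msg})") from exc
    return records


# Example usage (paths from this README; run from the repo root):
# warmup = load_jsonl("data/pairs_7b_40.jsonl")
# new = load_jsonl("data/pairs_14b_multi_new60.jsonl")
# print(len(warmup), len(new))
```

The file names suggest 40 and 60 records respectively, though that is an inference from the names and not a documented guarantee.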
---
## Boundary conditions (where the recipe fails)

The recipe works under stated conditions. We document four failure modes:
1. **Saturation**: Qwen3-8B/14B-Base and Qwen2.5-72B-Base have so little headroom on HumanEval that mining produces zero or negative lift.
2. **Distribution mismatch**: Pairs mined on simple problems do not transfer to BigCodeBench-Hard (library code) or MATH-500 (competition math). Ignoring this is catastrophic: see the over-correction case (Qwen3-4B MATH-500 dropped 299 → 69).
3. **Base capability floor**: OLMo-2-7B, at a 5/164 baseline, produces too few "fix" attempts to mine from.
4. **Self-correction trained on wrong→fix only**: the model over-doubts and degrades on outputs that were already correct. Mixing in right→stays-right traces recovers it.
See the paper's §3 for measurements; the boundary chart above shows the recipe's lift across all 9 base models we tested.
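The remedy for the fourth failure mode, mixing right→stays-right traces into the wrong→fix data, could look roughly like the sketch below. The function name `mix_traces`, the `keep_fraction` parameter, and its default of 0.5 are illustrative assumptions; this README does not state the ratio used in the paper.

```python
import random


def mix_traces(fix_traces, keep_traces, keep_fraction=0.5, seed=13):
    """Blend wrong->fix traces with right->stays-right traces.

    keep_fraction is the target share of right->stays-right traces in the
    final set; 0.5 is an illustrative default, not a value from the paper.
    """
    if not 0.0 <= keep_fraction < 1.0:
        raise ValueError("keep_fraction must be in [0, 1)")
    rng = random.Random(seed)
    # Solve n_keep / (n_fix + n_keep) = keep_fraction for n_keep.
    n_keep = round(len(fix_traces) * keep_fraction / (1.0 - keep_fraction))
    n_keep = min(n_keep, len(keep_traces))  # cannot sample more than we have
    mixed = list(fix_traces) + rng.sample(list(keep_traces), n_keep)
    rng.shuffle(mixed)
    return mixed
```

All wrong→fix traces are kept and only the right→stays-right side is subsampled, on the assumption that mined fixes are the scarcer resource.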
---
## Adapters
The LoRA adapter weights for the headline 14B run (the 80.5% adapter) are ~200 MB and are not committed to this repo. They live separately:

| Run | GPU | Wall time | Cost |
|---|---|---|---|
| Qwen2.5-3B GSM8K curriculum | RTX 6000 Ada | ~30 min | <$1 |
| Full eval suite (9 models; HE, HE+, MBPP) | 1× H100 | ~3 hrs | ~$8 |
All runs were on rented consumer/cloud GPUs (RunPod). Total spend documented in the paper was under $50.
---
## Citation
```bibtex
@misc{usman2026tinyforgezero,
title = {How Far Can an Open Base Model Self-Improve?
Recipes, Limits, and Test-Time Synergy},
author = {Rana Usman},
year = {2026},
eprint = {TBD},
archivePrefix = {arXiv},
primaryClass = {cs.AI}
}
```
---
## License
MIT — see [LICENSE](LICENSE). The mined pairs in `data/` are derivatives of base-model outputs (Qwen2.5 family, Apache-2.0). Treat downstream redistribution accordingly.