mirror of
https://github.com/SakanaAI/doc-to-lora.git
synced 2026-04-25 00:06:20 +02:00
D2L pipeline
Data
You can either download the generated data (recommended, ~100 GB per model) or generate it yourself.
Please see 0-download_data.sh for how to download data for a specific model.
# download training data for all three models (328GB)
uv run bash scripts/main_exp/0-download_data.sh
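Since the full download is large (328 GB per the note above), it can be worth verifying free disk space first. Below is an illustrative pre-check; the `check_space` helper and the example path are not part of the repository, and the space figure comes from the comment above.

```shell
# check that the filesystem holding your data directory has enough free space
# before downloading; check_space and the example path are illustrative helpers,
# not part of the repo (requires GNU df for --output)
check_space() {
  required_gb=$1
  avail_gb=$(df -BG --output=avail "${2:-.}" | tail -n 1 | tr -dc '0-9')
  if [ "$avail_gb" -lt "$required_gb" ]; then
    echo "insufficient: ${avail_gb}GB free, need ${required_gb}GB" >&2
    return 1
  fi
  echo "ok: ${avail_gb}GB free"
}

check_space 1 .  # e.g. check_space 328 /path/to/data before 0-download_data.sh
```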
Generating the data from scratch can take a very long time unless it is parallelized across multiple GPUs.
# optional: generate the training data from scratch instead of downloading it
# uv run bash scripts/main_exp/gen_data.sh
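One common way to parallelize such generation is to pin one chunk of the workload to each GPU via `CUDA_VISIBLE_DEVICES`. This is only a sketch: gen_data.sh's actual interface is not documented here, so the chunk-index arguments below are hypothetical.

```shell
# hypothetical parallel launch: one data chunk per GPU
# (the "$gpu" "$NUM_GPUS" arguments to gen_data.sh are illustrative,
#  not the script's documented interface -- check the script before using)
NUM_GPUS=8
for gpu in $(seq 0 $((NUM_GPUS - 1))); do
  CUDA_VISIBLE_DEVICES=$gpu \
    uv run bash scripts/main_exp/gen_data.sh "$gpu" "$NUM_GPUS" \
    > "gen_chunk_${gpu}.log" 2>&1 &
done
wait  # block until all chunks finish
```

Each worker sees exactly one GPU, and per-chunk logs make it easy to spot a failed shard.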
Training
Simply run the training script once the data is ready.
# train
uv run bash scripts/main_exp/1-train.sh
Evaluation
All evaluation scripts for reproducing the main results in the paper are included in the eval directory.