mirror of
https://github.com/SakanaAI/doc-to-lora.git
synced 2026-04-25 00:06:20 +02:00
D2L pipeline
Data
You can either download the generated data (recommended, ~100 GB per model) or generate it yourself.
Please see 0-download_data.sh for how to download data for a specific model.
# download training data for all three models (328GB)
uv run bash scripts/main_exp/0-download_data.sh
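Since the full download is large (328 GB per the note above), it can be worth verifying free disk space first. Below is an illustrative pre-check; the `check_space` helper and the example path are not part of the repository, and the space figure comes from the comment above.

```shell
# check that the filesystem holding your data directory has enough free space
# before downloading; check_space and the example path are illustrative helpers,
# not part of the repo (requires GNU df for --output)
check_space() {
  required_gb=$1
  avail_gb=$(df -BG --output=avail "${2:-.}" | tail -n 1 | tr -dc '0-9')
  if [ "$avail_gb" -lt "$required_gb" ]; then
    echo "insufficient: ${avail_gb}GB free, need ${required_gb}GB" >&2
    return 1
  fi
  echo "ok: ${avail_gb}GB free"
}

check_space 1 .  # e.g. check_space 328 /path/to/data before 0-download_data.sh
```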
Generating the data from scratch can take a very long time unless it is parallelized across multiple GPUs.
# optional: generate the training data from scratch instead of downloading it
# uv run bash scripts/main_exp/gen_data.sh
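One common way to parallelize such generation is to pin one chunk of the workload to each GPU via `CUDA_VISIBLE_DEVICES`. This is only a sketch: gen_data.sh's actual interface is not documented here, so the chunk-index arguments below are hypothetical.

```shell
# hypothetical parallel launch: one data chunk per GPU
# (the "$gpu" "$NUM_GPUS" arguments to gen_data.sh are illustrative,
#  not the script's documented interface -- check the script before using)
NUM_GPUS=8
for gpu in $(seq 0 $((NUM_GPUS - 1))); do
  CUDA_VISIBLE_DEVICES=$gpu \
    uv run bash scripts/main_exp/gen_data.sh "$gpu" "$NUM_GPUS" \
    > "gen_chunk_${gpu}.log" 2>&1 &
done
wait  # block until all chunks finish
```

Each worker sees exactly one GPU, and per-chunk logs make it easy to spot a failed shard.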
Training
Simply run the training script once the data is ready.
# train
uv run bash scripts/main_exp/1-train.sh
Evaluation
All evaluation scripts for reproducing the main results in the paper are included in the eval directory.