D2L pipeline

Data

You can either download the pre-generated data (recommended, ~100 GB per model) or generate it yourself. See 0-download_data.sh for model-specific download instructions.

# download training data for all three models (328 GB)
uv run bash scripts/main_exp/0-download_data.sh
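Since the full download is large, it can be worth checking free disk space in the target directory first. A minimal sketch (the 328 GB figure comes from the comment above; `df -Pk` and `awk` are standard POSIX tools):

```shell
# Check that the current directory has room for the full download.
REQUIRED_GB=328
# df -Pk: portable output, 1 KiB blocks; column 4 is available space.
AVAIL_GB=$(df -Pk . | awk 'NR==2 {print int($4 / 1024 / 1024)}')
if [ "$AVAIL_GB" -lt "$REQUIRED_GB" ]; then
  echo "Need ${REQUIRED_GB} GB free, only ${AVAIL_GB} GB available" >&2
else
  echo "Disk check passed: ${AVAIL_GB} GB available"
fi
```

Adjust REQUIRED_GB down to ~100 if you only download data for a single model.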

Generating the data from scratch can take a very long time if it is not parallelized across multiple GPUs.

# optional: generate training data from scratch instead of downloading it
# uv run bash scripts/main_exp/gen_data.sh
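If you do generate from scratch, one common pattern is to run one shard per GPU. The sketch below only prints the per-GPU commands rather than launching them; SHARD and NUM_SHARDS are hypothetical knobs, so check gen_data.sh for its actual interface before running anything:

```shell
# Hypothetical sketch: print one generation command per GPU shard.
# SHARD / NUM_SHARDS are assumed environment knobs, not confirmed by
# gen_data.sh itself -- inspect the script before launching for real.
NUM_SHARDS=4
CMDS=""
for i in $(seq 0 $((NUM_SHARDS - 1))); do
  CMD="CUDA_VISIBLE_DEVICES=$i SHARD=$i NUM_SHARDS=$NUM_SHARDS uv run bash scripts/main_exp/gen_data.sh"
  echo "$CMD"
  CMDS="$CMDS$CMD
"
done
```

To actually launch, run each printed command with a trailing `&` and finish with `wait` so all shards complete before training.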

Training

Simply run the training script once the data is ready.

# train
uv run bash scripts/main_exp/1-train.sh

Evaluation

All evaluation scripts for reproducing the main results in the paper are included in the eval directory.