Doc-to-LoRA release

2026-04-25 08:06:22 +02:00 · 2026-02-27 03:47:04 +00:00 · 1abe8ae16d
commit 1abe8ae16d
92 changed files with 22131 additions and 0 deletions
--- a/scripts/main_exp/README.md
+++ b/scripts/main_exp/README.md
@@ -0,0 +1,25 @@
+# D2L pipeline
+### Data
+You can either download the pre-generated data (recommended, ~100 GB per model) or generate it yourself.
+See [`0-download_data.sh`](0-download_data.sh) for how to download the data for a specific model.
+```bash
+# download training data for all three models (328 GB)
+uv run bash scripts/main_exp/0-download_data.sh
+```
+
+Generating the data from scratch can take a very long time unless it is parallelized across multiple GPUs.
+```bash
+# optional: generate the training data from scratch instead of downloading it
+# (very slow unless parallelized across multiple GPUs)
+# uv run bash scripts/main_exp/gen_data.sh
+```
+
+### Training
+Simply run the training script once the data is ready.
+```bash
+# train
+uv run bash scripts/main_exp/1-train.sh
+```
+
+### Evaluation
+All evaluation scripts for reproducing the main results in the paper are included in the [eval](eval/) directory.
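
The download and training steps in the README above could be chained with a free-space check first, since the full download is listed as 328 GB. The following is only a sketch under stated assumptions: the helper names (`free_gb`, `has_space`, `run_step`, `run_pipeline`) and the `DRY_RUN` variable are hypothetical and not part of the repository, and the directory where the download lands is assumed to be passed in by the caller. Only the two `uv run bash` commands are taken from the README.

```bash
# Hypothetical helpers, not part of the repository.

# Print the free space, in whole GiB, on the filesystem holding directory $1.
free_gb() {
  # df -Pk gives POSIX-format output in 1024-byte blocks; field 4 is "Available".
  df -Pk "$1" | awk 'NR == 2 { print int($4 / 1024 / 1024) }'
}

# Succeed only if directory $1 has at least $2 GiB free.
has_space() {
  [ "$(free_gb "$1")" -ge "$2" ]
}

# Run the given command, or only print it when DRY_RUN=1.
run_step() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "would run: $*"
  else
    "$@"
  fi
}

# Check disk space, then download the data and train, stopping on the first failure.
run_pipeline() {
  d2l_dir="${1:-.}"      # assumption: directory where the downloaded data ends up
  d2l_need="${2:-328}"   # the README lists 328 GB for all three models

  if ! has_space "$d2l_dir" "$d2l_need"; then
    echo "need ${d2l_need} GB free in ${d2l_dir}, found $(free_gb "$d2l_dir")" >&2
    return 1
  fi

  run_step uv run bash scripts/main_exp/0-download_data.sh || return 1
  run_step uv run bash scripts/main_exp/1-train.sh
}
```

After sourcing these functions, `DRY_RUN=1 run_pipeline /path/to/data 328` prints the two commands without executing them, and dropping `DRY_RUN` runs them in order. The size check is approximate (it compares GiB against the README's GB figure), and the evaluation step is left out because the README does not name the individual scripts in `eval/`.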