Unsloth Fine-Tune Template

Linux only: this template is designed for Linux systems with an NVIDIA GPU (CUDA), an AMD GPU (ROCm), or a Vulkan-compatible GPU.

A template for fine-tuning LLMs using Unsloth and converting to GGUF format with llama.cpp.

Prerequisites

  • Linux OS
  • Python 3.10+
  • NVIDIA GPU (CUDA) or AMD GPU (ROCm) or Vulkan-compatible GPU
  • cmake
  • git

Quick Start

# 1. Setup (clones llama.cpp, builds it, installs dependencies)
bash setup.sh

# 2. Configure scripts (see variables below)

# 3. Run full pipeline
bash run-pipeline.sh

Workflow

scripts/generate-data.sh   → Generate synthetic training data (optional)
scripts/finetune.sh        → Fine-tune model with LoRA adapters
scripts/merge-and-convert.sh → Merge LoRA into base model and convert to GGUF
scripts/run-model.sh       → Run the converted GGUF model
run-pipeline.sh            → Run finetune → merge/convert → run in sequence
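
As a rough illustration of the sequence above, the following Python sketch runs the three pipeline scripts in order and stops at the first failure. It is not the contents of run-pipeline.sh; the function name `run_pipeline` and the injectable `runner` parameter are assumptions made for the example.

```python
import subprocess
from typing import Callable, Sequence

# Script paths taken from the workflow list; generate-data.sh is optional and left out.
PIPELINE_STEPS = (
    "scripts/finetune.sh",
    "scripts/merge-and-convert.sh",
    "scripts/run-model.sh",
)


def run_pipeline(
    steps: Sequence[str] = PIPELINE_STEPS,
    runner: Callable[..., object] = subprocess.run,
) -> None:
    """Run each script with bash, in order (hypothetical helper).

    With the default runner, check=True makes subprocess.run raise
    CalledProcessError on a non-zero exit code, so later steps are skipped.
    """
    for script in steps:
        runner(["bash", script], check=True)
```

Passing a different `runner` makes it possible to dry-run the sequence without executing anything.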

Setup

setup.sh will:

  1. Create a Python virtual environment and install Python dependencies
  2. Clone llama.cpp (fresh build) or symlink an existing build
  3. Build llama.cpp with shared libraries (-DBUILD_SHARED_LIBS=ON)
  4. Install llama-cpp-python bindings linked against the shared library (-DLLAMA_BUILD=OFF)

Using an existing llama.cpp build: Choose option 2 and provide the absolute path. The build must have been created with -DBUILD_SHARED_LIBS=ON and contain libllama.so. Setup will create a symlink at ./llama.cpp.

Backend Selection

Setup prompts for a backend only when building llama.cpp from scratch. Choose based on your GPU:

| Choice | Backend | Requirements |
| --- | --- | --- |
| 1 | CUDA (NVIDIA) | Systemwide CUDA installation (NVIDIA drivers + CUDA toolkit) |
| 2 | ROCm (AMD) | Systemwide ROCm installation (AMD drivers + ROCm toolkit) |
| 3 | Vulkan | Vulkan drivers + libvulkan-dev, glslc, spirv-headers |
| 4 | CPU only | None |

Vulkan dependencies (Ubuntu/Debian):

sudo apt-get install libvulkan-dev glslc spirv-headers

Verify Vulkan is correctly installed:

vulkaninfo

The command should run without errors.

Existing llama.cpp Build

If using option 2 (existing build), ensure it was compiled with shared libraries:

cmake -B build -DBUILD_SHARED_LIBS=ON # Add your custom build options
cmake --build build --config Release -j$(nproc)

The build must contain libllama.so (typically at build/bin/libllama.so).
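
Before pointing setup at an existing build, it may help to confirm that the shared library is actually there. The sketch below searches the build directory for libllama.so; the function name `find_libllama` is hypothetical, and the assumption that the library lives somewhere under `<llama.cpp dir>/build` follows the path given above.

```python
from pathlib import Path
from typing import Optional


def find_libllama(llama_cpp_dir: str) -> Optional[Path]:
    """Return the first libllama.so found under <llama_cpp_dir>/build, or None."""
    build_dir = Path(llama_cpp_dir) / "build"
    if not build_dir.is_dir():
        return None
    # rglob walks the whole build tree; sorting keeps the result deterministic.
    matches = sorted(build_dir.rglob("libllama.so"))
    return matches[0] if matches else None
```

If `find_libllama("/absolute/path/to/llama.cpp")` returns `None`, the build was most likely made without -DBUILD_SHARED_LIBS=ON and needs to be rebuilt.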

Scripts

1. scripts/generate-data.sh

Generates synthetic training data using a GGUF model via llama.cpp. Run this if you need to create or extend a training dataset.

Edit synthetic-data.py:

| Variable | Description | Example |
| --- | --- | --- |
| GGUF_MODEL_PATH | Path to the GGUF model used for generation | ./path/to/model.gguf |
| INPUT_PARQUET_PATH | Path to existing training data to extend | ./data/train.parquet |
| OUTPUT_PARQUET_PATH | Path to save the combined dataset | ./data/output.parquet |
| NEW_ROWS_COUNT | Number of synthetic records to generate | 100 |
| User prompt (line 66) | Replace "YOUR PROMPT GOES HERE" with generation instructions | Generate questions about machine learning... |
| System message (line 62) | Controls the model's role | "You are a data generator. Output ONLY the format below..." |
| max_tokens | Max tokens per response | 200 |
| temperature | Creativity of generation | 0.7 |
| top_p | Nucleus sampling threshold | 0.95 |
| top_k | Top-k sampling threshold | 50 |
| min_p | Minimum probability threshold | 0.05 |

The script expects output in the format:

Question: <generated question>
Answer: <generated answer>

bash scripts/generate-data.sh
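
To make the expected format concrete, here is a minimal sketch of how a response in the Question/Answer layout could be split into its two fields. It is not the parsing code from synthetic-data.py; the function name `parse_qa` and the regular expression are assumptions for illustration.

```python
import re
from typing import Optional, Tuple

# "Question: ..." followed by "Answer: ..."; DOTALL lets either field span lines.
_QA_PATTERN = re.compile(r"Question:\s*(.+?)\s*Answer:\s*(.+)", re.DOTALL)


def parse_qa(text: str) -> Optional[Tuple[str, str]]:
    """Return (question, answer), or None if the text does not match the format."""
    match = _QA_PATTERN.search(text)
    if match is None:
        return None
    question = match.group(1).strip()
    answer = match.group(2).strip()
    if not question or not answer:
        return None
    return question, answer
```

Responses that return `None` can be discarded or regenerated, which is one reason the system message asks the model to output only this format.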

2. scripts/finetune.sh

Fine-tunes a model using Unsloth with LoRA adapters. Saves LoRA adapter to ./model/.

Edit finetune.py:

| Variable | Description | Example |
| --- | --- | --- |
| DATA_PATH | Path to training Parquet file | ./data/output.parquet |
| OUTPUT_DIR | Directory to save LoRA adapters (leave at default) | ./model |
| BATCH_SIZE | Per-device batch size | 2 |
| GRADIENT_ACCUMULATION_STEPS | Gradient accumulation steps | 8 |
| LEARNING_RATE | Training learning rate | 2e-4 |
| MAX_LENGTH | Maximum sequence length | 4096 |
| TRAIN_EPOCHS | Number of training epochs | 1 |
| model_name (line 74) | Base model to fine-tune | "Qwen/Qwen3.5-2B" |

bash scripts/finetune.sh

3. scripts/merge-and-convert.sh

Merges LoRA adapters into the base model, saves the merged model, then converts to GGUF format using llama.cpp.

Edit merge.py:

| Variable | Description | Example |
| --- | --- | --- |
| BASE_MODEL_PATH | Path to the base model (same as model_name in finetune.py) | "Qwen/Qwen3.5-2B" |
| LORA_DIR | Path to LoRA adapters (leave at default) | ./model |
| MERGED_MODEL_PATH | Output directory for merged model (leave at default) | ./merged_model |

bash scripts/merge-and-convert.sh
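
For orientation, the conversion half of this step can be sketched as a call to llama.cpp's convert_hf_to_gguf.py, which takes the merged model directory plus --outfile and --outtype. The helper below only builds the command; the function name `build_convert_command`, the f16 output type, and the output filename are assumptions, and merge-and-convert.sh may invoke the converter differently.

```python
import sys
from pathlib import Path
from typing import List


def build_convert_command(
    merged_model_dir: str,
    outfile: str,
    outtype: str = "f16",  # assumed default; the template may use another type
    llama_cpp_dir: str = "./llama.cpp",
) -> List[str]:
    """Build the argument list for llama.cpp's HF-to-GGUF converter (hypothetical helper)."""
    script = Path(llama_cpp_dir) / "convert_hf_to_gguf.py"
    return [
        sys.executable,  # run the converter with the current (venv) interpreter
        str(script),
        merged_model_dir,
        "--outfile", outfile,
        "--outtype", outtype,
    ]
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once the merged model has been saved to MERGED_MODEL_PATH.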

4. scripts/run-model.sh

Runs the converted GGUF model using llama.cpp's CLI interface for inference.

Edit run-model.sh:

| Variable | Description | Example |
| --- | --- | --- |
| Model path | Path to the GGUF file (the GGUF file name will vary based on the base model) | ./merged_model/model.gguf |

bash scripts/run-model.sh
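
Because the GGUF file name depends on the base model, a small lookup can help find the right path to put in run-model.sh. This is a sketch under the assumption that the converted file sits directly inside ./merged_model; the function name `newest_gguf` is hypothetical.

```python
from pathlib import Path
from typing import Optional


def newest_gguf(model_dir: str = "./merged_model") -> Optional[Path]:
    """Return the most recently modified *.gguf file in model_dir, or None."""
    candidates = list(Path(model_dir).glob("*.gguf"))
    if not candidates:
        return None
    # Pick the latest file in case earlier conversions are still present.
    return max(candidates, key=lambda path: path.stat().st_mtime)
```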

Output Structure

./model/                  ← LoRA adapters (from finetune.sh)
./merged_model/           ← Merged HF model + GGUF file (from merge-and-convert.sh)
llama.cpp/                ← llama.cpp repository (created by setup.sh)
scripts/                  ← Individual pipeline step scripts
setup.sh                  ← Setup script (venv + llama.cpp build/symlink)
run-pipeline.sh           ← Run full pipeline (finetune → merge/convert → run)

Troubleshooting

llama.cpp build fails

See the official build guide: https://github.com/ggml-org/llama.cpp/blob/master/docs/build.md

Common issues:

  • CUDA: Requires a systemwide CUDA installation (NVIDIA drivers + CUDA toolkit)
  • ROCm: Requires a systemwide ROCm installation (AMD drivers + ROCm toolkit)
  • Vulkan: Requires Vulkan drivers + libvulkan-dev, glslc, spirv-headers
  • cmake: Install via sudo apt install cmake (Debian/Ubuntu)

Out of memory during training

  • Reduce BATCH_SIZE in finetune.py (lower = less VRAM usage)
  • Increase GRADIENT_ACCUMULATION_STEPS to compensate (higher = longer finetuning time)
  • Keep the effective batch size (BATCH_SIZE × GRADIENT_ACCUMULATION_STEPS) at 16 or higher, for example 2 × 8 = 16
  • Reduce MAX_LENGTH to fit shorter sequences
  • Set load_in_4bit=True in finetune.py (line 77) for QLoRA
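
The trade-off between BATCH_SIZE and GRADIENT_ACCUMULATION_STEPS is simple arithmetic, sketched below. The helper name `accumulation_steps_for` is hypothetical, and the target of 16 is taken from the guideline above rather than from any hard requirement.

```python
import math


def accumulation_steps_for(batch_size: int, target_effective: int = 16) -> int:
    """Smallest number of accumulation steps that reaches the target effective batch size."""
    if batch_size < 1 or target_effective < 1:
        raise ValueError("batch_size and target_effective must be positive")
    return math.ceil(target_effective / batch_size)


# Halving the per-device batch size doubles the accumulation steps needed.
for batch_size in (4, 2, 1):
    steps = accumulation_steps_for(batch_size)
    print(f"BATCH_SIZE={batch_size} GRADIENT_ACCUMULATION_STEPS={steps} "
          f"effective={batch_size * steps}")
```

Lowering BATCH_SIZE reduces peak VRAM, while the extra accumulation steps keep the gradient estimate comparable at the cost of more forward and backward passes per optimizer update.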

llama-cpp-python install fails

  • Ensure llama.cpp is built successfully first (or build it yourself if you want to use a backend other than CUDA, ROCm or Vulkan)
  • Try CPU-only install first to verify: pip install llama-cpp-python
  • Check llama-cpp-python docs

Project Structure

├── finetune.py           ← Training script
├── merge.py              ← Merge LoRA into base model
├── synthetic-data.py     ← Generate synthetic training data
├── requirements.txt      ← Python dependencies
├── setup.sh              ← One-time setup
├── run-pipeline.sh       ← Run full pipeline
├── scripts/
│   ├── generate-data.sh
│   ├── finetune.sh
│   ├── merge-and-convert.sh
│   └── run-model.sh
└── README.md