2026-06-02 15:45:59 +02:00
# Unsloth Fine-Tune Template
> **Linux only:** This template is designed for Linux systems with an NVIDIA GPU (CUDA), an AMD GPU (ROCm), or a Vulkan-compatible GPU.
A template for fine-tuning LLMs using [Unsloth](https://github.com/unslothai/unsloth) and converting to GGUF format with [llama.cpp](https://github.com/ggerganov/llama.cpp).
## Prerequisites
- Linux OS
- Python 3.10+
- NVIDIA GPU (CUDA) or AMD GPU (ROCm) or Vulkan-compatible GPU
- [cmake](https://cmake.org/)
- [git](https://git-scm.com/)
## Quick Start
```bash
# 1. Setup (clones llama.cpp, builds it, installs dependencies)
bash setup.sh
# 2. Configure scripts (see variables below)
# 3. Run full pipeline
bash run-pipeline.sh
```
## Workflow
```
scripts/generate-data.sh → Generate synthetic training data (optional)
scripts/finetune.sh → Fine-tune model with LoRA adapters
scripts/merge-and-convert.sh → Merge LoRA into base model and convert to GGUF
scripts/run-model.sh → Run the converted GGUF model
run-pipeline.sh → Run finetune → merge/convert → run in sequence
```
## Setup
`setup.sh` will:
1. Create a Python virtual environment and install Python dependencies
2. Clone [llama.cpp](https://github.com/ggml-org/llama.cpp) (fresh build) or symlink an existing build
3. Build llama.cpp with shared libraries (`-DBUILD_SHARED_LIBS=ON`)
4. Install llama-cpp-python bindings linked against the shared library (`-DLLAMA_BUILD=OFF`)
**Using an existing llama.cpp build:** Choose option 2 and provide the absolute path. The build must have been created with `-DBUILD_SHARED_LIBS=ON` and contain `libllama.so`. Setup will create a symlink at `./llama.cpp`.
### Backend Selection
Setup prompts for a backend only when building llama.cpp from scratch. Choose based on your GPU:
| Choice | Backend | Requirements |
|---|---|---|
| 1 | CUDA (NVIDIA) | Systemwide CUDA installation (NVIDIA drivers + CUDA toolkit) |
| 2 | ROCm (AMD) | Systemwide ROCm installation (AMD drivers + ROCm toolkit) |
| 3 | Vulkan | Vulkan drivers + `libvulkan-dev`, `glslc`, `spirv-headers` |
| 4 | CPU only | None |
**Vulkan dependencies (Ubuntu/Debian):**
```bash
sudo apt-get install libvulkan-dev glslc spirv-headers
```
Verify Vulkan is correctly installed:
```bash
vulkaninfo
```
The command should run without errors.
### Existing llama.cpp Build
If using option 2 (existing build), ensure it was compiled with shared libraries:
```bash
cmake -B build -DBUILD_SHARED_LIBS=ON -DGGML_CUDA=ON # or -DGGML_HIP=ON / -DGGML_VULKAN=1
cmake --build build --config Release -j$(nproc)
```
The build must contain `libllama.so` (typically at `build/libllama.so`).
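Before choosing option 2, it can help to confirm that the library is actually present in the build tree. The following is a minimal sketch, not part of this template: the helper name `find_libllama` and the example path are hypothetical, and it only checks that a matching file exists, not that it was built for the backend you want.

```python
from pathlib import Path


def find_libllama(build_root: str) -> list[Path]:
    """Return every libllama.so under build_root, including versioned
    names such as libllama.so.0. Returns an empty list if nothing is found."""
    root = Path(build_root)
    if not root.is_dir():
        return []
    # rglob searches recursively, so build/, build/bin/ and build/src/ are all covered.
    return sorted(
        p for p in root.rglob("libllama.so*") if p.is_file() or p.is_symlink()
    )


if __name__ == "__main__":
    # Hypothetical path: replace with the absolute path to your own build.
    matches = find_libllama("/path/to/llama.cpp/build")
    if matches:
        for match in matches:
            print(f"found: {match}")
    else:
        print("libllama.so not found; rebuild with -DBUILD_SHARED_LIBS=ON")
```

If the list comes back empty, the build was most likely configured without `-DBUILD_SHARED_LIBS=ON`.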
## Scripts
### 1. scripts/generate-data.sh
Generates synthetic training data using a GGUF model via llama.cpp. Run this if you need to create or extend a training dataset.
**Edit `synthetic-data.py`:**
| Variable | Description | Example |
|---|---|---|
| `GGUF_MODEL_PATH` | Path to the GGUF model used for generation | `./path/to/model.gguf` |
| `INPUT_PARQUET_PATH` | Path to existing training data to extend | `./data/train.parquet` |
| `OUTPUT_PARQUET_PATH` | Path to save the combined dataset | `./data/output.parquet` |
| `NEW_ROWS_COUNT` | Number of synthetic records to generate | `100` |
| `User prompt` (line 66) | Replace `"YOUR PROMPT GOES HERE"` with generation instructions | `Generate questions about machine learning...` |
| `System message` (line 62) | Controls the model's role | `"You are a data generator. Output ONLY the format below..."` |
| `max_tokens` | Max tokens per response | `200` |
| `temperature` | Creativity of generation | `0.7` |
| `top_p` | Nucleus sampling threshold | `0.95` |
| `top_k` | Top-k sampling threshold | `50` |
| `min_p` | Minimum probability threshold | `0.05` |
The script expects output in the format:
```
Question: <generated question>
Answer: <generated answer>
```
```bash
bash scripts/generate-data.sh
```
### 2. scripts/finetune.sh
Fine-tunes a model using Unsloth with LoRA adapters and saves the adapters to `./model/`.
**Edit `finetune.py`:**
| Variable | Description | Example |
|---|---|---|
| `DATA_PATH` | Path to training Parquet file | `./data/output.parquet` |
| `OUTPUT_DIR` | Directory to save LoRA adapters (leave at default) | `./model` |
| `BATCH_SIZE` | Per-device batch size | `2` |
| `GRADIENT_ACCUMULATION_STEPS` | Gradient accumulation steps | `8` |
| `LEARNING_RATE` | Training learning rate | `2e-4` |
| `MAX_LENGTH` | Maximum sequence length | `4096` |
| `TRAIN_EPOCHS` | Number of training epochs | `1` |
| `model_name` (line 74) | Base model to fine-tune | `"unsloth/Llama-3.2-3B-Instruct"` |
```bash
bash scripts/finetune.sh
```
### 3. scripts/merge-and-convert.sh
Merges LoRA adapters into the base model, saves the merged model, then converts to GGUF format using llama.cpp.
**Edit `merge.py`:**
| Variable | Description | Example |
|---|---|---|
| `BASE_MODEL_PATH` | Path to the base model (must match `model_name` in `finetune.py`) | `"unsloth/Llama-3.2-3B-Instruct"` |
| `LORA_DIR` | Path to LoRA adapters (leave at default) | `./model` |
| `MERGED_MODEL_PATH` | Output directory for merged model (leave at default) | `./merged_model` |
```bash
bash scripts/merge-and-convert.sh
```
### 4. scripts/run-model.sh
Runs the converted GGUF model using llama.cpp's CLI interface for inference.
**Edit `run-model.sh`:**
| Variable | Description | Example |
|---|---|---|
| Model path | Path to the GGUF file (the file name varies with the base model) | `./merged_model/model.gguf` |
```bash
bash scripts/run-model.sh
```
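Because the GGUF file name depends on the base model, it can be easier to look it up than to guess it. The sketch below lists the `.gguf` files in the merged model directory so you can copy the right path into `run-model.sh`. It is a hypothetical helper, not part of this template; the function name `find_gguf_files` is an assumption, and the default directory follows the `MERGED_MODEL_PATH` default described above.

```python
from pathlib import Path


def find_gguf_files(merged_dir: str = "./merged_model") -> list[Path]:
    """List .gguf files directly inside merged_dir, most recently modified first."""
    directory = Path(merged_dir)
    if not directory.is_dir():
        return []
    return sorted(
        directory.glob("*.gguf"),
        key=lambda p: p.stat().st_mtime,
        reverse=True,
    )


if __name__ == "__main__":
    candidates = find_gguf_files()
    if candidates:
        for path in candidates:
            print(path)
    else:
        print("No .gguf file found; run scripts/merge-and-convert.sh first.")
```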
## Output Structure
```
./model/ ← LoRA adapters (from finetune.sh)
./merged_model/ ← Merged HF model + GGUF file (from merge-and-convert.sh)
llama.cpp/ ← llama.cpp repository (created by setup.sh)
scripts/ ← Individual pipeline step scripts
setup.sh ← Setup script (venv + llama.cpp build/symlink)
run-pipeline.sh ← Run full pipeline (finetune → merge/convert → run)
```
## Troubleshooting
### llama.cpp build fails
See the official build guide:
https://github.com/ggml-org/llama.cpp/blob/master/docs/build.md
Common issues:
- **CUDA**: Requires a systemwide CUDA installation (NVIDIA drivers + CUDA toolkit)
- **ROCm**: Requires a systemwide ROCm installation (AMD drivers + ROCm toolkit)
- **Vulkan**: Requires Vulkan drivers + `libvulkan-dev`, `glslc`, `spirv-headers`
- **cmake**: Install via `sudo apt install cmake` (Debian/Ubuntu)
### Out of memory during training
- Reduce `BATCH_SIZE` in `finetune.py`
- Increase `GRADIENT_ACCUMULATION_STEPS` to compensate
- Reduce `MAX_LENGTH` to fit shorter sequences
- Set `load_in_4bit=True` in `finetune.py` (line 77)
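The first two suggestions work together: the number of sequences contributing to each optimizer update is the per-device batch size multiplied by the accumulation steps, while peak memory depends mainly on the per-device batch size and sequence length. The sketch below shows that arithmetic under the assumption of a single GPU by default; the function names are hypothetical and do not exist in `finetune.py`.

```python
def effective_batch_size(
    batch_size: int, grad_accum_steps: int, num_devices: int = 1
) -> int:
    """Sequences contributing to one optimizer update."""
    return batch_size * grad_accum_steps * num_devices


def accumulation_steps_for(
    target_effective: int, batch_size: int, num_devices: int = 1
) -> int:
    """Smallest accumulation step count that reaches at least target_effective."""
    per_step = batch_size * num_devices
    return -(-target_effective // per_step)  # ceiling division


if __name__ == "__main__":
    # Example values from the finetune.py table: BATCH_SIZE=2, GRADIENT_ACCUMULATION_STEPS=8.
    current = effective_batch_size(2, 8)
    # Halving BATCH_SIZE to 1 needs this many accumulation steps to keep the same update size.
    steps = accumulation_steps_for(current, batch_size=1)
    print(current, steps)
```

Keeping the product constant preserves the update size while lowering memory use, at the cost of more forward and backward passes per update.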
### llama-cpp-python install fails
- Ensure llama.cpp built successfully first (build it yourself if you want to use a backend other than CUDA, ROCm, or Vulkan)
- Try CPU-only install first to verify: `pip install llama-cpp-python`
- Check the [llama-cpp-python docs](https://llama-cpp-python.readthedocs.io/en/latest/)
## Project Structure
```
├── finetune.py ← Training script
├── merge.py ← Merge LoRA into base model
├── synthetic-data.py ← Generate synthetic training data
├── requirements.txt ← Python dependencies
├── setup.sh ← One-time setup
├── run-pipeline.sh ← Run full pipeline
├── scripts/
│ ├── generate-data.sh
│ ├── finetune.sh
│ ├── merge-and-convert.sh
│ └── run-model.sh
└── README.md
```