Unsloth-Finetune-Template/README.md

# Unsloth Fine-Tune Template

> **Linux only** — This template is designed for Linux systems with NVIDIA GPU (CUDA), AMD GPU (ROCm), or Vulkan support.

A template for fine-tuning LLMs using [Unsloth](https://github.com/unslothai/unsloth) and converting to GGUF format with [llama.cpp](https://github.com/ggerganov/llama.cpp).

## Prerequisites

- Linux OS
- Python 3.10+
- NVIDIA GPU (CUDA) or AMD GPU (ROCm) or Vulkan-compatible GPU
- [cmake](https://cmake.org/)
- [git](https://git-scm.com/)

## Quick Start

```bash
# 1. Setup (clones llama.cpp, builds it, installs dependencies)
bash setup.sh

# 2. Configure scripts (see variables below)

# 3. Run full pipeline
bash run-pipeline.sh
```

## Workflow

```
scripts/generate-data.sh   → Generate synthetic training data (optional)
scripts/finetune.sh        → Fine-tune model with LoRA adapters
scripts/merge-and-convert.sh → Merge LoRA into base model and convert to GGUF
scripts/run-model.sh       → Run the converted GGUF model
run-pipeline.sh            → Run finetune → merge/convert → run in sequence
```

## Setup

`setup.sh` will:
1. Create a Python virtual environment and install Python dependencies
2. Clone [llama.cpp](https://github.com/ggml-org/llama.cpp) (fresh build) or symlink an existing build
3. Build llama.cpp with shared libraries (`-DBUILD_SHARED_LIBS=ON`)
4. Install llama-cpp-python bindings linked against the shared library (`-DLLAMA_BUILD=OFF`)

**Using an existing llama.cpp build:** Choose option 2 and provide the absolute path. The build must have been created with `-DBUILD_SHARED_LIBS=ON` and contain `libllama.so`. Setup will create a symlink at `./llama.cpp`.

### Backend Selection

Backend is only prompted when building llama.cpp from scratch. Choose based on your GPU:

| Choice | Backend | Requirements |
|---|---|---|
| 1 | CUDA (NVIDIA) | Systemwide CUDA installation (NVIDIA drivers + CUDA toolkit) |
| 2 | ROCm (AMD) | Systemwide ROCm installation (AMD drivers + ROCm toolkit) |
| 3 | Vulkan | Vulkan drivers + `libvulkan-dev`, `glslc`, `spirv-headers` |
| 4 | CPU only | None |

**Vulkan dependencies (Ubuntu/Debian):**

```bash
sudo apt-get install libvulkan-dev glslc spirv-headers
```

Verify Vulkan is correctly installed:

```bash
vulkaninfo
```

Should run without errors.

### Existing llama.cpp Build

If using option 2 (existing build), ensure it was compiled with shared libraries:

```bash
cmake -B build -DBUILD_SHARED_LIBS=ON -DGGML_CUDA=ON  # or -DGGML_HIP=ON / -DGGML_VULKAN=1
cmake --build build --config Release -j$(nproc)
```

The build must contain `libllama.so` (typically at `build/libllama.so`).

## Scripts

### 1. scripts/generate-data.sh

Generates synthetic training data using a GGUF model via llama.cpp. Run this if you need to create or extend a training dataset.

**Edit `synthetic-data.py`:**

| Variable | Description | Example |
|---|---|---|
| `GGUF_MODEL_PATH` | Path to the GGUF model used for generation | `./path/to/model.gguf` |
| `INPUT_PARQUET_PATH` | Path to existing training data to extend | `./data/train.parquet` |
| `OUTPUT_PARQUET_PATH` | Path to save the combined dataset | `./data/output.parquet` |
| `NEW_ROWS_COUNT` | Number of synthetic records to generate | `100` |
| `User prompt` (line 66) | Replace `"YOUR PROMPT GOES HERE"` with generation instructions | `Generate questions about machine learning...` |
| `System message` (line 62) | Controls the model's role | `"You are a data generator. Output ONLY the format below..."` |
| `max_tokens` | Max tokens per response | `200` |
| `temperature` | Creativity of generation | `0.7` |
| `top_p` | Nucleus sampling threshold | `0.95` |
| `top_k` | Top-k sampling threshold | `50` |
| `min_p` | Minimum probability threshold | `0.05` |

The script expects output in the format:

```
Question: <generated question>
Answer: <generated answer>
```

```bash
bash scripts/generate-data.sh
```

### 2. scripts/finetune.sh

Fine-tunes a model using Unsloth with LoRA adapters. Saves LoRA adapter to `./model/`.

**Edit `finetune.py`:**

| Variable | Description | Example |
|---|---|---|
| `DATA_PATH` | Path to training Parquet file | `./data/output.parquet` |
| `OUTPUT_DIR` | Directory to save LoRA adapters (leave at default) | `./model` |
| `BATCH_SIZE` | Per-device batch size | `2` |
| `GRADIENT_ACCUMULATION_STEPS` | Gradient accumulation steps | `8` |
| `LEARNING_RATE` | Training learning rate | `2e-4` |
| `MAX_LENGTH` | Maximum sequence length | `4096` |
| `TRAIN_EPOCHS` | Number of training epochs | `1` |
| `model_name` (line 74) | Base model to fine-tune | `"unsloth/Llama-3.2-3B-Instruct"` |

```bash
bash scripts/finetune.sh
```

### 3. scripts/merge-and-convert.sh

Merges LoRA adapters into the base model, saves the merged model, then converts to GGUF format using llama.cpp.

**Edit `merge.py`:**

| Variable | Description | Example |
|---|---|---|
| `BASE_MODEL_PATH` | Path to the base model (same as model_name in finetune.py) | `"Qwen/Qwen3.5-2B"` |
| `LORA_DIR` | Path to LoRA adapters (leave at default) | `./model` |
| `MERGED_MODEL_PATH` | Output directory for merged model (leave at default) | `./merged_model` |

```bash
bash scripts/merge-and-convert.sh
```

### 4. scripts/run-model.sh

Runs the converted GGUF model using llama.cpp's CLI interface for inference.

**Edit `run-model.sh`:**

| Variable | Description | Example |
|---|---|---|
| Model path | Path to the GGUF file (gguf file name will vary based on base model) | `./merged_model/model.gguf` |

```bash
bash scripts/run-model.sh
```

## Output Structure

```
./model/                  ← LoRA adapters (from finetune.sh)
./merged_model/           ← Merged HF model + GGUF file (from merge-and-convert.sh)
llama.cpp/                ← llama.cpp repository (created by setup.sh)
scripts/                  ← Individual pipeline step scripts
setup.sh                  ← Setup script (venv + llama.cpp build/symlink)
run-pipeline.sh           ← Run full pipeline (finetune → merge/convert → run)
```

## Troubleshooting

### llama.cpp build fails

See the official build guide:
https://github.com/ggml-org/llama.cpp/blob/master/docs/build.md

Common issues:
- **CUDA**: Requires a systemwide CUDA installation (NVIDIA drivers + CUDA toolkit)
- **ROCm**: Requires a systemwide ROCm installation (AMD drivers + ROCm toolkit)
- **Vulkan**: Requires Vulkan drivers + `libvulkan-dev`, `glslc`, `spirv-headers`
- **cmake**: Install via `sudo apt install cmake` (Debian/Ubuntu)

### Out of memory during training

- Reduce `BATCH_SIZE` in `finetune.py`
- Increase `GRADIENT_ACCUMULATION_STEPS` to compensate
- Reduce `MAX_LENGTH` to fit shorter sequences
- Set `load_in_4bit=True` in `finetune.py` (line 77)

### llama-cpp-python install fails

- Ensure llama.cpp is built successfully first (or build it yourself if you want to use a backend other than CUDA, ROCm or Vulkan)
- Try CPU-only install first to verify: `pip install llama-cpp-python`
- Check [llama-cpp-python docs](https://llama-cpp-python.readthedocs.io/en/latest/)

## Project Structure

```
├── finetune.py           ← Training script
├── merge.py              ← Merge LoRA into base model
├── synthetic-data.py     ← Generate synthetic training data
├── requirements.txt      ← Python dependencies
├── setup.sh              ← One-time setup
├── run-pipeline.sh       ← Run full pipeline
├── scripts/
│   ├── generate-data.sh
│   ├── finetune.sh
│   ├── merge-and-convert.sh
│   └── run-model.sh
└── README.md
```
Initial commit 2026-06-02 15:45:59 +02:00			`# Unsloth Fine-Tune Template`

			`> Linux only — This template is designed for Linux systems with NVIDIA GPU (CUDA), AMD GPU (ROCm), or Vulkan support.`

			`A template for fine-tuning LLMs using [Unsloth](https://github.com/unslothai/unsloth) and converting to GGUF format with [llama.cpp](https://github.com/ggerganov/llama.cpp).`

			`## Prerequisites`

			`- Linux OS`
			`- Python 3.10+`
			`- NVIDIA GPU (CUDA) or AMD GPU (ROCm) or Vulkan-compatible GPU`
			`- [cmake](https://cmake.org/)`
			`- [git](https://git-scm.com/)`

			`## Quick Start`

			```bash
			`# 1. Setup (clones llama.cpp, builds it, installs dependencies)`
			`bash setup.sh`

			`# 2. Configure scripts (see variables below)`

			`# 3. Run full pipeline`
			`bash run-pipeline.sh`
			```

			`## Workflow`

			```
			`scripts/generate-data.sh → Generate synthetic training data (optional)`
			`scripts/finetune.sh → Fine-tune model with LoRA adapters`
			`scripts/merge-and-convert.sh → Merge LoRA into base model and convert to GGUF`
			`scripts/run-model.sh → Run the converted GGUF model`
			`run-pipeline.sh → Run finetune → merge/convert → run in sequence`
			```

			`## Setup`

			`setup.sh` will:
			`1. Create a Python virtual environment and install Python dependencies`
Use existing llama.cpp build for python bindings if possible 2026-06-02 17:08:39 +02:00			`2. Clone [llama.cpp](https://github.com/ggml-org/llama.cpp) (fresh build) or symlink an existing build`
			3. Build llama.cpp with shared libraries (`-DBUILD_SHARED_LIBS=ON`)
			4. Install llama-cpp-python bindings linked against the shared library (`-DLLAMA_BUILD=OFF`)
Initial commit 2026-06-02 15:45:59 +02:00
Use existing llama.cpp build for python bindings if possible 2026-06-02 17:08:39 +02:00			Using an existing llama.cpp build: Choose option 2 and provide the absolute path. The build must have been created with `-DBUILD_SHARED_LIBS=ON` and contain `libllama.so`. Setup will create a symlink at `./llama.cpp`.
Add option to use existing llama.cpp build 2026-06-02 16:39:02 +02:00
Initial commit 2026-06-02 15:45:59 +02:00			`### Backend Selection`

Use existing llama.cpp build for python bindings if possible 2026-06-02 17:08:39 +02:00			`Backend is only prompted when building llama.cpp from scratch. Choose based on your GPU:`

Initial commit 2026-06-02 15:45:59 +02:00			`\| Choice \| Backend \| Requirements \|`
			`\|---\|---\|---\|`
Fix setup 2026-06-02 15:55:34 +02:00			`\| 1 \| CUDA (NVIDIA) \| Systemwide CUDA installation (NVIDIA drivers + CUDA toolkit) \|`
			`\| 2 \| ROCm (AMD) \| Systemwide ROCm installation (AMD drivers + ROCm toolkit) \|`
			\| 3 \| Vulkan \| Vulkan drivers + `libvulkan-dev`, `glslc`, `spirv-headers` \|
Initial commit 2026-06-02 15:45:59 +02:00			`\| 4 \| CPU only \| None \|`

Fix setup 2026-06-02 15:55:34 +02:00			`Vulkan dependencies (Ubuntu/Debian):`

			```bash
			`sudo apt-get install libvulkan-dev glslc spirv-headers`
			```

			`Verify Vulkan is correctly installed:`

			```bash
			`vulkaninfo`
			```

			`Should run without errors.`

Use existing llama.cpp build for python bindings if possible 2026-06-02 17:08:39 +02:00			`### Existing llama.cpp Build`

			`If using option 2 (existing build), ensure it was compiled with shared libraries:`

			```bash
			`cmake -B build -DBUILD_SHARED_LIBS=ON -DGGML_CUDA=ON # or -DGGML_HIP=ON / -DGGML_VULKAN=1`
			`cmake --build build --config Release -j$(nproc)`
			```

			The build must contain `libllama.so` (typically at `build/libllama.so`).

Initial commit 2026-06-02 15:45:59 +02:00			`## Scripts`

			`### 1. scripts/generate-data.sh`

			`Generates synthetic training data using a GGUF model via llama.cpp. Run this if you need to create or extend a training dataset.`

			Edit `synthetic-data.py`:

			`\| Variable \| Description \| Example \|`
			`\|---\|---\|---\|`
			\| `GGUF_MODEL_PATH` \| Path to the GGUF model used for generation \| `./path/to/model.gguf` \|
			\| `INPUT_PARQUET_PATH` \| Path to existing training data to extend \| `./data/train.parquet` \|
			\| `OUTPUT_PARQUET_PATH` \| Path to save the combined dataset \| `./data/output.parquet` \|
			\| `NEW_ROWS_COUNT` \| Number of synthetic records to generate \| `100` \|
Use existing llama.cpp build for python bindings if possible 2026-06-02 17:08:39 +02:00			\| `User prompt` (line 66) \| Replace `"YOUR PROMPT GOES HERE"` with generation instructions \| `Generate questions about machine learning...` \|
			\| `System message` (line 62) \| Controls the model's role \| `"You are a data generator. Output ONLY the format below..."` \|
			\| `max_tokens` \| Max tokens per response \| `200` \|
			\| `temperature` \| Creativity of generation \| `0.7` \|
			\| `top_p` \| Nucleus sampling threshold \| `0.95` \|
			\| `top_k` \| Top-k sampling threshold \| `50` \|
			\| `min_p` \| Minimum probability threshold \| `0.05` \|

Clarify variable adjustment in README.md 2026-06-02 17:41:18 +02:00			`The script expects output in the format:`
Use existing llama.cpp build for python bindings if possible 2026-06-02 17:08:39 +02:00
			```
			`Question: <generated question>`
			`Answer: <generated answer>`
			```
Initial commit 2026-06-02 15:45:59 +02:00
			```bash
			`bash scripts/generate-data.sh`
			```

			`### 2. scripts/finetune.sh`

Clarify variable adjustment in README.md 2026-06-02 17:41:18 +02:00			Fine-tunes a model using Unsloth with LoRA adapters. Saves LoRA adapter to `./model/`.
Initial commit 2026-06-02 15:45:59 +02:00
			Edit `finetune.py`:

			`\| Variable \| Description \| Example \|`
			`\|---\|---\|---\|`
			\| `DATA_PATH` \| Path to training Parquet file \| `./data/output.parquet` \|
Clarify variable adjustment in README.md 2026-06-02 17:41:18 +02:00			\| `OUTPUT_DIR` \| Directory to save LoRA adapters (leave at default) \| `./model` \|
Initial commit 2026-06-02 15:45:59 +02:00			\| `BATCH_SIZE` \| Per-device batch size \| `2` \|
			\| `GRADIENT_ACCUMULATION_STEPS` \| Gradient accumulation steps \| `8` \|
			\| `LEARNING_RATE` \| Training learning rate \| `2e-4` \|
			\| `MAX_LENGTH` \| Maximum sequence length \| `4096` \|
			\| `TRAIN_EPOCHS` \| Number of training epochs \| `1` \|
			\| `model_name` (line 74) \| Base model to fine-tune \| `"unsloth/Llama-3.2-3B-Instruct"` \|

			```bash
			`bash scripts/finetune.sh`
			```

			`### 3. scripts/merge-and-convert.sh`

			`Merges LoRA adapters into the base model, saves the merged model, then converts to GGUF format using llama.cpp.`

			Edit `merge.py`:

			`\| Variable \| Description \| Example \|`
			`\|---\|---\|---\|`
Clarify variable adjustment in README.md 2026-06-02 17:41:18 +02:00			\| `BASE_MODEL_PATH` \| Path to the base model (same as model_name in finetune.py) \| `"Qwen/Qwen3.5-2B"` \|
			\| `LORA_DIR` \| Path to LoRA adapters (leave at default) \| `./model` \|
			\| `MERGED_MODEL_PATH` \| Output directory for merged model (leave at default) \| `./merged_model` \|
Initial commit 2026-06-02 15:45:59 +02:00
			```bash
			`bash scripts/merge-and-convert.sh`
			```

			`### 4. scripts/run-model.sh`

			`Runs the converted GGUF model using llama.cpp's CLI interface for inference.`

			Edit `run-model.sh`:

			`\| Variable \| Description \| Example \|`
			`\|---\|---\|---\|`
Clarify variable adjustment in README.md 2026-06-02 17:41:18 +02:00			\| Model path \| Path to the GGUF file (gguf file name will vary based on base model) \| `./merged_model/model.gguf` \|
Initial commit 2026-06-02 15:45:59 +02:00
			```bash
			`bash scripts/run-model.sh`
			```

			`## Output Structure`

			```
			`./model/ ← LoRA adapters (from finetune.sh)`
			`./merged_model/ ← Merged HF model + GGUF file (from merge-and-convert.sh)`
			`llama.cpp/ ← llama.cpp repository (created by setup.sh)`
			`scripts/ ← Individual pipeline step scripts`
Clarify variable adjustment in README.md 2026-06-02 17:41:18 +02:00			`setup.sh ← Setup script (venv + llama.cpp build/symlink)`
Initial commit 2026-06-02 15:45:59 +02:00			`run-pipeline.sh ← Run full pipeline (finetune → merge/convert → run)`
			```

			`## Troubleshooting`

			`### llama.cpp build fails`

			`See the official build guide:`
			`https://github.com/ggml-org/llama.cpp/blob/master/docs/build.md`

			`Common issues:`
Fix setup 2026-06-02 15:55:34 +02:00			`- CUDA: Requires a systemwide CUDA installation (NVIDIA drivers + CUDA toolkit)`
			`- ROCm: Requires a systemwide ROCm installation (AMD drivers + ROCm toolkit)`
			- Vulkan: Requires Vulkan drivers + `libvulkan-dev`, `glslc`, `spirv-headers`
Initial commit 2026-06-02 15:45:59 +02:00			- cmake: Install via `sudo apt install cmake` (Debian/Ubuntu)

			`### Out of memory during training`

			- Reduce `BATCH_SIZE` in `finetune.py`
			- Increase `GRADIENT_ACCUMULATION_STEPS` to compensate
			- Reduce `MAX_LENGTH` to fit shorter sequences
			- Set `load_in_4bit=True` in `finetune.py` (line 77)

			`### llama-cpp-python install fails`

Clarify variable adjustment in README.md 2026-06-02 17:41:18 +02:00			`- Ensure llama.cpp is built successfully first (or build it yourself if you want to use a backend other than CUDA, ROCm or Vulkan)`
Initial commit 2026-06-02 15:45:59 +02:00			- Try CPU-only install first to verify: `pip install llama-cpp-python`
Clarify variable adjustment in README.md 2026-06-02 17:41:18 +02:00			`- Check [llama-cpp-python docs](https://llama-cpp-python.readthedocs.io/en/latest/)`
Initial commit 2026-06-02 15:45:59 +02:00
			`## Project Structure`

			```
			`├── finetune.py ← Training script`
			`├── merge.py ← Merge LoRA into base model`
			`├── synthetic-data.py ← Generate synthetic training data`
			`├── requirements.txt ← Python dependencies`
			`├── setup.sh ← One-time setup`
			`├── run-pipeline.sh ← Run full pipeline`
			`├── scripts/`
			`│ ├── generate-data.sh`
			`│ ├── finetune.sh`
			`│ ├── merge-and-convert.sh`
			`│ └── run-model.sh`
			`└── README.md`
			```