feat: add llama-swap as a backend

2026-06-14 16:34:31 +02:00 · aa8baebac5
commit aa8baebac5
parent c8da58430a
17 changed files with 544 additions and 52 deletions
--- a/doc/configuration.md
+++ b/doc/configuration.md
@@ -78,6 +78,37 @@ endpoints:
 - OpenAI-compatible endpoints use `/v1` prefix
 - The router automatically detects endpoint type based on URL pattern

+### `llama_server_endpoints`
+
+**Type**: `list[str]` (optional)
+
+**Default**: `[]`
+
+**Description**: List of [llama.cpp `llama-server`](https://github.com/ggml-org/llama.cpp) endpoints (OpenAI-compatible, configured with the `/v1` suffix). The router reads each backend's loaded models from `/v1/models` (entries with `status == "loaded"`) and unloads idle models via `POST /models/unload`.
+
+```yaml
+llama_server_endpoints:
+  - http://192.168.0.50:8889/v1
+```
+
+### `llama_swap_endpoints`
+
+**Type**: `list[str]` (optional)
+
+**Default**: `[]`
+
+**Description**: List of [llama-swap](https://github.com/mostlygeek/llama-swap) endpoints (OpenAI-compatible, configured with the `/v1` suffix). llama-swap fronts multiple `llama-server` workers behind one address. It is treated like `llama_server_endpoints` for routing, model discovery, and reranking, but differs in two ways the router handles automatically:
+
+- **Loaded-model detection** — llama-swap's `/v1/models` omits the per-model `status` field, so running workers are read from `GET /running` (entries with `state == "ready"`).
+- **Model unload** — done via `POST /api/models/unload/:model_id` (path parameter), not the `llama-server` body form.
+
+The router also exposes a passthrough route, `GET|POST /upstream/:model_id/<path>`, which forwards directly to a model's underlying `llama-server` worker (via llama-swap's `/upstream`), letting clients use `llama-server` features that llama-swap does not forward (e.g. token-array prompts).
+
+```yaml
+llama_swap_endpoints:
+  - http://192.168.0.50:8890/v1
+```
+
 ### `max_concurrent_connections`

 **Type**: `int`