feat: add llama-swap as a backend
All checks were successful
PR Tests / test (pull_request) Successful in 1m18s
NYX Security Scan / nyx-scan (pull_request) Successful in 6m19s

This commit is contained in:
Alpha Nerd 2026-06-14 16:34:31 +02:00
parent c8da58430a
commit aa8baebac5
Signed by: alpha-nerd
SSH key fingerprint: SHA256:QkkAgVoYi9TQ0UKPkiKSfnerZy2h4qhi3SVPXJmBN+M
17 changed files with 544 additions and 52 deletions

View file

@ -78,6 +78,37 @@ endpoints:
- OpenAI-compatible endpoints use `/v1` prefix
- The router automatically detects endpoint type based on URL pattern
### `llama_server_endpoints`
**Type**: `list[str]` (optional)
**Default**: `[]`
**Description**: List of [llama.cpp `llama-server`](https://github.com/ggml-org/llama.cpp) endpoints (OpenAI-compatible, configured with the `/v1` suffix). The router reads each backend's loaded models from `/v1/models` (entries with `status == "loaded"`) and unloads idle models via `POST /models/unload`.
```yaml
llama_server_endpoints:
- http://192.168.0.50:8889/v1
```
### `llama_swap_endpoints`
**Type**: `list[str]` (optional)
**Default**: `[]`
**Description**: List of [llama-swap](https://github.com/mostlygeek/llama-swap) endpoints (OpenAI-compatible, configured with the `/v1` suffix). llama-swap fronts multiple `llama-server` workers behind one address. It is treated like `llama_server_endpoints` for routing, model discovery, and reranking, but differs in two ways the router handles automatically:
- **Loaded-model detection** — llama-swap's `/v1/models` omits the per-model `status` field, so running workers are read from `GET /running` (entries with `state == "ready"`).
- **Model unload** — done via `POST /api/models/unload/:model_id` (path parameter), not the `llama-server` body form.
The router also exposes a passthrough route, `GET|POST /upstream/:model_id/<path>`, which forwards directly to a model's underlying `llama-server` worker (via llama-swap's `/upstream`), letting clients use `llama-server` features that llama-swap does not forward (e.g. token-array prompts).
```yaml
llama_swap_endpoints:
- http://192.168.0.50:8890/v1
```
### `max_concurrent_connections`
**Type**: `int`