# Load testing the NOMYO Router

`loadtest.py` is a self-contained load generator (asyncio + httpx) with a built-in
**mock backend** so you can measure the router's own concurrency ceiling on a given
machine, independent of real GPU/backend compute.

It answers the question *"how many concurrent connections can the router sustain
on this box?"* by hammering it with N concurrent virtual clients and reporting
throughput, latency percentiles and (for streaming) time-to-first-token.

Run everything from the project root with the project venv active:

```bash
source ~/.venv/nomyo-router/bin/activate  # whatever venv has the router deps
```

## The three modes

### 1. `--mock-backend` (recommended): fully self-contained

Spawns a fast fake Ollama/OpenAI backend **and** the router (wired to it via a
temporary config), drives load against the router, then tears both down. Because
the backend is trivial, the numbers reflect the **router's proxy overhead**, not
model inference time.

```bash
python test/load/loadtest.py --mock-backend --stream --concurrency 128 --duration 30
```

### 2. Default: drive an already-running router

```bash
python test/load/loadtest.py --url http://127.0.0.1:12434 \
  --api ollama --stream --concurrency 64 --duration 30 --model llama3
```

### 3. `--serve-mock`: just the mock backend

Run only the fake backend and point your own router `config.yaml` at it
(`endpoints: [http://127.0.0.1:11434]`):

```bash
python test/load/loadtest.py --serve-mock --mock-port 11434 --mock-tokens 64
```

## Finding the concurrency knee

`--ramp` sweeps several concurrency levels and prints a table. The knee is where
`req/s` stops rising and `p99` latency starts climbing sharply:

```bash
python test/load/loadtest.py --mock-backend --stream \
  --ramp 8,32,64,128,256 --duration 15
```

```
  conc     req      ok     err    req/s    p50ms    p90ms    p99ms    maxms  ttftP50  ttftP99
---------------------------------------------------------------------------------------------
     8     120     120       0     19.8    404.6    448.3    478.6    501.4    358.4    391.7
    32     140     140       0     21.5   1487.1   1641.8   2341.8   2397.4   1269.8   1476.3
    64     148     148       0     21.3   2953.0   4632.5   5204.3   5267.0   1207.8   3031.7
   128     168     168       0     19.0   6376.4   8608.9   8726.9   8739.8   2843.1   8348.6
```
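
If you want to pick the knee out of a ramp table mechanically, one possible heuristic is to stop at the first level where throughput no longer improves by some margin. The sketch below is not part of `loadtest.py`; `find_knee` and its 10% `min_gain` threshold are hypothetical names and values chosen for illustration, and the rows are copied from the sample output above.

```python
# (concurrency, req/s) pairs copied from the sample ramp table.
ROWS = [
    (8, 19.8),
    (32, 21.5),
    (64, 21.3),
    (128, 19.0),
]


def find_knee(rows, min_gain=0.10):
    """Return the last concurrency level that still improved throughput.

    `min_gain` is an assumed threshold: a step counts as an improvement only
    if req/s grew by at least this fraction over the previous level.
    """
    knee = rows[0][0]
    for (_, prev_rps), (conc, rps) in zip(rows, rows[1:]):
        if rps < prev_rps * (1 + min_gain):
            break  # throughput has plateaued; keep the previous level
        knee = conc
    return knee


print(find_knee(ROWS))
```

For the sample table this returns the lowest level tested (8), which suggests the router was already saturated there and that a ramp starting below 8 would be needed to see the rising part of the curve.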

> Reading the table above: throughput stays flat (~20 req/s) while latency grows
> linearly with concurrency, the classic signature of a **single-worker
> serialization bottleneck**. The run used the default `--router-workers 1`, so
> each row measures the per-worker ceiling; raising `--router-workers` lets
> throughput scale across CPU cores.
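
The linear latency growth is what Little's law predicts: in-flight requests = throughput × mean latency, so with throughput pinned near 20 req/s, latency has to be roughly `concurrency / req_per_s`. The check below is a small sketch, not part of `loadtest.py`; it uses p50 as a stand-in for mean latency, which is an assumption, and `predicted_latency_ms` is a hypothetical helper name.

```python
# (concurrency, req/s, p50 latency in ms) copied from the sample ramp table.
ROWS = [
    (8, 19.8, 404.6),
    (32, 21.5, 1487.1),
    (64, 21.3, 2953.0),
    (128, 19.0, 6376.4),
]


def predicted_latency_ms(concurrency, req_per_s):
    """Little's law rearranged: latency = in-flight requests / throughput."""
    return concurrency / req_per_s * 1000.0


for conc, rps, p50 in ROWS:
    predicted = predicted_latency_ms(conc, rps)
    print(f"conc={conc:4d}  predicted={predicted:8.1f} ms  measured p50={p50:8.1f} ms")
```

For the sample rows the prediction stays within about 6% of the measured p50, which supports reading the table as a fixed-throughput bottleneck rather than as per-request slowdown.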

## Streaming vs non-streaming, Ollama vs OpenAI

| flag | effect |
|------|--------|
| `--stream` / `--no-stream` | streamed response (default) vs a single buffered response |
| `--api ollama` | drives `POST /api/chat` (default) |
| `--api openai` | drives `POST /v1/chat/completions` |

Streaming runs additionally report **TTFT** (time-to-first-token), which isolates
prefill/routing latency from total stream duration.
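
To make the TTFT/total split concrete, here is a minimal sketch of timing a streamed response. It does not reproduce how `loadtest.py` measures TTFT; `fake_stream` is a hypothetical in-process stand-in for a streamed HTTP body, and `timed_request` is an illustrative helper name.

```python
import asyncio
import time


async def fake_stream(ttft_s, tokens, tok_s):
    """Hypothetical stand-in for a streamed response body."""
    await asyncio.sleep(ttft_s)         # prefill delay before the first token
    for i in range(tokens):
        if i:
            await asyncio.sleep(tok_s)  # per-token decode delay
        yield f"tok{i}"


async def timed_request(stream):
    """Consume a stream and return (ttft_seconds, total_seconds, chunk_count)."""
    start = time.perf_counter()
    ttft = None
    chunks = 0
    async for _ in stream:
        if ttft is None:
            ttft = time.perf_counter() - start  # first chunk arrived
        chunks += 1
    return ttft, time.perf_counter() - start, chunks


ttft, total, chunks = asyncio.run(timed_request(fake_stream(0.05, 5, 0.01)))
print(f"ttft={ttft * 1000:.0f} ms  total={total * 1000:.0f} ms  chunks={chunks}")
```

TTFT is fixed once the first chunk arrives, so a long decode phase inflates only the total duration.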

## Shaping the mock backend (the "fake GPU")

The mock's latency is fully configurable, so you can model anything from an
instant echo (measure pure proxy overhead) to a slow, long-streaming model
(measure how many slow streams the box holds open at once):

| flag | meaning |
|------|---------|
| `--mock-ttft-ms` | prefill latency before the first token (ms) |
| `--mock-tokens` | number of completion tokens emitted |
| `--mock-tok-ms` | per-token decode delay (ms), i.e. 1000 / tokens-per-second |
| `--mock-models` | comma-separated model names advertised in `/api/tags` & `/api/ps` |

Example: simulate a realistic 40 tok/s model (25 ms per token) with 300 ms prefill
emitting 200 tokens, and see how many concurrent such streams the router holds:

```bash
python test/load/loadtest.py --mock-backend --stream --ramp 16,64,256 \
  --mock-ttft-ms 300 --mock-tokens 200 --mock-tok-ms 25 --duration 20
```
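
The arithmetic behind those flag values can be sketched as follows. This is a back-of-the-envelope estimate, not output from `loadtest.py`: the helper names are hypothetical, and it assumes each stream lasts roughly prefill plus one decode delay per token, with no router overhead.

```python
def tok_ms_from_rate(tokens_per_s):
    """Per-token delay in ms for a target decode rate (the --mock-tok-ms value)."""
    return 1000.0 / tokens_per_s


def stream_duration_ms(ttft_ms, tokens, tok_ms):
    """Approximate time one mock stream stays open: prefill plus decode."""
    return ttft_ms + tokens * tok_ms


tok_ms = tok_ms_from_rate(40)                    # 25.0 ms per token
duration = stream_duration_ms(300, 200, tok_ms)  # 5300.0 ms per stream

for conc in (16, 64, 256):
    # With zero proxy overhead, each client completes one stream per `duration`.
    ideal_rps = conc / (duration / 1000.0)
    print(f"conc={conc:3d}  ideal ~ {ideal_rps:.1f} req/s")
```

Comparing the measured `req/s` at each ramp level against this ideal gives a rough sense of how much throughput the router itself costs once streams are slow.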

## Load shape & misc flags

| flag | default | meaning |
|------|---------|---------|
| `--concurrency N` | 32 | concurrent virtual clients |
| `--duration S` | 20 | seconds per stage (ignored if `--requests` is set) |
| `--requests N` | (none) | send exactly N requests instead of running for a fixed duration |
| `--warmup S` | 2 | seconds of unmeasured warmup before each stage (to warm caches and connections) |
| `--timeout S` | 120 | per-request timeout in seconds |
| `--model NAME` | `mock` | model name requested (must match what the backend advertises) |
| `--prompt STR` | … | user prompt sent in every request |
| `--json PATH` | (none) | also write the full results as JSON |

### `--mock-backend` orchestration knobs

| flag | default | meaning |
|------|---------|---------|
| `--router-workers N` | 1 | `uvicorn --workers` for the spawned router |
| `--router-max-conc N` | = peak concurrency | `max_concurrent_connections` in the generated config (so the router doesn't queue unless you want it to) |
| `--router-port` / `--mock-port` | auto | fix the ports instead of auto-picking free ones |
| `--keep-config` | off | keep the generated temp `config.yaml` for inspection |

## Notes & caveats

- **Single-machine bias.** With `--mock-backend`, the driver, router and mock all
  share the same CPU, so they compete for cores. For an upper-bound number, run
  the driver on a separate machine against a real router (`--url`), or pin
  processes to different cores.
- The generated config sets `conversation_affinity: false` and
  `cache_enabled: false` to measure the raw proxy path. The temp config (unless
  `--keep-config` is set) and a throwaway token DB (under the system temp dir)
  are deleted on exit.
- To measure the router's *admission* limit instead of raw throughput, set
  `--router-max-conc` low (e.g. `2`); requests beyond the limit queue on the
  least-busy endpoint rather than erroring.
- Requires the router's own dependencies (`aiohttp`, `httpx`, `uvicorn`, …). It
  reuses the project venv, so no extra packages are needed.
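
The "queue rather than error" behaviour in the admission-limit note above can be pictured with a plain `asyncio.Semaphore`. This is only an illustration of the concept; the router's actual admission mechanism is not shown in this README and may differ, and `run_with_limit` is a hypothetical helper.

```python
import asyncio


async def run_with_limit(n_requests, max_conc, work_s=0.01):
    """Run n_requests tasks with at most max_conc in flight; return (peak, done)."""
    gate = asyncio.Semaphore(max_conc)  # assumed stand-in for the admission limit
    in_flight = peak = done = 0

    async def request():
        nonlocal in_flight, peak, done
        async with gate:                 # excess requests wait here instead of failing
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(work_s)  # simulated backend work
            in_flight -= 1
            done += 1

    await asyncio.gather(*(request() for _ in range(n_requests)))
    return peak, done


peak, done = asyncio.run(run_with_limit(n_requests=8, max_conc=2))
print(f"peak in-flight={peak}, completed={done}")
```

All eight requests complete, but never more than two run at once, so a low limit shows up in a load test as rising latency with zero errors.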