# Load testing the NOMYO Router
`loadtest.py` is a self-contained load generator (asyncio + httpx) with a built-in
**mock backend** so you can measure the router's own concurrency ceiling on a given
machine — independent of real GPU/backend compute.
It answers the question *"how many concurrent connections can the router sustain
on this box?"* by hammering it with N concurrent virtual clients and reporting
throughput, latency percentiles and (for streaming) time-to-first-token.
Run everything from the project root with the project venv active:
```bash
source ~/.venv/nomyo-router/bin/activate # whatever venv has the router deps
```
## The three modes
### 1. `--mock-backend` (recommended) — fully self-contained
Spawns a fast fake Ollama/OpenAI backend **and** the router (wired to it via a
temporary config), drives load against the router, then tears both down. Because
the backend is trivial, the numbers reflect the **router's proxy overhead**, not
model inference time.
```bash
python test/load/loadtest.py --mock-backend --stream --concurrency 128 --duration 30
```
### 2. Default — drive an already-running router
```bash
python test/load/loadtest.py --url http://127.0.0.1:12434 \
--api ollama --stream --concurrency 64 --duration 30 --model llama3
```
### 3. `--serve-mock` — just the mock backend
Run only the fake backend and point your own router `config.yaml` at it
(`endpoints: [http://127.0.0.1:11434]`):
```bash
python test/load/loadtest.py --serve-mock --mock-port 11434 --mock-tokens 64
```
## Finding the concurrency knee
`--ramp` sweeps several concurrency levels and prints a table. The knee is where
`req/s` stops rising and `p99` latency starts climbing sharply:
```bash
python test/load/loadtest.py --mock-backend --stream \
--ramp 8,32,64,128,256 --duration 15
```
```
conc req ok err req/s p50ms p90ms p99ms maxms ttftP50 ttftP99
---------------------------------------------------------------------------------------------
8 120 120 0 19.8 404.6 448.3 478.6 501.4 358.4 391.7
32 140 140 0 21.5 1487.1 1641.8 2341.8 2397.4 1269.8 1476.3
64 148 148 0 21.3 2953.0 4632.5 5204.3 5267.0 1207.8 3031.7
128 168 168 0 19.0 6376.4 8608.9 8726.9 8739.8 2843.1 8348.6
```
> Reading the table above: throughput stays flat (~20 req/s) while latency grows
> linearly with concurrency — the classic signature of a **single-worker
> serialization bottleneck**. Raising `--router-workers` lets throughput scale
> across CPU cores; the per-worker ceiling is what each table row measures.
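
The flat-throughput reading can be sanity-checked with Little's law (throughput ≈ in-flight requests / latency). The sketch below is not part of `loadtest.py`; it copies four columns from the table above, uses p50 as a rough stand-in for mean latency, and picks a knee with a hypothetical 10% minimum-gain threshold.

```python
# Rows copied from the ramp table above: (concurrency, req/s, p50 ms, p99 ms).
ROWS = [
    (8, 19.8, 404.6, 478.6),
    (32, 21.5, 1487.1, 2341.8),
    (64, 21.3, 2953.0, 5204.3),
    (128, 19.0, 6376.4, 8726.9),
]


def littles_law_rps(concurrency: int, latency_ms: float) -> float:
    """Throughput implied by Little's law: in-flight requests / latency."""
    return concurrency / (latency_ms / 1000.0)


def find_knee(rows, min_gain: float = 0.10) -> int:
    """Return the last concurrency level that still raised req/s by at least
    `min_gain` (an assumed threshold) over the previous level."""
    knee = rows[0][0]
    for prev, cur in zip(rows, rows[1:]):
        if cur[1] < prev[1] * (1 + min_gain):
            break  # throughput has flattened; stop at the previous level
        knee = cur[0]
    return knee


for conc, rps, p50, _p99 in ROWS:
    print(f"conc={conc:4d} measured={rps:5.1f} implied={littles_law_rps(conc, p50):5.1f}")
print("knee at or below concurrency", find_knee(ROWS))
```

For this run the implied and measured throughput agree closely, and the knee lands at the lowest level tested, so a finer sweep below 8 would be needed to locate it precisely.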
## Streaming vs non-streaming, Ollama vs OpenAI
| flag | effect |
|------|--------|
| `--stream` / `--no-stream` | streamed response (default) vs a single buffered response |
| `--api ollama` | drives `POST /api/chat` (default) |
| `--api openai` | drives `POST /v1/chat/completions` |
Streaming runs additionally report **TTFT** (time-to-first-token), which isolates
prefill/routing latency from total stream duration.
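
To make the TTFT definition concrete, here is a minimal sketch of timing a stream on the client side. It does not reproduce how `loadtest.py` measures TTFT; `fake_stream` is a hypothetical stand-in for a streamed HTTP response so the example runs without a network.

```python
import asyncio
import time


async def fake_stream(ttft_s: float, tokens: int, tok_s: float):
    """Hypothetical stand-in for a streamed response body."""
    await asyncio.sleep(ttft_s)  # prefill before the first token
    for i in range(tokens):
        if i:
            await asyncio.sleep(tok_s)  # per-token decode delay
        yield f"tok{i}"


async def timed_consume(stream):
    """Return (ttft_seconds, total_seconds, chunk_count) for one stream."""
    start = time.perf_counter()
    ttft = None
    count = 0
    async for _chunk in stream:
        if ttft is None:
            # First chunk seen: everything before this is prefill + routing.
            ttft = time.perf_counter() - start
        count += 1
    return ttft, time.perf_counter() - start, count


ttft, total, count = asyncio.run(timed_consume(fake_stream(0.05, 5, 0.01)))
print(f"ttft={ttft * 1000:.0f} ms total={total * 1000:.0f} ms chunks={count}")
```

The gap between `ttft` and `total` is the decode phase, which is why the two numbers can move independently as concurrency rises.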
## Shaping the mock backend (the "fake GPU")
The mock's latency is fully configurable, so you can model anything from an
instant echo (measure pure proxy overhead) to a slow, long-streaming model
(measure how many slow streams the box holds open at once):
| flag | meaning |
|------|---------|
| `--mock-ttft-ms` | prefill latency before the first token (ms) |
| `--mock-tokens` | number of completion tokens emitted |
| `--mock-tok-ms` | per-token decode delay (ms) — inverse of tokens/sec |
| `--mock-models` | comma-separated model names advertised in `/api/tags` & `/api/ps` |
Example — simulate a realistic 40 tok/s model with 300 ms prefill emitting 200
tokens, and see how many concurrent such streams the router holds:
```bash
python test/load/loadtest.py --mock-backend --stream --ramp 16,64,256 \
--mock-ttft-ms 300 --mock-tokens 200 --mock-tok-ms 25 --duration 20
```
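
The arithmetic behind that example can be worked out in a few lines. This is a back-of-the-envelope sketch, assuming the mock sleeps once for prefill and once per token (the exact delay placement inside the mock is an assumption) and ignoring router and network overhead.

```python
def stream_seconds(ttft_ms: float, tokens: int, tok_ms: float) -> float:
    """Ideal duration of one mock stream: prefill plus one delay per token."""
    return (ttft_ms + tokens * tok_ms) / 1000.0


def ideal_rps(concurrency: int, seconds_per_stream: float) -> float:
    """Best-case completed streams per second with every client kept busy."""
    return concurrency / seconds_per_stream


tok_per_s = 1000 / 25                 # --mock-tok-ms 25 -> 40 tokens/s
dur = stream_seconds(300, 200, 25)    # 300 ms + 200 * 25 ms = 5.3 s
print(f"{tok_per_s:.0f} tok/s, {dur:.1f} s per stream")
for conc in (16, 64, 256):
    print(f"conc={conc:3d} ideal req/s={ideal_rps(conc, dur):.1f}")
```

Measured `req/s` falling well below these ideal figures at a given concurrency suggests the router, not the mock, is the limiting factor at that level.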
## Load shape & misc flags
| flag | default | meaning |
|------|---------|---------|
| `--concurrency N` | 32 | concurrent virtual clients |
| `--duration S` | 20 | seconds per stage (ignored if `--requests` is set) |
| `--requests N` | — | send exactly N requests instead of running for `--duration` seconds |
| `--warmup S` | 2 | unmeasured warmup before each stage (hot caches/connections) |
| `--timeout S` | 120 | per-request timeout |
| `--model NAME` | `mock` | model name requested (must match what the backend advertises) |
| `--prompt STR` | … | user prompt sent in every request |
| `--json PATH` | — | also write the full results as JSON |
### `--mock-backend` orchestration knobs
| flag | default | meaning |
|------|---------|---------|
| `--router-workers N` | 1 | `uvicorn --workers` for the spawned router |
| `--router-max-conc N` | = peak concurrency | `max_concurrent_connections` in the generated config (so the router doesn't queue unless you want it to) |
| `--router-port` / `--mock-port` | auto | fix the ports instead of auto-picking free ones |
| `--keep-config` | off | keep the generated temp `config.yaml` for inspection |
## Notes & caveats
- **Single-machine bias.** With `--mock-backend`, the driver, router and mock all
share the same CPU, so they compete for cores. For an upper-bound number, run
the driver on a separate machine against a real router (`--url`), or pin
processes to different cores.
- The generated config sets `conversation_affinity: false` and
`cache_enabled: false` to measure the raw proxy path. The temp config and a
throwaway token DB (under the system temp dir) are deleted on exit.
- To measure the router's *admission* limit instead of raw throughput, set
`--router-max-conc` low (e.g. `2`) — requests beyond the limit queue on the
least-busy endpoint rather than erroring.
- Requires the router's own dependencies (`aiohttp`, `httpx`, `uvicorn`, …). It
reuses the project venv, so no extra packages are needed.
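
For the core-pinning suggestion in the first note, one possible approach is sketched below. It is not part of `loadtest.py`: `split_cores` and `pin` are hypothetical helpers, the one-core-each layout for driver and mock is an assumption, and `os.sched_setaffinity` is only available on Linux.

```python
import os


def split_cores(cores):
    """Split core IDs into disjoint driver/mock/router groups.

    Assumed layout: one core each for driver and mock, the rest for the router.
    """
    cores = sorted(cores)
    if len(cores) < 3:
        raise ValueError("need at least 3 cores for disjoint pinning")
    return {"driver": {cores[0]}, "mock": {cores[1]}, "router": set(cores[2:])}


def pin(pid: int, cores) -> bool:
    """Pin a process to the given cores; return False where unsupported."""
    if not hasattr(os, "sched_setaffinity"):  # Linux-only API
        return False
    os.sched_setaffinity(pid, cores)  # pid 0 means the calling process
    return True


layout = split_cores(range(8))
print(layout)
# Example: pin(0, layout["driver"]) would restrict the current process.
```

This fits most naturally when each process is started separately (a `--serve-mock` backend, your own router, and a driver using `--url`), since you then know each PID. On Linux, `split_cores(os.sched_getaffinity(0))` would use the cores actually available rather than an assumed range.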