# Load testing the NOMYO Router

`loadtest.py` is a self-contained load generator (asyncio + httpx) with a built-in
**mock backend** so you can measure the router's own concurrency ceiling on a
given machine, independent of real GPU/backend compute. It answers the question
*"how many concurrent connections can the router sustain on this box?"* by
hammering it with N concurrent virtual clients and reporting throughput, latency
percentiles and (for streaming) time-to-first-token.

Run everything from the project root with the project venv active:

```bash
source ~/.venv/nomyo-router/bin/activate  # whatever venv has the router deps
```

## The three modes

### 1. `--mock-backend` (recommended): fully self-contained

Spawns a fast fake Ollama/OpenAI backend **and** the router (wired to it via a
temporary config), drives load against the router, then tears both down. Because
the backend is trivial, the numbers reflect the **router's proxy overhead**, not
model inference time.

```bash
python test/load/loadtest.py --mock-backend --stream --concurrency 128 --duration 30
```

### 2. Default: drive an already-running router

```bash
python test/load/loadtest.py --url http://127.0.0.1:12434 \
    --api ollama --stream --concurrency 64 --duration 30 --model llama3
```

### 3. `--serve-mock`: just the mock backend

Run only the fake backend and point your own router `config.yaml` at it
(`endpoints: [http://127.0.0.1:11434]`):

```bash
python test/load/loadtest.py --serve-mock --mock-port 11434 --mock-tokens 64
```

## Finding the concurrency knee

`--ramp` sweeps several concurrency levels and prints a table.
The knee is where `req/s` stops rising and `p99` latency starts climbing sharply:

```bash
python test/load/loadtest.py --mock-backend --stream \
    --ramp 8,32,64,128,256 --duration 15
```

```
  conc   req    ok   err    req/s     p50ms     p90ms     p99ms     maxms   ttftP50   ttftP99
---------------------------------------------------------------------------------------------
     8   120   120     0     19.8     404.6     448.3     478.6     501.4     358.4     391.7
    32   140   140     0     21.5    1487.1    1641.8    2341.8    2397.4    1269.8    1476.3
    64   148   148     0     21.3    2953.0    4632.5    5204.3    5267.0    1207.8    3031.7
   128   168   168     0     19.0    6376.4    8608.9    8726.9    8739.8    2843.1    8348.6
```

> Reading the table above: throughput stays flat (~20 req/s) while latency grows
> linearly with concurrency, the classic signature of a **single-worker
> serialization bottleneck**. Raising `--router-workers` lets throughput scale
> across CPU cores; the per-worker ceiling is what each table row measures.

## Streaming vs non-streaming, Ollama vs OpenAI

| flag | effect |
|------|--------|
| `--stream` / `--no-stream` | streamed response (default) vs a single buffered response |
| `--api ollama` | drives `POST /api/chat` (default) |
| `--api openai` | drives `POST /v1/chat/completions` |

Streaming runs additionally report **TTFT** (time-to-first-token), which isolates
prefill/routing latency from total stream duration.

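To make the TTFT/total split concrete, here is a minimal sketch of timing one streamed response. It is not taken from `loadtest.py`: the names `time_stream` and `fake_stream` are hypothetical, and the fake stream only imitates a backend with a prefill delay followed by evenly spaced tokens.

```python
import asyncio
import time
from typing import AsyncIterator, Optional, Tuple


async def time_stream(chunks: AsyncIterator[bytes]) -> Tuple[Optional[float], float, int]:
    """Return (ttft_ms, total_ms, n_chunks) for one streamed response.

    ttft_ms is the delay until the first non-empty chunk (None if the
    stream produced nothing); total_ms covers the whole stream.
    """
    start = time.perf_counter()
    ttft_ms: Optional[float] = None
    n_chunks = 0
    async for chunk in chunks:
        if not chunk:
            continue  # skip empty keep-alive chunks
        if ttft_ms is None:
            ttft_ms = (time.perf_counter() - start) * 1000.0
        n_chunks += 1
    total_ms = (time.perf_counter() - start) * 1000.0
    return ttft_ms, total_ms, n_chunks


async def fake_stream(ttft_s: float, tokens: int, tok_s: float) -> AsyncIterator[bytes]:
    """Hypothetical stand-in for a backend: prefill delay, then spaced tokens."""
    await asyncio.sleep(ttft_s)          # "prefill" before the first token
    for i in range(tokens):
        if i:
            await asyncio.sleep(tok_s)   # "decode" delay between tokens
        yield b"tok"


if __name__ == "__main__":
    ttft, total, n = asyncio.run(time_stream(fake_stream(0.05, 5, 0.01)))
    print(f"ttft={ttft:.1f} ms total={total:.1f} ms chunks={n}")
```

Against a real endpoint, the same function could be fed the byte iterator of an httpx streaming response (`response.aiter_bytes()` inside `async with client.stream(...)`), assuming each chunk carries at least part of a token.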
## Shaping the mock backend (the "fake GPU")

The mock's latency is fully configurable, so you can model anything from an
instant echo (measure pure proxy overhead) to a slow, long-streaming model
(measure how many slow streams the box holds open at once):

| flag | meaning |
|------|---------|
| `--mock-ttft-ms` | prefill latency before the first token (ms) |
| `--mock-tokens` | number of completion tokens emitted |
| `--mock-tok-ms` | per-token decode delay (ms), the inverse of tokens/sec (25 ms = 40 tok/s) |
| `--mock-models` | comma-separated model names advertised in `/api/tags` & `/api/ps` |

Example: simulate a realistic 40 tok/s model with 300 ms prefill emitting 200
tokens, and see how many concurrent such streams the router holds:

```bash
python test/load/loadtest.py --mock-backend --stream --ramp 16,64,256 \
    --mock-ttft-ms 300 --mock-tokens 200 --mock-tok-ms 25 --duration 20
```

## Load shape & misc flags

| flag | default | meaning |
|------|---------|---------|
| `--concurrency N` | 32 | concurrent virtual clients |
| `--duration S` | 20 | seconds per stage (ignored if `--requests` set) |
| `--requests N` | - | send exactly N requests instead of running for a fixed duration |
| `--warmup S` | 2 | unmeasured warmup before each stage (hot caches/connections) |
| `--timeout S` | 120 | per-request timeout (seconds) |
| `--model NAME` | `mock` | model name requested (must match what the backend advertises) |
| `--prompt STR` | … | user prompt sent in every request |
| `--json PATH` | - | also write the full results as JSON |

### `--mock-backend` orchestration knobs

| flag | default | meaning |
|------|---------|---------|
| `--router-workers N` | 1 | `uvicorn --workers` for the spawned router |
| `--router-max-conc N` | = peak concurrency | `max_concurrent_connections` in the generated config (so the router doesn't queue unless you want it to) |
| `--router-port` / `--mock-port` | auto | fix the ports instead of auto-picking free ones |
| `--keep-config` | off | keep the generated temp `config.yaml` for inspection |

## Notes & caveats

- **Single-machine bias.** With `--mock-backend`, the driver, router and mock all
  share the same CPU, so they compete for cores. For an upper-bound number, run
  the driver on a separate machine against a real router (`--url`), or pin
  processes to different cores.
- The generated config sets `conversation_affinity: false` and
  `cache_enabled: false` to measure the raw proxy path. The temp config and a
  throwaway token DB (under the system temp dir) are deleted on exit.
- To measure the router's *admission* limit instead of raw throughput, set
  `--router-max-conc` low (e.g. `2`); requests beyond the limit queue on the
  least-busy endpoint rather than erroring.
- Requires the router's own dependencies (`aiohttp`, `httpx`, `uvicorn`, …); it
  reuses the project venv, no extra packages needed.
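
When choosing mock shapes like the 40 tok/s example in "Shaping the mock backend", it can help to work out the ideal stream duration and the throughput ceiling it implies before running a ramp. The sketch below rests on two assumptions that the source does not confirm: that a mock stream lasts roughly `ttft + tokens × tok_ms`, and that each virtual client sends its next request only after the previous one finishes. The function names are hypothetical.

```python
def tokens_per_sec(tok_ms: float) -> float:
    """Decode rate implied by a per-token delay in milliseconds."""
    return 1000.0 / tok_ms


def mock_stream_ms(ttft_ms: float, tokens: int, tok_ms: float) -> float:
    """Assumed ideal stream duration: prefill plus one delay per token."""
    return ttft_ms + tokens * tok_ms


def max_req_per_sec(concurrency: int, stream_ms: float) -> float:
    """Throughput ceiling if every client runs back-to-back streams
    with zero router overhead (assumption: closed-loop clients)."""
    return concurrency / (stream_ms / 1000.0)


if __name__ == "__main__":
    # Values from the example: --mock-ttft-ms 300 --mock-tokens 200 --mock-tok-ms 25
    duration = mock_stream_ms(300, 200, 25)
    print(f"{tokens_per_sec(25):.0f} tok/s, {duration:.0f} ms per stream")
    for conc in (16, 64, 256):
        print(f"conc={conc:>3}  ceiling={max_req_per_sec(conc, duration):.1f} req/s")
```

Under those assumptions the example flags give 5300 ms per stream, so the ceilings are about 3.0, 12.1 and 48.3 req/s at 16, 64 and 256 clients. Measured `req/s` well below the ceiling at a given level suggests the router, rather than the mock, is the limit.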