Load testing the NOMYO Router
loadtest.py is a self-contained load generator (asyncio + httpx) with a built-in
mock backend so you can measure the router's own concurrency ceiling on a given
machine — independent of real GPU/backend compute.
It answers the question "how many concurrent connections can the router sustain on this box?" by hammering it with N concurrent virtual clients and reporting throughput, latency percentiles and (for streaming) time-to-first-token.
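To make the measurement concrete, here is a minimal sketch of the virtual-client pattern described above. It is not the code in loadtest.py: `fake_request` is a hypothetical stand-in that sleeps instead of calling the router, and the nearest-rank `percentile` helper is only one assumed way of computing the reported percentiles.

```python
import asyncio
import math
import time


async def fake_request() -> None:
    """Hypothetical stand-in for one HTTP call to the router."""
    await asyncio.sleep(0.01)


async def virtual_client(deadline: float, latencies: list[float]) -> None:
    """Send requests back to back until the deadline passes."""
    while time.perf_counter() < deadline:
        start = time.perf_counter()
        await fake_request()
        latencies.append(time.perf_counter() - start)


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile (pct in 0..100) of a non-empty sample list."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


async def run_stage(concurrency: int, duration: float) -> dict[str, float]:
    """Run `concurrency` clients for `duration` seconds and summarize."""
    latencies: list[float] = []
    start = time.perf_counter()
    deadline = start + duration
    await asyncio.gather(
        *(virtual_client(deadline, latencies) for _ in range(concurrency))
    )
    elapsed = time.perf_counter() - start
    return {
        "requests": len(latencies),
        "req_per_s": len(latencies) / elapsed,
        "p50_ms": percentile(latencies, 50) * 1000,
        "p99_ms": percentile(latencies, 99) * 1000,
    }


if __name__ == "__main__":
    print(asyncio.run(run_stage(concurrency=8, duration=0.5)))
```

Each client is a closed loop: it sends its next request only after the previous one finishes, so the number of requests in flight never exceeds the configured concurrency.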
Run everything from the project root with the project venv active:
source ~/.venv/nomyo-router/bin/activate # whatever venv has the router deps
The three modes
1. --mock-backend (recommended) — fully self-contained
Spawns a fast fake Ollama/OpenAI backend and the router (wired to it via a temporary config), drives load against the router, then tears both down. Because the backend is trivial, the numbers reflect the router's proxy overhead, not model inference time.
python test/load/loadtest.py --mock-backend --stream --concurrency 128 --duration 30
2. Default — drive an already-running router
python test/load/loadtest.py --url http://127.0.0.1:12434 \
--api ollama --stream --concurrency 64 --duration 30 --model llama3
3. --serve-mock — just the mock backend
Run only the fake backend and point your own router config.yaml at it
(endpoints: [http://127.0.0.1:11434]):
python test/load/loadtest.py --serve-mock --mock-port 11434 --mock-tokens 64
Finding the concurrency knee
--ramp sweeps several concurrency levels and prints a table. The knee is where
req/s stops rising and p99 latency starts climbing sharply:
python test/load/loadtest.py --mock-backend --stream \
--ramp 8,32,64,128,256 --duration 15
conc req ok err req/s p50ms p90ms p99ms maxms ttftP50 ttftP99
---------------------------------------------------------------------------------------------
8 120 120 0 19.8 404.6 448.3 478.6 501.4 358.4 391.7
32 140 140 0 21.5 1487.1 1641.8 2341.8 2397.4 1269.8 1476.3
64 148 148 0 21.3 2953.0 4632.5 5204.3 5267.0 1207.8 3031.7
128 168 168 0 19.0 6376.4 8608.9 8726.9 8739.8 2843.1 8348.6
Reading the table above: throughput stays flat (~20 req/s) while latency grows linearly with concurrency, the classic signature of a single-worker serialization bottleneck. Raising --router-workers lets throughput scale across CPU cores; the per-worker ceiling is what each table row measures.
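The flat-throughput reading can be cross-checked with Little's law, which for a closed loop of clients says that requests in flight are roughly throughput times time in system. The sketch below applies it to the rows of the table above; using p50 as a stand-in for mean latency is an approximation, and the helper name is illustrative rather than part of loadtest.py.

```python
# (concurrency, req/s, p50 latency in ms) copied from the ramp table above.
ROWS = [
    (8, 19.8, 404.6),
    (32, 21.5, 1487.1),
    (64, 21.3, 2953.0),
    (128, 19.0, 6376.4),
]


def implied_concurrency(req_per_s: float, latency_ms: float) -> float:
    """Little's law: requests in flight = throughput * time in system."""
    return req_per_s * latency_ms / 1000.0


for conc, rps, p50_ms in ROWS:
    in_flight = implied_concurrency(rps, p50_ms)
    print(f"conc={conc:>3}  implied in-flight={in_flight:6.1f}")
```

Every implied value lands within about 10% of the configured concurrency, which is what you would expect when added clients only lengthen the queue instead of raising throughput.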
Streaming vs non-streaming, Ollama vs OpenAI
| flag | effect |
|---|---|
| --stream / --no-stream | streamed response (default) vs a single buffered response |
| --api ollama | drives POST /api/chat (default) |
| --api openai | drives POST /v1/chat/completions |
Streaming runs additionally report TTFT (time-to-first-token), which isolates prefill/routing latency from total stream duration.
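As a rough illustration of how TTFT differs from total stream duration, the sketch below times a stream by recording when the first chunk arrives and when the last one does. `fake_stream` is a hypothetical stand-in for a streamed response and `time_stream` is an illustrative helper; neither is taken from loadtest.py.

```python
import asyncio
import time
from collections.abc import AsyncIterator


async def fake_stream(ttft_s: float, tokens: int, tok_s: float) -> AsyncIterator[str]:
    """Hypothetical stand-in for a streamed chat response."""
    await asyncio.sleep(ttft_s)  # prefill delay before the first chunk
    for i in range(tokens):
        if i:
            await asyncio.sleep(tok_s)  # per-token decode delay
        yield f"tok{i}"


async def time_stream(stream: AsyncIterator[str]) -> tuple[float, float, int]:
    """Return (ttft_seconds, total_seconds, chunk_count) for one stream."""
    start = time.perf_counter()
    ttft = None
    chunks = 0
    async for _ in stream:
        if ttft is None:
            ttft = time.perf_counter() - start  # first chunk seen
        chunks += 1
    total = time.perf_counter() - start
    # An empty stream has no first token; fall back to the total time.
    return (ttft if ttft is not None else total), total, chunks


if __name__ == "__main__":
    print(asyncio.run(time_stream(fake_stream(0.05, 5, 0.01))))
```

The gap between the two timings is the decode phase, which is why a slow stream can show a low TTFT and a high total latency at the same time.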
Shaping the mock backend (the "fake GPU")
The mock's latency is fully configurable, so you can model anything from an instant echo (measure pure proxy overhead) to a slow, long-streaming model (measure how many slow streams the box holds open at once):
| flag | meaning |
|---|---|
| --mock-ttft-ms | prefill latency before the first token (ms) |
| --mock-tokens | number of completion tokens emitted |
| --mock-tok-ms | per-token decode delay (ms), the inverse of tokens/sec |
| --mock-models | comma-separated model names advertised in /api/tags & /api/ps |
Example — simulate a realistic 40 tok/s model with 300 ms prefill emitting 200 tokens, and see how many concurrent such streams the router holds:
python test/load/loadtest.py --mock-backend --stream --ramp 16,64,256 \
--mock-ttft-ms 300 --mock-tokens 200 --mock-tok-ms 25 --duration 20
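The flag values in this example follow from simple arithmetic, sketched below. The helper names are illustrative, and the duration formula assumes the mock applies one --mock-tok-ms delay per emitted token, which this document does not confirm.

```python
def tok_ms_for_rate(tokens_per_s: float) -> float:
    """Per-token delay in ms that corresponds to a decode rate."""
    return 1000.0 / tokens_per_s


def ideal_stream_seconds(ttft_ms: float, tokens: int, tok_ms: float) -> float:
    """Unloaded stream duration, assuming one tok_ms delay per emitted token."""
    return (ttft_ms + tokens * tok_ms) / 1000.0


tok_ms = tok_ms_for_rate(40)                        # value for --mock-tok-ms
duration = ideal_stream_seconds(300, 200, tok_ms)   # seconds per stream
print(f"--mock-tok-ms {tok_ms:g}, ideal stream time {duration:g} s")
```

Under that assumption, 40 tok/s maps to a 25 ms per-token delay and each stream takes about 5.3 s when nothing queues, so a client completes only a few streams within a 20 s stage.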
Load shape & misc flags
| flag | default | meaning |
|---|---|---|
| --concurrency N | 32 | concurrent virtual clients |
| --duration S | 20 | seconds per stage (ignored if --requests is set) |
| --requests N | — | send exactly N requests instead of running for a fixed duration |
| --warmup S | 2 | unmeasured warmup before each stage (hot caches/connections) |
| --timeout S | 120 | per-request timeout in seconds |
| --model NAME | mock | model name requested (must match what the backend advertises) |
| --prompt STR | … | user prompt sent in every request |
| --json PATH | — | also write the full results as JSON |
--mock-backend orchestration knobs
| flag | default | meaning |
|---|---|---|
| --router-workers N | 1 | uvicorn --workers for the spawned router |
| --router-max-conc N | = peak concurrency | max_concurrent_connections in the generated config (so the router doesn't queue unless you want it to) |
| --router-port / --mock-port | auto | fix the ports instead of auto-picking free ones |
| --keep-config | off | keep the generated temp config.yaml for inspection |
Notes & caveats
- Single-machine bias. With --mock-backend, the driver, router and mock all share the same CPU, so they compete for cores. For an upper-bound number, run the driver on a separate machine against a real router (--url), or pin processes to different cores.
- The generated config sets conversation_affinity: false and cache_enabled: false to measure the raw proxy path. The temp config and a throwaway token DB (under the system temp dir) are deleted on exit.
- To measure the router's admission limit instead of raw throughput, set --router-max-conc low (e.g. 2); requests beyond the limit queue on the least-busy endpoint rather than erroring.
- Requires the router's own dependencies (aiohttp, httpx, uvicorn, …); it reuses the project venv, no extra packages needed.
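The queue-instead-of-error behaviour behind the admission-limit note can be pictured with a semaphore. The sketch below is a generic illustration under that assumption, not the router's implementation: it does not model endpoint selection, and `handle` and `demo` are hypothetical names.

```python
import asyncio
import time


async def handle(limit: asyncio.Semaphore, work_s: float) -> float:
    """Wait for a free slot, do the work, return the time spent queued."""
    arrived = time.perf_counter()
    async with limit:
        queued = time.perf_counter() - arrived
        await asyncio.sleep(work_s)  # stand-in for proxying one request
    return queued


async def demo(max_conc: int, clients: int, work_s: float) -> list[float]:
    """Start `clients` requests at once against a limit of `max_conc`."""
    limit = asyncio.Semaphore(max_conc)
    return await asyncio.gather(*(handle(limit, work_s) for _ in range(clients)))


if __name__ == "__main__":
    waits = asyncio.run(demo(max_conc=2, clients=6, work_s=0.05))
    print([f"{w * 1000:.0f} ms" for w in waits])
```

With a limit of 2 and six simultaneous clients, two requests start immediately and the rest wait in waves. In a load test this shows up as rising latency with zero errors.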