Load testing the NOMYO Router

loadtest.py is a self-contained load generator (asyncio + httpx) with a built-in mock backend so you can measure the router's own concurrency ceiling on a given machine — independent of real GPU/backend compute.

It answers the question "how many concurrent connections can the router sustain on this box?" by hammering it with N concurrent virtual clients and reporting throughput, latency percentiles and (for streaming) time-to-first-token.
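To make "N concurrent virtual clients" concrete, here is a minimal sketch of a closed-loop load stage and its percentile report. It is not the code in loadtest.py: `fake_request`, `run_stage` and `percentile` are hypothetical names, the request is simulated with a sleep instead of an HTTP call, and the closed-loop behaviour (each client waits for its response before sending the next request) is an assumption.

```python
import asyncio
import math
import time


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an ascending list (pct in 0..100)."""
    if not sorted_values:
        raise ValueError("no samples")
    rank = max(1, math.ceil(pct / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


async def fake_request():
    """Hypothetical stand-in for one HTTP call to the router."""
    await asyncio.sleep(0.005)


async def run_stage(concurrency, duration_s, request=fake_request):
    latencies = []
    deadline = time.perf_counter() + duration_s

    async def client():
        # Closed loop: a virtual client sends its next request only after
        # the previous one has completed.
        while time.perf_counter() < deadline:
            t0 = time.perf_counter()
            await request()
            latencies.append(time.perf_counter() - t0)

    started = time.perf_counter()
    await asyncio.gather(*(client() for _ in range(concurrency)))
    elapsed = time.perf_counter() - started

    latencies.sort()
    return {
        "requests": len(latencies),
        "req_per_s": len(latencies) / elapsed,
        "p50_ms": percentile(latencies, 50) * 1000,
        "p99_ms": percentile(latencies, 99) * 1000,
    }


stats = asyncio.run(run_stage(concurrency=8, duration_s=0.2))
print(stats)
```

In a closed loop the number of in-flight requests never exceeds the concurrency setting, which is why concurrency, throughput and latency are tightly coupled in the results.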

Run everything from the project root with the project venv active:

source ~/.venv/nomyo-router/bin/activate   # whatever venv has the router deps

The three modes

1. --mock-backend: self-contained run

This mode spawns a fast fake Ollama/OpenAI backend and the router (wired to it via a temporary config), drives load against the router, then tears both down. Because the backend is trivial, the numbers reflect the router's proxy overhead, not model inference time.

python test/load/loadtest.py --mock-backend --stream --concurrency 128 --duration 30

2. Default — drive an already-running router

python test/load/loadtest.py --url http://127.0.0.1:12434 \
    --api ollama --stream --concurrency 64 --duration 30 --model llama3

3. --serve-mock — just the mock backend

Run only the fake backend and point your own router config.yaml at it (endpoints: [http://127.0.0.1:11434]):

python test/load/loadtest.py --serve-mock --mock-port 11434 --mock-tokens 64

Finding the concurrency knee

--ramp sweeps several concurrency levels and prints a table. The knee is where req/s stops rising and p99 latency starts climbing sharply:

python test/load/loadtest.py --mock-backend --stream \
    --ramp 8,32,64,128,256 --duration 15

Sample output:

 conc     req      ok   err     req/s    p50ms    p90ms     p99ms     maxms  ttftP50  ttftP99
---------------------------------------------------------------------------------------------
    8     120     120     0      19.8    404.6    448.3     478.6     501.4    358.4    391.7
   32     140     140     0      21.5   1487.1   1641.8    2341.8    2397.4   1269.8   1476.3
   64     148     148     0      21.3   2953.0   4632.5    5204.3    5267.0   1207.8   3031.7
  128     168     168     0      19.0   6376.4   8608.9    8726.9    8739.8   2843.1   8348.6

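Reading the table above: throughput stays flat at roughly 20 req/s while latency grows linearly with concurrency, the classic signature of a single-worker serialization bottleneck. Because req/s is already flat at the lowest level, the knee sits at or below 8 concurrent clients in this run. The rows were measured with the default --router-workers 1, so they show the ceiling of a single worker; raising --router-workers lets throughput scale across CPU cores.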
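The flat-throughput, linear-latency pattern can be checked against Little's law (in-flight requests = throughput × latency). The sketch below uses the concurrency, req/s and p50 values copied from the sample table above. It assumes a closed-loop driver and uses p50 as a stand-in for mean latency, so only approximate agreement should be expected.

```python
# (concurrency, req/s, p50 ms) copied from the sample ramp table
rows = [
    (8, 19.8, 404.6),
    (32, 21.5, 1487.1),
    (64, 21.3, 2953.0),
    (128, 19.0, 6376.4),
]


def predicted_latency_ms(concurrency, req_per_s):
    """Little's law for a closed loop: latency = in-flight / throughput."""
    return concurrency / req_per_s * 1000


for conc, rps, p50 in rows:
    pred = predicted_latency_ms(conc, rps)
    print(f"conc={conc:4d}  predicted={pred:7.1f} ms  measured p50={p50:7.1f} ms")
```

For these rows the prediction stays within roughly 6% of the measured p50, which is what a fixed-capacity server with requests waiting in line would produce.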

Streaming vs non-streaming, Ollama vs OpenAI

| flag | effect |
| --- | --- |
| --stream / --no-stream | streamed response (default) vs a single buffered response |
| --api ollama | drives POST /api/chat (default) |
| --api openai | drives POST /v1/chat/completions |

Streaming runs additionally report TTFT (time-to-first-token), which isolates prefill/routing latency from total stream duration.
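As an illustration of how TTFT differs from total stream duration, the following sketch times an async stream of chunks. It is an assumption about the general technique, not the measurement code in loadtest.py; `fake_stream` and `measure_stream` are hypothetical names, and the stream is simulated with sleeps rather than a real HTTP response.

```python
import asyncio
import time


async def fake_stream(ttft_s, tokens, tok_s):
    """Hypothetical stand-in for a streamed response body."""
    await asyncio.sleep(ttft_s)           # prefill before the first token
    for i in range(tokens):
        if i:
            await asyncio.sleep(tok_s)    # decode delay between tokens
        yield f"tok{i}"


async def measure_stream(chunks):
    """Return (ttft_seconds, total_seconds, chunk_count) for an async stream."""
    start = time.perf_counter()
    ttft = None
    count = 0
    async for _ in chunks:
        if ttft is None:
            # First chunk arrived: this is the time-to-first-token.
            ttft = time.perf_counter() - start
        count += 1
    total = time.perf_counter() - start
    return ttft, total, count


ttft, total, count = asyncio.run(measure_stream(fake_stream(0.05, 5, 0.01)))
print(f"ttft={ttft * 1000:.1f} ms  total={total * 1000:.1f} ms  chunks={count}")
```

With httpx, the same function could be fed `response.aiter_lines()` inside an `async with client.stream("POST", url, json=payload) as response:` block.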

Shaping the mock backend (the "fake GPU")

The mock's latency is fully configurable, so you can model anything from an instant echo (measure pure proxy overhead) to a slow, long-streaming model (measure how many slow streams the box holds open at once):

| flag | meaning |
| --- | --- |
| --mock-ttft-ms | prefill latency before the first token (ms) |
| --mock-tokens | number of completion tokens emitted |
| --mock-tok-ms | per-token decode delay (ms), i.e. 1000 / tokens per second |
| --mock-models | comma-separated model names advertised in /api/tags & /api/ps |

Example — simulate a realistic 40 tok/s model with 300 ms prefill emitting 200 tokens, and see how many concurrent such streams the router holds:

python test/load/loadtest.py --mock-backend --stream --ramp 16,64,256 \
    --mock-ttft-ms 300 --mock-tokens 200 --mock-tok-ms 25 --duration 20
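The flag values in this example follow from simple arithmetic, sketched below. The helper names are hypothetical, and the calculation assumes the mock applies the per-token delay once per emitted token and that the router adds no overhead, so the results are idealized upper bounds rather than expected measurements.

```python
def tok_ms_from_rate(tokens_per_s):
    """Per-token delay in ms for a given decode rate."""
    return 1000 / tokens_per_s


def ideal_stream_ms(ttft_ms, tokens, tok_ms):
    """Duration of one stream if only the mock's delays matter."""
    return ttft_ms + tokens * tok_ms


def max_req_per_s(concurrency, stream_ms):
    """Closed-loop ceiling: each client completes one stream per stream_ms."""
    return concurrency / (stream_ms / 1000)


tok_ms = tok_ms_from_rate(40)                  # 25.0 ms per token
stream_ms = ideal_stream_ms(300, 200, tok_ms)  # 5300.0 ms per stream

for conc in (16, 64, 256):
    print(f"conc={conc:3d}  ideal ceiling={max_req_per_s(conc, stream_ms):5.1f} req/s")
```

Each stream therefore lasts about 5.3 s at best, giving ideal ceilings of about 3, 12 and 48 req/s for the three ramp levels. Measured req/s falling well below these figures would point at the router or the shared CPU rather than the mock.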

Load shape & misc flags

| flag | default | meaning |
| --- | --- | --- |
| --concurrency N | 32 | concurrent virtual clients |
| --duration S | 20 | seconds per stage (ignored if --requests is set) |
| --requests N | | send exactly N requests instead of running for a fixed duration |
| --warmup S | 2 | unmeasured warmup before each stage (hot caches/connections) |
| --timeout S | 120 | per-request timeout |
| --model NAME | mock | model name requested (must match what the backend advertises) |
| --prompt STR | | user prompt sent in every request |
| --json PATH | | also write the full results as JSON |

--mock-backend orchestration knobs

| flag | default | meaning |
| --- | --- | --- |
| --router-workers N | 1 | uvicorn --workers for the spawned router |
| --router-max-conc N | = peak concurrency | max_concurrent_connections in the generated config (so the router doesn't queue unless you want it to) |
| --router-port / --mock-port | auto | fix the ports instead of auto-picking free ones |
| --keep-config | off | keep the generated temp config.yaml for inspection |

Notes & caveats

  • Single-machine bias. With --mock-backend, the driver, router and mock all share the same CPU, so they compete for cores. For an upper-bound number, run the driver on a separate machine against a real router (--url), or pin processes to different cores.
  • The generated config sets conversation_affinity: false and cache_enabled: false to measure the raw proxy path. The temp config and a throwaway token DB (under the system temp dir) are deleted on exit.
  • To measure the router's admission limit instead of raw throughput, set --router-max-conc low (e.g. 2) — requests beyond the limit queue on the least-busy endpoint rather than erroring.
  • Requires the router's own dependencies (aiohttp, httpx, uvicorn, …); it reuses the project venv, no extra packages needed.
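The queue-instead-of-error behaviour mentioned in the --router-max-conc note above can be pictured with a semaphore. This is only an analogy under stated assumptions: the router's actual admission logic is not shown in this README, `run_with_admission_limit` is a hypothetical name, and the sketch models a single endpoint with a sleep in place of real work.

```python
import asyncio


async def run_with_admission_limit(n_requests, limit, service_s=0.01):
    """Admit at most `limit` requests at once; the rest wait their turn."""
    sem = asyncio.Semaphore(limit)
    in_flight = 0
    peak = 0
    completed = 0

    async def handle():
        nonlocal in_flight, peak, completed
        async with sem:                  # excess requests queue here
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(service_s)   # simulated backend work
            in_flight -= 1
            completed += 1

    await asyncio.gather(*(handle() for _ in range(n_requests)))
    return peak, completed


peak, completed = asyncio.run(run_with_admission_limit(n_requests=10, limit=2))
print(f"peak in-flight={peak}, completed={completed}")
```

All ten requests finish, but never more than two run at the same time, so a low limit shows up in the load test as added latency rather than as errors.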