# Load testing the NOMYO Router

`loadtest.py` is a self-contained load generator (asyncio + httpx) with a built-in
**mock backend** so you can measure the router's own concurrency ceiling on a given
machine — independent of real GPU/backend compute.

It answers the question *"how many concurrent connections can the router sustain
on this box?"* by hammering it with N concurrent virtual clients and reporting
throughput, latency percentiles and (for streaming) time-to-first-token.
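
As a rough illustration of what those reported numbers mean, the sketch below derives throughput and nearest-rank percentiles from a list of per-request latencies. It is not taken from `loadtest.py`; the function names `percentile` and `summarize` and the nearest-rank method are assumptions chosen for clarity, and the real tool may interpolate differently.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list (pct in 0..100)."""
    ordered = sorted(samples)
    # Nearest-rank: smallest value with at least pct% of samples at or below it.
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(latencies_ms, wall_seconds):
    """Hypothetical summary of one stage: throughput plus latency percentiles."""
    return {
        "req/s": len(latencies_ms) / wall_seconds,
        "p50ms": percentile(latencies_ms, 50),
        "p90ms": percentile(latencies_ms, 90),
        "p99ms": percentile(latencies_ms, 99),
        "maxms": max(latencies_ms),
    }


if __name__ == "__main__":
    # 100 synthetic latencies (1..100 ms) collected over a 5 second stage.
    print(summarize(list(range(1, 101)), wall_seconds=5.0))
```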

Run everything from the project root with the project venv active:

```bash
source ~/.venv/nomyo-router/bin/activate   # whatever venv has the router deps
```

## The three modes

### 1. `--mock-backend` (recommended) — fully self-contained

Spawns a fast fake Ollama/OpenAI backend **and** the router (wired to it via a
temporary config), drives load against the router, then tears both down. Because
the backend is trivial, the numbers reflect the **router's proxy overhead**, not
model inference time.

```bash
python test/load/loadtest.py --mock-backend --stream --concurrency 128 --duration 30
```

### 2. Default — drive an already-running router

```bash
python test/load/loadtest.py --url http://127.0.0.1:12434 \
    --api ollama --stream --concurrency 64 --duration 30 --model llama3
```

### 3. `--serve-mock` — just the mock backend

Run only the fake backend and point your own router `config.yaml` at it
(`endpoints: [http://127.0.0.1:11434]`):

```bash
python test/load/loadtest.py --serve-mock --mock-port 11434 --mock-tokens 64
```

## Finding the concurrency knee

`--ramp` sweeps several concurrency levels and prints a table. The knee is where
`req/s` stops rising and `p99` latency starts climbing sharply:

```bash
python test/load/loadtest.py --mock-backend --stream \
    --ramp 8,32,64,128,256 --duration 15
```

```
 conc     req      ok   err     req/s    p50ms    p90ms     p99ms     maxms  ttftP50  ttftP99
---------------------------------------------------------------------------------------------
    8     120     120     0      19.8    404.6    448.3     478.6     501.4    358.4    391.7
   32     140     140     0      21.5   1487.1   1641.8    2341.8    2397.4   1269.8   1476.3
   64     148     148     0      21.3   2953.0   4632.5    5204.3    5267.0   1207.8   3031.7
  128     168     168     0      19.0   6376.4   8608.9    8726.9    8739.8   2843.1   8348.6
```

> Reading the table above: throughput stays flat (~20 req/s) while latency grows
> roughly linearly with concurrency, the classic signature of a **single-worker
> serialization bottleneck**. With the default `--router-workers 1`, each table
> row measures the per-worker ceiling; raising `--router-workers` lets
> throughput scale across CPU cores.
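
One way to sanity-check that reading is Little's Law: in a closed loop with N clients, latency is approximately N divided by throughput. The sketch below applies it to the rows of the sample table; it uses p50 as a stand-in for mean latency, which is an approximation, and `predicted_latency_ms` is an illustrative helper, not part of `loadtest.py`.

```python
# (concurrency, req/s, p50 latency in ms) copied from the sample table above.
rows = [
    (8, 19.8, 404.6),
    (32, 21.5, 1487.1),
    (64, 21.3, 2953.0),
    (128, 19.0, 6376.4),
]


def predicted_latency_ms(concurrency, req_per_s):
    """Little's Law for a closed loop: latency = in-flight requests / throughput."""
    return concurrency / req_per_s * 1000.0


if __name__ == "__main__":
    for conc, rps, p50 in rows:
        pred = predicted_latency_ms(conc, rps)
        print(f"conc={conc:4d}  predicted={pred:8.1f} ms  measured p50={p50:8.1f} ms")
```

When the predicted and measured values track each other while `req/s` stays flat, the extra clients are only waiting in a queue, so added concurrency buys latency rather than throughput.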

## Streaming vs non-streaming, Ollama vs OpenAI

| flag | effect |
|------|--------|
| `--stream` / `--no-stream` | streamed response (default) vs a single buffered response |
| `--api ollama` | drives `POST /api/chat` (default) |
| `--api openai` | drives `POST /v1/chat/completions` |

Streaming runs additionally report **TTFT** (time-to-first-token), which isolates
prefill/routing latency from total stream duration.
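
To make the distinction concrete, here is a minimal sketch of how TTFT and total duration can be taken from the same stream. It is an assumption about the general technique, not the code in `loadtest.py`: `timed_stream` and `fake_stream` are hypothetical names, and a synchronous generator stands in for an HTTP response body.

```python
import time


def timed_stream(chunks, clock=time.perf_counter):
    """Consume an iterable of chunks; return (ttft_s, total_s, n_chunks).

    ttft_s is None when the stream yields nothing.
    """
    start = clock()
    ttft = None
    count = 0
    for _ in chunks:
        if ttft is None:
            ttft = clock() - start  # time until the first chunk arrived
        count += 1
    return ttft, clock() - start, count


def fake_stream(ttft_s, n_tokens, tok_s):
    """Stand-in for a streaming response: prefill delay, then paced tokens."""
    time.sleep(ttft_s)
    for i in range(n_tokens):
        if i:
            time.sleep(tok_s)
        yield f"tok{i}"


if __name__ == "__main__":
    ttft, total, n = timed_stream(fake_stream(0.02, 5, 0.005))
    print(f"ttft={ttft * 1000:.1f} ms  total={total * 1000:.1f} ms  chunks={n}")
```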

## Shaping the mock backend (the "fake GPU")

The mock's latency is fully configurable, so you can model anything from an
instant echo (measure pure proxy overhead) to a slow, long-streaming model
(measure how many slow streams the box holds open at once):

| flag | meaning |
|------|---------|
| `--mock-ttft-ms` | prefill latency before the first token (ms) |
| `--mock-tokens`  | number of completion tokens emitted |
| `--mock-tok-ms`  | per-token decode delay (ms), i.e. 1000 / tokens-per-second (25 ms = 40 tok/s) |
| `--mock-models`  | comma-separated model names advertised in `/api/tags` & `/api/ps` |

Example: simulate a realistic 40 tok/s model with a 300 ms prefill that emits 200
tokens, and see how many such streams the router holds open concurrently:

```bash
python test/load/loadtest.py --mock-backend --stream --ramp 16,64,256 \
    --mock-ttft-ms 300 --mock-tokens 200 --mock-tok-ms 25 --duration 20
```

## Load shape & misc flags

| flag | default | meaning |
|------|---------|---------|
| `--concurrency N` | 32 | concurrent virtual clients |
| `--duration S` | 20 | seconds per stage (ignored if `--requests` is set) |
| `--requests N` | — | send exactly N requests instead of running for a fixed duration |
| `--warmup S` | 2 | unmeasured warmup before each stage (warms caches and connections) |
| `--timeout S` | 120 | per-request timeout |
| `--model NAME` | `mock` | model name requested (must match what the backend advertises) |
| `--prompt STR` | … | user prompt sent in every request |
| `--json PATH` | — | also write the full results as JSON |

### `--mock-backend` orchestration knobs

| flag | default | meaning |
|------|---------|---------|
| `--router-workers N` | 1 | `uvicorn --workers` for the spawned router |
| `--router-max-conc N` | = peak concurrency | `max_concurrent_connections` in the generated config (so the router doesn't queue unless you want it to) |
| `--router-port` / `--mock-port` | auto | fix the ports instead of auto-picking free ones |
| `--keep-config` | off | keep the generated temp `config.yaml` for inspection |

## Notes & caveats

- **Single-machine bias.** With `--mock-backend`, the driver, router and mock all
  share the same CPU, so they compete for cores. For an upper-bound number, run
  the driver on a separate machine against a real router (`--url`), or pin
  processes to different cores.
- The generated config sets `conversation_affinity: false` and
  `cache_enabled: false` to measure the raw proxy path. The temp config and a
  throwaway token DB (under the system temp dir) are deleted on exit.
- To measure the router's *admission* limit instead of raw throughput, set
  `--router-max-conc` low (e.g. `2`) — requests beyond the limit queue on the
  least-busy endpoint rather than erroring.
- Requires the router's own dependencies (`aiohttp`, `httpx`, `uvicorn`, …). It
  reuses the project venv, so no extra packages are needed.
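
The admission-limit behaviour noted above (excess requests wait rather than fail) can be pictured with a semaphore. This is a toy model under stated assumptions, not the router's implementation: it has a single pool instead of per-endpoint queues, and `run_with_admission_limit` is a hypothetical name.

```python
import asyncio


async def run_with_admission_limit(n_requests, max_conc, service_s=0.01):
    """Fire n_requests at once; at most max_conc are serviced concurrently."""
    gate = asyncio.Semaphore(max_conc)
    in_flight = 0
    peak = 0
    completed = 0

    async def handle():
        nonlocal in_flight, peak, completed
        async with gate:  # excess requests wait here instead of erroring
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(service_s)  # simulated backend work
            in_flight -= 1
            completed += 1

    await asyncio.gather(*(handle() for _ in range(n_requests)))
    return peak, completed


peak, completed = asyncio.run(run_with_admission_limit(10, 2))
print(f"peak in-flight={peak}, completed={completed}")
```

With a low limit, every request still completes, but latency grows with queue depth, which is the effect a low `--router-max-conc` is meant to expose.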