remove model_metrics_sources and selection_policy from demo and docs

Adil Hafeez 2026-03-30 15:51:09 -07:00
parent ba701264be
commit 0ff166e0f6
3 changed files with 7 additions and 168 deletions

@@ -13,7 +13,6 @@ Plano is an AI-native proxy and data plane for agentic apps — with built-in or
- **One endpoint, many models** — apps call Plano using standard OpenAI/Anthropic APIs; Plano handles provider selection, keys, and failover
- **Intelligent routing** — a lightweight 1.5B router model classifies user intent and picks the best model per request
- **Cost & latency ranking** — models are ranked by live cost (DigitalOcean pricing API) or latency (Prometheus) before returning the fallback list
- **Platform governance** — centralize API keys, rate limits, guardrails, and observability without touching app code
- **Runs anywhere** — single binary; self-host the router for full data privacy
@@ -30,38 +29,24 @@ routing_preferences:
models:
- openai/gpt-4o
- openai/gpt-4o-mini
selection_policy:
prefer: cheapest # rank by live cost data
- name: code_generation
description: generating new code, writing functions, or creating boilerplate
models:
- anthropic/claude-sonnet-4-20250514
- openai/gpt-4o
selection_policy:
prefer: fastest # rank by Prometheus p95 latency
```
### `selection_policy.prefer` values
| Value | Behavior |
|---|---|
| `cheapest` | Sort models by ascending cost. Requires a `type: cost` source in `model_metrics_sources`. |
| `fastest` | Sort models by ascending P95 latency. Requires a `type: latency` source in `model_metrics_sources`. |
| `random` | Shuffle the model list on each request. |
| `none` | Return models in definition order — no reordering. |
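The `prefer` values in the table above could be implemented roughly as follows. This is a minimal sketch, not Plano's actual code: `rank_models` and the `metrics` mapping (model name to cost or latency) are hypothetical names, and placing models without data last is taken from the startup warning behavior described in this README.

```python
import random
from typing import Dict, List, Optional


def rank_models(
    models: List[str],
    prefer: str,
    metrics: Optional[Dict[str, float]] = None,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Return `models` reordered according to a selection_policy.prefer value."""
    if prefer == "none":
        # Definition order, no reordering.
        return list(models)
    if prefer == "random":
        shuffled = list(models)
        (rng or random.Random()).shuffle(shuffled)
        return shuffled
    if prefer in ("cheapest", "fastest"):
        # Both policies sort ascending; only the metric (cost vs. latency) differs.
        metrics = metrics or {}
        known = [m for m in models if m in metrics]
        unknown = [m for m in models if m not in metrics]
        # Models with no data keep their relative order and go last.
        return sorted(known, key=lambda m: metrics[m]) + unknown
    raise ValueError(f"unknown prefer value: {prefer!r}")


# Latency figures taken from the walkthrough in this README.
latency = {"anthropic/claude-sonnet-4-20250514": 0.85, "openai/gpt-4o": 1.2}
ranked = rank_models(
    ["openai/gpt-4o", "anthropic/claude-sonnet-4-20250514"], "fastest", latency
)
print(ranked)
```

Because `cheapest` and `fastest` differ only in which metric mapping is passed in, one sort path covers both.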
When a request arrives, Plano:
1. Sends the conversation + route descriptions to Arch-Router for intent classification
2. Looks up the matched route and ranks its candidate models by cost or latency
2. Looks up the matched route and returns its candidate models
3. Returns an ordered list — client uses `models[0]`, falls back to `models[1]` on 429/5xx
```
1. Request arrives → "Write binary search in Python"
2. Arch-Router classifies → route: "code_generation"
3. Rank by latency → claude-sonnet (0.85s) < gpt-4o (1.2s)
4. Response → models: ["anthropic/claude-sonnet-4-20250514", "openai/gpt-4o"]
3. Response → models: ["anthropic/claude-sonnet-4-20250514", "openai/gpt-4o"]
```
No match? Arch-Router returns `null` route → client falls back to the model in the original request.
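On the client side, the fallback rule above (try `models[0]`, move on to the next entry on 429 or 5xx, and use the originally requested model when no route matched) might look like the sketch below. The names `call_with_fallback` and `send` are hypothetical; `send` stands in for whatever function issues the actual chat completion request and returns a status code and body.

```python
from typing import Callable, List, Optional, Tuple

Response = Tuple[int, str]  # (HTTP status code, response body)


def is_retryable(status: int) -> bool:
    """429 and any 5xx status trigger a fallback to the next model."""
    return status == 429 or 500 <= status <= 599


def call_with_fallback(
    models: Optional[List[str]],
    send: Callable[[str], Response],
    original_model: Optional[str] = None,
) -> Tuple[str, int, str]:
    """Try each candidate in order; return (model, status, body) of the last attempt."""
    candidates = list(models or [])
    if not candidates and original_model is not None:
        # No route matched: fall back to the model in the original request.
        candidates = [original_model]
    if not candidates:
        raise ValueError("no model to call")
    for model in candidates:
        status, body = send(model)
        if not is_retryable(status):
            break
    return model, status, body


def fake_send(model: str) -> Response:
    """Stand-in transport: the first model is rate limited, everything else succeeds."""
    if model == "anthropic/claude-sonnet-4-20250514":
        return 429, "rate limited"
    return 200, "ok"


used, status, body = call_with_fallback(
    ["anthropic/claude-sonnet-4-20250514", "openai/gpt-4o"], fake_send
)
print(used, status)
```

If every candidate fails, the sketch returns the last failing response so the caller can decide how to surface the error.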
@@ -77,28 +62,12 @@ export OPENAI_API_KEY=<your-key>
export ANTHROPIC_API_KEY=<your-key>
```
Start Prometheus and the mock latency metrics server:
Start Plano:
```bash
cd demos/llm_routing/model_routing_service
docker compose up -d
planoai up demos/llm_routing/model_routing_service/config.yaml
```
Then start Plano:
```bash
planoai up config.yaml
```
On startup you should see logs like:
```
fetched digitalocean pricing: N models
fetched prometheus latency metrics: 3 models
```
If a model in `routing_preferences` has no matching pricing or latency data, Plano logs a warning at startup — the model is still included but ranked last.
## Run the demo
```bash
@@ -135,35 +104,7 @@ Response:
}
```
The response contains the ranked model list — your client should try `models[0]` first and fall back to `models[1]` on 429 or 5xx errors.
## Metrics Sources
### Cost Metrics (provider: digitalocean)
Fetches public model pricing from the DigitalOcean Gen-AI catalog (no auth required). Model IDs are normalized as `lowercase(creator)/model_id`. Cost scalar = `input_price_per_million + output_price_per_million`.
```yaml
model_metrics_sources:
  - type: cost
    provider: digitalocean
    refresh_interval: 3600 # re-fetch every hour
```
### Latency Metrics (provider: prometheus)
Queries a Prometheus instance for P95 latency. The PromQL expression must return an instant vector with a `model_name` label matching the model names in `routing_preferences`.
```yaml
model_metrics_sources:
  - type: latency
    provider: prometheus
    url: http://localhost:9090
    query: model_latency_p95_seconds
    refresh_interval: 60
```
The demo's `metrics_server.py` exposes mock latency data; `docker compose up -d` starts it alongside Prometheus.
The response contains the model list — your client should try `models[0]` first and fall back to `models[1]` on 429 or 5xx errors.
## Kubernetes Deployment (Self-hosted Arch-Router on GPU)

@@ -22,28 +22,9 @@ routing_preferences:
models:
- openai/gpt-4o
- openai/gpt-4o-mini
selection_policy:
prefer: cheapest
- name: code_generation
description: generating new code, writing functions, or creating boilerplate
models:
- anthropic/claude-sonnet-4-20250514
- openai/gpt-4o
selection_policy:
prefer: fastest
model_metrics_sources:
- type: cost
provider: digitalocean
refresh_interval: 3600
model_aliases:
openai-gpt-4o: openai/gpt-4o
openai-gpt-4o-mini: openai/gpt-4o-mini
anthropic-claude-sonnet-4: anthropic/claude-sonnet-4-20250514
- type: latency
provider: prometheus
url: http://localhost:9090
query: model_latency_p95_seconds
refresh_interval: 60

@@ -21,14 +21,12 @@ POST /v1/chat/completions
{
"name": "code generation",
"description": "generating new code snippets",
"models": ["anthropic/claude-sonnet-4-20250514", "openai/gpt-4o", "openai/gpt-4o-mini"],
"selection_policy": {"prefer": "fastest"}
"models": ["anthropic/claude-sonnet-4-20250514", "openai/gpt-4o", "openai/gpt-4o-mini"]
},
{
"name": "general questions",
"description": "casual conversation and simple queries",
"models": ["openai/gpt-4o-mini"],
"selection_policy": {"prefer": "cheapest"}
"models": ["openai/gpt-4o-mini"]
}
]
}
@@ -41,15 +39,6 @@ POST /v1/chat/completions
| `name` | string | yes | Route identifier. Must match the LLM router's route classification. |
| `description` | string | yes | Natural language description used by the router to match user intent. |
| `models` | string[] | yes | Ordered candidate pool. At least one entry required. Must be declared in `model_providers`. |
| `selection_policy.prefer` | enum | yes | How to rank models: `cheapest`, `fastest`, or `none`. |
### `selection_policy.prefer` values
| Value | Behavior |
|---|---|
| `cheapest` | Sort by ascending cost from the metrics endpoint. Models with no data appended last. |
| `fastest` | Sort by ascending latency from the metrics endpoint. Models with no data appended last. |
| `none` | Return models in the order they were defined — no reordering. |
### Notes
@@ -121,86 +110,14 @@ routing_preferences:
models:
- anthropic/claude-sonnet-4-20250514
- openai/gpt-4o
selection_policy:
prefer: fastest
- name: general questions
description: casual conversation and simple queries
models:
- openai/gpt-4o-mini
- openai/gpt-4o
selection_policy:
prefer: cheapest
# Optional: live cost and latency data sources (max one per type)
model_metrics_sources:
- type: cost
provider: digitalocean
refresh_interval: 3600
- type: latency
provider: prometheus
url: https://internal-prometheus/
query: histogram_quantile(0.95, sum by (model_name, le) (rate(model_latency_seconds_bucket[5m])))
refresh_interval: 60
```
### Startup validation
Plano validates metric source configuration at startup and exits with a clear error if:
| Condition | Error |
|---|---|
| `prefer: cheapest` with no cost source | `prefer: cheapest requires a cost metrics source` |
| `prefer: fastest` with no latency source | `prefer: fastest requires a latency metrics source` |
| Two `type: cost` entries | `only one cost metrics source is allowed` |
| Two `type: latency` entries | `only one latency metrics source is allowed` |
If a model listed in `routing_preferences` has no matching entry in the fetched pricing or latency data, Plano logs a `WARN` at startup — the model is still included but ranked last. The same warning is also emitted per routing request when a model has no data in cache at decision time (relevant for inline `routing_preferences` overrides that reference models not covered by the configured metrics sources).
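The four startup checks in the table above can be expressed as a small validation pass. This is an illustrative sketch only: `validate_metrics_config` is a hypothetical helper operating on already-parsed config dictionaries, and the error strings are copied from the table.

```python
from typing import Any, Dict, List


def validate_metrics_config(
    routing_preferences: List[Dict[str, Any]],
    model_metrics_sources: List[Dict[str, Any]],
) -> List[str]:
    """Return the list of startup errors; an empty list means the config is valid."""
    errors: List[str] = []
    source_types = [source.get("type") for source in model_metrics_sources]

    # At most one source per metric type.
    for kind in ("cost", "latency"):
        if source_types.count(kind) > 1:
            errors.append(f"only one {kind} metrics source is allowed")

    # Each ranking policy needs its matching metrics source.
    prefers = {
        route.get("selection_policy", {}).get("prefer")
        for route in routing_preferences
    }
    if "cheapest" in prefers and "cost" not in source_types:
        errors.append("prefer: cheapest requires a cost metrics source")
    if "fastest" in prefers and "latency" not in source_types:
        errors.append("prefer: fastest requires a latency metrics source")
    return errors


routes = [
    {"name": "code generation", "selection_policy": {"prefer": "fastest"}},
    {"name": "general questions", "selection_policy": {"prefer": "cheapest"}},
]
sources = [{"type": "cost", "provider": "digitalocean"}]
print(validate_metrics_config(routes, sources))
```

In this example only a cost source is configured, so the `fastest` route produces the latency error while the `cheapest` route passes.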
### Cost metrics (provider: digitalocean)
Fetches public model pricing from the DigitalOcean Gen-AI catalog. No authentication required.
```yaml
model_metrics_sources:
  - type: cost
    provider: digitalocean
    refresh_interval: 3600 # re-fetch every hour; omit to fetch once on startup
    model_aliases:
      openai-gpt-4o: openai/gpt-4o
      openai-gpt-4o-mini: openai/gpt-4o-mini
      anthropic-claude-sonnet-4: anthropic/claude-sonnet-4-20250514
```
DO catalog entries are stored by their `model_id` field (e.g. `openai-gpt-4o`). The cost scalar is `input_price_per_million + output_price_per_million`.
**`model_aliases`** — optional. Maps DO `model_id` values to the model names used in `routing_preferences`. Without aliases, cost data is stored under the DO model_id (e.g. `openai-gpt-4o`), which won't match models configured as `openai/gpt-4o`. Aliases let you bridge the naming gap without changing your routing config.
**Constraints:**
- Only one `type: cost` entry is allowed.
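To make the alias mapping and cost scalar concrete, here is a minimal sketch of how a cost table could be built from catalog entries. The function `build_cost_table` is hypothetical, the entry field names are assumed to mirror the terms used above (`model_id`, `input_price_per_million`, `output_price_per_million`), and the prices are made-up placeholders rather than real DigitalOcean pricing.

```python
from typing import Any, Dict, List, Optional


def build_cost_table(
    catalog: List[Dict[str, Any]],
    model_aliases: Optional[Dict[str, str]] = None,
) -> Dict[str, float]:
    """Map routing model names to a single cost scalar (input + output price)."""
    aliases = model_aliases or {}
    costs: Dict[str, float] = {}
    for entry in catalog:
        model_id = entry["model_id"]
        # Without an alias the cost stays keyed by the catalog model_id.
        name = aliases.get(model_id, model_id)
        costs[name] = (
            entry["input_price_per_million"] + entry["output_price_per_million"]
        )
    return costs


# Placeholder catalog entries; the prices are illustrative only.
catalog = [
    {"model_id": "openai-gpt-4o", "input_price_per_million": 2.5, "output_price_per_million": 10.0},
    {"model_id": "openai-gpt-4o-mini", "input_price_per_million": 0.25, "output_price_per_million": 0.5},
]
aliases = {"openai-gpt-4o": "openai/gpt-4o"}
print(build_cost_table(catalog, aliases))
```

Here `openai-gpt-4o-mini` has no alias, so its cost is stored under the catalog ID and would not match a route that lists `openai/gpt-4o-mini`.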
### Latency metrics (provider: prometheus)
Plano queries `{url}/api/v1/query?query={query}` on startup and each `refresh_interval`. The PromQL expression must return an instant vector with a `model_name` label:
```json
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {"metric": {"model_name": "anthropic/claude-sonnet-4-20250514"}, "value": [1234567890, "120.5"]},
      {"metric": {"model_name": "openai/gpt-4o"}, "value": [1234567890, "200.3"]}
    ]
  }
}
```
- The PromQL query is responsible for computing the percentile (e.g. `histogram_quantile(0.95, ...)`)
- Latency units are arbitrary — only relative order matters
- Models missing from the result are appended at the end of the ranked list
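As a rough illustration of the request and response handling described above, the sketch below builds the query URL and turns an instant-vector response into a model-to-latency mapping. `build_query_url` and `parse_latency` are hypothetical helper names, not part of Plano, and the HTTP request itself is left out.

```python
from typing import Any, Dict
from urllib.parse import urlencode


def build_query_url(url: str, query: str) -> str:
    """Build {url}/api/v1/query?query={query} with the PromQL expression URL-encoded."""
    return f"{url.rstrip('/')}/api/v1/query?{urlencode({'query': query})}"


def parse_latency(response: Dict[str, Any]) -> Dict[str, float]:
    """Extract {model_name: latency} from a Prometheus instant-vector response."""
    if response.get("status") != "success":
        raise ValueError("prometheus query did not succeed")
    data = response.get("data", {})
    if data.get("resultType") != "vector":
        raise ValueError("expected an instant vector result")
    latencies: Dict[str, float] = {}
    for sample in data.get("result", []):
        name = sample.get("metric", {}).get("model_name")
        if name is None:
            continue  # samples without a model_name label cannot be matched
        _timestamp, value = sample["value"]
        latencies[name] = float(value)  # Prometheus encodes sample values as strings
    return latencies


response = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"model_name": "anthropic/claude-sonnet-4-20250514"}, "value": [1234567890, "120.5"]},
            {"metric": {"model_name": "openai/gpt-4o"}, "value": [1234567890, "200.3"]},
        ],
    },
}
print(build_query_url("http://localhost:9090", "model_latency_p95_seconds"))
print(parse_latency(response))
```

Since only relative order matters, the parsed values can feed an ascending sort directly without any unit conversion.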
---
## Version Requirements