mirror of https://github.com/katanemo/plano.git
synced 2026-05-15 11:02:39 +02:00

restructure model_metrics_sources to use type + provider pattern

This commit is contained in:
parent e5751d6b13
commit ba701264be

7 changed files with 142 additions and 299 deletions
@@ -46,8 +46,8 @@ routing_preferences:
 | Value | Behavior |
 |---|---|
-| `cheapest` | Sort models by ascending cost. Requires `cost_metrics` or `digitalocean_pricing` in `model_metrics_sources`. |
-| `fastest` | Sort models by ascending P95 latency. Requires `prometheus_metrics` in `model_metrics_sources`. |
+| `cheapest` | Sort models by ascending cost. Requires a `type: cost` source in `model_metrics_sources`. |
+| `fastest` | Sort models by ascending P95 latency. Requires a `type: latency` source in `model_metrics_sources`. |
 | `random` | Shuffle the model list on each request. |
 | `none` | Return models in definition order — no reordering. |

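Reading the table above, the four `prefer` behaviors amount to a small ranking step. A minimal sketch of that logic (the function and parameter names here are illustrative, not Plano's actual API):

```python
import random

def rank_models(models, prefer, cost=None, latency_p95=None):
    """Order candidate models per the `prefer` setting.

    `cost` and `latency_p95` are dicts keyed by model name, as a
    metrics source would supply them. Hypothetical helper, not Plano's API.
    """
    if prefer == "cheapest":
        if cost is None:
            raise ValueError("cheapest requires a type: cost source")
        return sorted(models, key=lambda m: cost[m])  # ascending cost
    if prefer == "fastest":
        if latency_p95 is None:
            raise ValueError("fastest requires a type: latency source")
        return sorted(models, key=lambda m: latency_p95[m])  # ascending P95
    if prefer == "random":
        shuffled = models[:]
        random.shuffle(shuffled)  # new order on each request
        return shuffled
    return models[:]  # "none": keep definition order
```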
@@ -139,23 +139,25 @@ The response contains the ranked model list — your client should try `models[0
 ## Metrics Sources
 
-### DigitalOcean Pricing (`digitalocean_pricing`)
+### Cost Metrics (provider: digitalocean)
 
 Fetches public model pricing from the DigitalOcean Gen-AI catalog (no auth required). Model IDs are normalized as `lowercase(creator)/model_id`. Cost scalar = `input_price_per_million + output_price_per_million`.
 
 ```yaml
 model_metrics_sources:
-  - type: digitalocean_pricing
+  - type: cost
+    provider: digitalocean
     refresh_interval: 3600 # re-fetch every hour
 ```
 
-### Prometheus Latency (`prometheus_metrics`)
+### Latency Metrics (provider: prometheus)
 
 Queries a Prometheus instance for P95 latency. The PromQL expression must return an instant vector with a `model_name` label matching the model names in `routing_preferences`.
 
 ```yaml
 model_metrics_sources:
-  - type: prometheus_metrics
+  - type: latency
+    provider: prometheus
     url: http://localhost:9090
     query: model_latency_p95_seconds
     refresh_interval: 60

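For context on the latency source described above: a `type: latency` consumer has to turn a Prometheus instant vector into per-model numbers. A minimal sketch, assuming the standard `/api/v1/query` response shape (the function name is illustrative):

```python
import json

def parse_latency_vector(body: str) -> dict:
    """Map model_name -> P95 seconds from a Prometheus instant-vector
    response, e.g. GET /api/v1/query?query=model_latency_p95_seconds."""
    payload = json.loads(body)
    out = {}
    for sample in payload["data"]["result"]:
        name = sample["metric"].get("model_name")
        if name is not None:  # samples without the label can't be matched
            out[name] = float(sample["value"][1])  # value = [timestamp, "string"]
    return out
```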
@@ -163,32 +165,6 @@ model_metrics_sources:
 The demo's `metrics_server.py` exposes mock latency data; `docker compose up -d` starts it alongside Prometheus.
 
-### Custom Cost Endpoint (`cost_metrics`)
-
-```yaml
-model_metrics_sources:
-  - type: cost_metrics
-    url: https://my-internal-pricing-api/costs
-    auth:
-      type: bearer
-      token: $PRICING_TOKEN
-    refresh_interval: 300
-```
-
-Expected response format:
-
-```json
-{
-  "anthropic/claude-sonnet-4-20250514": {
-    "input_per_million": 3.0,
-    "output_per_million": 15.0
-  },
-  "openai/gpt-4o": {
-    "input_per_million": 5.0,
-    "output_per_million": 20.0
-  }
-}
-```
 
 ## Kubernetes Deployment (Self-hosted Arch-Router on GPU)
 
 To run Arch-Router in-cluster using vLLM instead of the default hosted endpoint:

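The pricing shape shown in the removed section still illustrates how the `cheapest` scalar is derived: per-model input and output prices collapse to one number. A minimal sketch of that reduction and of the documented ID normalization (helper names are illustrative):

```python
def normalize_model_id(creator: str, model_id: str) -> str:
    # Docs: model IDs are normalized as lowercase(creator)/model_id
    return f"{creator.lower()}/{model_id}"

def cost_scalars(pricing: dict) -> dict:
    # Docs: cost scalar = input price + output price, per million tokens
    return {
        model: p["input_per_million"] + p["output_per_million"]
        for model, p in pricing.items()
    }
```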
@@ -34,20 +34,16 @@ routing_preferences:
   prefer: fastest
 
 model_metrics_sources:
-  - type: digitalocean_pricing
+  - type: cost
+    provider: digitalocean
     refresh_interval: 3600
 model_aliases:
   openai-gpt-4o: openai/gpt-4o
   openai-gpt-4o-mini: openai/gpt-4o-mini
   anthropic-claude-sonnet-4: anthropic/claude-sonnet-4-20250514
 
-  # Use cost_metrics instead of digitalocean_pricing to supply your own pricing data.
-  # The demo metrics_server.py exposes /costs with OpenAI and Anthropic pricing.
-  # - type: cost_metrics
-  #   url: http://localhost:8080/costs
-  #   refresh_interval: 300
-
-  - type: prometheus_metrics
+  - type: latency
+    provider: prometheus
     url: http://localhost:9090
     query: model_latency_p95_seconds
     refresh_interval: 60

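The type + provider pattern in the config above implies a two-key dispatch when sources are loaded, instead of the old one-key `type` lookup. A hypothetical sketch of that dispatch (registry contents and class names are assumptions, not Plano's internals):

```python
# Hypothetical registry keyed by (type, provider); the class names are
# placeholders illustrating the restructured config, not Plano's internals.
SOURCE_REGISTRY = {
    ("cost", "digitalocean"): "DigitalOceanPricingSource",
    ("latency", "prometheus"): "PrometheusLatencySource",
}

def resolve_source(entry: dict) -> str:
    """Look up the implementation for one model_metrics_sources entry."""
    key = (entry.get("type"), entry.get("provider"))
    if key not in SOURCE_REGISTRY:
        raise ValueError(f"unsupported metrics source: {key}")
    return SOURCE_REGISTRY[key]
```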