feat: make model pricing source configurable (models.dev + DigitalOcean) (#971)

Musa 2026-06-24 10:14:12 -07:00 committed by GitHub
parent 5cc4c4ee77
commit 558df0307c
GPG key ID: B5690EEEBB952194
9 changed files with 687 additions and 48 deletions


@ -209,6 +209,178 @@ Clients can let the router decide or still specify aliases:
)
.. _cost_latency_aware_selection:
Cost- and latency-aware selection
---------------------------------
When a route lists more than one candidate model, you can let Plano reorder that
candidate pool using **live cost or latency data** instead of relying solely on the
order you wrote them in. This is controlled per route with ``selection_policy`` and
backed by one or more ``model_metrics_sources``.
This is useful when several models are equally capable for a route and you want Plano
to always reach for the cheapest (or fastest) option first, with the others kept as
fallbacks.
Selection policy
~~~~~~~~~~~~~~~~~
Attach an optional ``selection_policy`` to any entry in ``routing_preferences``:

.. code-block:: yaml
    :caption: Per-route selection policy

    routing_preferences:
      - name: code review
        description: reviewing, analyzing, and suggesting improvements to existing code
        models:
          - anthropic/claude-sonnet-4-5
          - groq/llama-3.3-70b-versatile
        selection_policy:
          prefer: cheapest   # cheapest | fastest | none

``prefer`` accepts:
- ``cheapest`` — order candidates by total price (input + output rate) ascending, using a ``cost`` metrics source.
- ``fastest`` — order candidates by observed latency ascending, using a ``latency`` metrics source.
- ``none`` (default) — keep the order you declared; no reordering.
Models that have no data in the selected source are ranked **last**, in their original
order, so routing always degrades gracefully rather than dropping a candidate.
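The ordering rule described above can be sketched in a few lines of Python. This is a minimal illustration, not Plano's implementation: ``rank_candidates`` is a hypothetical helper, ``example/unpriced-model`` is a made-up model name, and the prices are invented numbers chosen only to show the ordering.

```python
from typing import Dict, List, Optional


def rank_candidates(models: List[str], metric: Optional[Dict[str, float]]) -> List[str]:
    """Order candidates by ascending metric value (hypothetical helper).

    Models missing from the metric map are placed last, in declared order.
    Passing metric=None mimics ``prefer: none`` and keeps the declared order.
    """
    if metric is None:
        return list(models)
    known = [m for m in models if m in metric]
    unknown = [m for m in models if m not in metric]
    # sorted() is stable, so models with equal values keep their declared order.
    return sorted(known, key=lambda m: metric[m]) + unknown


declared = [
    "anthropic/claude-sonnet-4-5",
    "groq/llama-3.3-70b-versatile",
    "example/unpriced-model",  # made-up name with no catalog entry
]
# Invented input + output per-million-token sums, for illustration only.
cost = {
    "anthropic/claude-sonnet-4-5": 18.0,
    "groq/llama-3.3-70b-versatile": 1.5,
}

ranked = rank_candidates(declared, cost)
print(ranked)
```

With these made-up numbers the cheaper model moves to the front, and the model without pricing data stays in the pool as the final fallback.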
Configuring the pricing source
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
``cheapest`` routing needs a price catalog. Plano's **default pricing provider is
DigitalOcean** — its GenAI model catalog is public (no API key, no signup), so cost data
is available out of the box and is what ``planoai obs`` uses if you don't configure
anything. The pricing source is fully swappable: point Plano at `models.dev <https://models.dev/>`_,
or at **any endpoint that exposes a supported pricing structure**.
The ``provider`` field selects which response schema Plano expects (and therefore how it
parses the catalog); the optional ``url`` lets you override the endpoint — for example to
use a mirror, a cached copy, or an internal catalog service that returns the same shape.

.. list-table::
    :header-rows: 1
    :widths: 18 34 28 20

    * - ``provider``
      - Default catalog URL
      - Key format
      - Expected structure
    * - ``digitalocean`` *(default)*
      - DigitalOcean GenAI model catalog
      - ``lowercase(creator)/model_id``
      - ``{ data: [ { model_id, pricing: { input_price_per_million, output_price_per_million } } ] }``
    * - ``models.dev``
      - ``https://models.dev/api.json``
      - ``creator/model`` (e.g. ``anthropic/claude-sonnet-4-5``)
      - ``{ <provider>: { models: { <model>: { cost: { input, output } } } } }``

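To make the two response shapes in the table concrete, the sketch below reduces each one to a flat map of ``key -> input + output rate``. It is an illustration under stated assumptions, not Plano's parser: the function names and sample payloads are hypothetical, the prices are invented, and the DigitalOcean variant keys entries by ``model_id`` alone because the table does not show which field carries the creator name.

```python
from typing import Any, Dict


def parse_models_dev(catalog: Dict[str, Any]) -> Dict[str, float]:
    """Flatten { <provider>: { models: { <model>: { cost: {input, output} } } } }."""
    prices: Dict[str, float] = {}
    for provider, entry in catalog.items():
        if not isinstance(entry, dict):
            continue
        for model, info in entry.get("models", {}).items():
            cost = info.get("cost") or {}
            if "input" in cost and "output" in cost:
                prices[f"{provider}/{model}"] = cost["input"] + cost["output"]
    return prices


def parse_digitalocean(catalog: Dict[str, Any]) -> Dict[str, float]:
    """Flatten { data: [ { model_id, pricing: {...per_million} } ] }.

    Assumption: keyed by model_id only; the creator prefix is left out here.
    """
    prices: Dict[str, float] = {}
    for item in catalog.get("data", []):
        pricing = item.get("pricing") or {}
        if "input_price_per_million" in pricing and "output_price_per_million" in pricing:
            prices[item["model_id"]] = (
                pricing["input_price_per_million"] + pricing["output_price_per_million"]
            )
    return prices


# Hypothetical payloads with invented prices.
models_dev_sample = {
    "anthropic": {
        "models": {
            "claude-sonnet-4-5": {"cost": {"input": 3.0, "output": 15.0}},
            "model-without-cost": {},
        }
    }
}
digitalocean_sample = {
    "data": [
        {
            "model_id": "example-model",
            "pricing": {"input_price_per_million": 0.5, "output_price_per_million": 1.5},
        },
        {"model_id": "model-without-pricing"},
    ]
}

models_dev_prices = parse_models_dev(models_dev_sample)
digitalocean_prices = parse_digitalocean(digitalocean_sample)
print(models_dev_prices, digitalocean_prices)
```

Entries without complete pricing are skipped rather than treated as free, which matches the behaviour described earlier where models with no data are ranked last.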
Because the response schema is selected by ``provider``, switching catalogs is a one-line
change. ``planoai obs`` falls back to the DigitalOcean catalog when
``model_metrics_sources`` is omitted. For routing, declare the cost source explicitly:

.. code-block:: yaml
    :caption: Default cost source (DigitalOcean)

    model_metrics_sources:
      - type: cost
        provider: digitalocean   # default; uses the public DO GenAI catalog

To switch to models.dev, an open, community-maintained catalog covering a broad range of
providers and models, change the ``provider`` (and optionally ``url``):

.. code-block:: yaml
    :caption: Cost source backed by models.dev

    model_metrics_sources:
      - type: cost
        provider: models.dev               # models.dev | digitalocean
        url: https://models.dev/api.json   # optional; defaults per provider
        refresh_interval: 3600             # optional, seconds; refetch on this interval
        model_aliases:                     # optional; see below
          openai/gpt-oss-120b: openai/gpt-4o

To use your own endpoint, pick the ``provider`` whose structure your endpoint matches and
override ``url``. Plano parses the response with that provider's schema:

.. code-block:: yaml
    :caption: Custom endpoint exposing the DigitalOcean catalog structure

    model_metrics_sources:
      - type: cost
        provider: digitalocean   # selects the DO response schema
        url: https://catalog.internal.example.com/pricing

.. note::
    The cost metric used for ranking is the sum of the input and output per-million-token
    rates. It is a relative signal for ordering candidates, not a per-request bill. For
    actual per-request cost, see the observability console below.

Matching catalog keys to your models
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The router looks up each candidate model by the exact name you use in
``routing_preferences`` (e.g. ``anthropic/claude-sonnet-4-5``). models.dev keys models as
``creator/model``, which lines up with Plano's ``provider/model`` naming, so most models
match automatically.
When a catalog key does not match your model name (for example a version skew, or an
open-weight model you serve under a different provider), use ``model_aliases`` to map the
**catalog key** to the **Plano model name** used in your routing preferences:

.. code-block:: yaml

    model_metrics_sources:
      - type: cost
        provider: models.dev
        model_aliases:
          # catalog key : plano model name
          openai/gpt-oss-120b: openai/gpt-4o

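The alias mapping above amounts to renaming catalog keys before lookup. The sketch below shows that idea, assuming a flat ``key -> price`` map has already been parsed. ``apply_aliases`` is a hypothetical helper and the prices are invented. How Plano resolves a collision between an aliased name and an existing catalog key is not stated here, so the sketch simply lets aliased entries take precedence.

```python
from typing import Dict


def apply_aliases(prices: Dict[str, float], aliases: Dict[str, str]) -> Dict[str, float]:
    """Re-key catalog prices by Plano model name (hypothetical helper).

    ``aliases`` maps catalog key -> Plano model name, as in ``model_aliases``.
    """
    # Start with catalog entries that need no renaming.
    renamed = {key: value for key, value in prices.items() if key not in aliases}
    # Assumption: an aliased entry overrides a catalog entry with the same name.
    for catalog_key, plano_name in aliases.items():
        if catalog_key in prices:
            renamed[plano_name] = prices[catalog_key]
    return renamed


# Invented prices, for illustration only.
catalog_prices = {
    "openai/gpt-oss-120b": 0.75,
    "anthropic/claude-sonnet-4-5": 18.0,
}
aliases = {"openai/gpt-oss-120b": "openai/gpt-4o"}

plano_prices = apply_aliases(catalog_prices, aliases)
print(plano_prices)
```

After renaming, a lookup by the name used in ``routing_preferences`` (here ``openai/gpt-4o``) finds the price published under the catalog key.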
Latency source
~~~~~~~~~~~~~~~
``fastest`` routing reads observed latency from a Prometheus instance. Provide the query
that returns a per-model latency value (lower is faster), labelled by ``model_name``:

.. code-block:: yaml
    :caption: Latency source backed by Prometheus

    model_metrics_sources:
      - type: latency
        provider: prometheus
        url: http://prometheus:9090
        query: avg by (model_name) (rate(plano_llm_latency_seconds_sum[5m]))
        refresh_interval: 60

You can declare both a ``cost`` and a ``latency`` source at the same time; each route
picks whichever it needs based on its ``selection_policy``.
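For the Prometheus latency source, the sketch below turns an instant-query result into a ``model_name -> latency`` map. It assumes the standard Prometheus HTTP API response shape for a vector result (``data.result[*].metric`` and ``data.result[*].value``). ``latency_by_model`` is a hypothetical helper, the sample payload is invented, and the way Plano itself consumes the response may differ.

```python
import math
from typing import Any, Dict


def latency_by_model(response: Dict[str, Any]) -> Dict[str, float]:
    """Extract one latency value per ``model_name`` label (hypothetical helper)."""
    if response.get("status") != "success":
        return {}
    latencies: Dict[str, float] = {}
    for sample in response.get("data", {}).get("result", []):
        name = sample.get("metric", {}).get("model_name")
        if name is None:
            continue  # series without the expected label cannot be matched to a model
        # Instant-vector samples are [unix_timestamp, "value-as-string"].
        value = float(sample["value"][1])
        if math.isnan(value):
            continue  # treat NaN as "no data" so the model is ranked last
        latencies[name] = value
    return latencies


# Invented response payload in the instant-query vector format.
sample_response = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"model_name": "anthropic/claude-sonnet-4-5"}, "value": [1750000000.0, "0.42"]},
            {"metric": {"model_name": "groq/llama-3.3-70b-versatile"}, "value": [1750000000.0, "0.11"]},
            {"metric": {}, "value": [1750000000.0, "0.9"]},
            {"metric": {"model_name": "example/idle-model"}, "value": [1750000000.0, "NaN"]},
        ],
    },
}

latencies = latency_by_model(sample_response)
print(latencies)
```

Series that lack the ``model_name`` label or report ``NaN`` are dropped, so the corresponding models fall into the "no data" group.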
Cost in the observability console
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
``planoai obs`` displays a per-request USD cost column derived from the same pricing
catalog. By default it reads the ``cost`` source from your config (the first
``type: cost`` entry under ``model_metrics_sources``); you can also override it on the
command line:

.. code-block:: bash

    # Use the cost source from ./config.yaml (default)
    planoai obs

    # Or override the provider / endpoint explicitly
    planoai obs --pricing-provider models.dev
    planoai obs --pricing-url https://models.dev/api.json

If no source is configured and no override is given, ``planoai obs`` falls back to the
DigitalOcean catalog so the cost column still populates out of the box.
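The difference between the ranking metric and a per-request cost can be shown with a short calculation. The formula below is the usual token-based estimate (tokens divided by one million, multiplied by the per-million rate). It is an assumption about how such a column could be derived, not a description of ``planoai obs`` internals, and the rates and token counts are invented.

```python
def ranking_metric(input_rate: float, output_rate: float) -> float:
    """Relative signal used to order candidates: input + output per-million rates."""
    return input_rate + output_rate


def request_cost_usd(
    input_tokens: int,
    output_tokens: int,
    input_rate: float,
    output_rate: float,
) -> float:
    """Estimated USD cost of one request, given per-million-token rates."""
    return (input_tokens / 1_000_000) * input_rate + (output_tokens / 1_000_000) * output_rate


# Invented rates (USD per million tokens) and token counts.
input_rate, output_rate = 3.0, 15.0
metric = ranking_metric(input_rate, output_rate)
cost = request_cost_usd(1200, 300, input_rate, output_rate)
print(f"ranking metric: {metric}, request cost: ${cost:.6f}")
```

The ranking metric depends only on the catalog, while the per-request figure also depends on how many input and output tokens a given request used.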
Plano-Orchestrator
-------------------
Plano-Orchestrator is a **preference-based routing model** specifically designed to address the limitations of traditional LLM routing. It delivers production-ready performance with low latency and high accuracy while solving key routing challenges.


@ -86,6 +86,24 @@ routing_preferences:
selection_policy:
prefer: cheapest
# model_metrics_sources: external catalogs the router reads to reorder candidate
# models for selection_policy.prefer. A `cost` source ranks `prefer: cheapest`;
# a `latency` source ranks `prefer: fastest`. Both are optional.
model_metrics_sources:
  # Cost catalog. provider: models.dev | digitalocean (default url per provider).
  - type: cost
    provider: models.dev
    url: https://models.dev/api.json   # optional; omit to use the provider default
    refresh_interval: 3600             # optional, seconds
    model_aliases:                     # optional: catalog key -> Plano model name
      openai/gpt-oss-120b: openai/gpt-4o
  # Latency catalog (Prometheus). Used for selection_policy.prefer: fastest.
  - type: latency
    provider: prometheus
    url: http://prometheus:9090
    query: avg by (model_name) (rate(plano_llm_latency_seconds_sum[5m]))
    refresh_interval: 60
# HTTP listeners - entry points for agent routing, prompt targets, and direct LLM access
listeners:
# Agent listener for routing requests to multiple agents


@ -115,6 +115,18 @@ model_aliases:
    target: gpt-4o-mini
  smart-llm:
    target: gpt-4o
model_metrics_sources:
- model_aliases:
    openai/gpt-oss-120b: openai/gpt-4o
  provider: models.dev
  refresh_interval: 3600
  type: cost
  url: https://models.dev/api.json
- provider: prometheus
  query: avg by (model_name) (rate(plano_llm_latency_seconds_sum[5m]))
  refresh_interval: 60
  type: latency
  url: http://prometheus:9090
model_providers:
- access_key: $OPENAI_API_KEY
  default: true