rename session pinning to model affinity with x-model-affinity header
This commit is contained in:
parent 5789694d2f · commit da9792c2dd

14 changed files with 468 additions and 371 deletions
135  demos/llm_routing/model_affinity/README.md  (Normal file)
@@ -0,0 +1,135 @@
# Model Affinity Demo

> Consistent model selection for agentic loops using `X-Model-Affinity`.

## Why Model Affinity?

When an agent runs in a loop — calling tools, reasoning about results, calling more tools — each LLM request hits Plano's router independently. Because prompts vary in intent (tool selection looks like code generation, reasoning about results looks like complex analysis), the router may select **different models** for each turn, fragmenting context mid-session.

**Model affinity** solves this: send an `X-Model-Affinity` header and the first request runs routing as usual, caching the decision. Every subsequent request with the same affinity ID returns the **same model**, without re-running the router.

```
Without affinity                          With affinity (X-Model-Affinity)
────────────────                          ───────────────────────────────
Turn 1 → claude-sonnet (tool calls)       Turn 1 → claude-sonnet ← routed
Turn 2 → gpt-4o (reasoning)               Turn 2 → claude-sonnet ← pinned ✓
Turn 3 → claude-sonnet (tool calls)       Turn 3 → claude-sonnet ← pinned ✓
Turn 4 → gpt-4o (reasoning)               Turn 4 → claude-sonnet ← pinned ✓
Turn 5 → claude-sonnet (final answer)     Turn 5 → claude-sonnet ← pinned ✓

↑ model switches every turn               ↑ one model, start to finish
```

---

## Quick Start

```bash
# 1. Set API keys
export OPENAI_API_KEY=<your-key>
export ANTHROPIC_API_KEY=<your-key>

# 2. Start Plano
cd demos/llm_routing/model_affinity
planoai up config.yaml

# 3. Run the demo (uv manages dependencies automatically)
./demo.sh    # or: uv run demo.py
```

---

## What the Demo Does

A **database selection agent** investigates whether to use PostgreSQL or MongoDB
for an e-commerce platform. It runs a real tool-calling loop: the LLM decides
which tools to call, receives simulated results, and continues until it has
enough data to recommend a database.

Available tools:

- `get_db_benchmarks` — fetch performance data for a workload type
- `get_case_studies` — retrieve real-world e-commerce case studies
- `check_feature_support` — check if a database supports a specific feature

The demo runs the **same agent loop twice**:

1. **Without affinity** — no `X-Model-Affinity`; models may switch between turns
2. **With affinity** — `X-Model-Affinity` header included; model is pinned from turn 1

Each turn is a separate `POST /v1/chat/completions` request to Plano using the
[OpenAI SDK](https://github.com/openai/openai-python). The demo prints the
model used on each turn so you can see the difference.

### Expected Output

```
Run 1: WITHOUT Model Affinity
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
turn 1  [claude-sonnet-4-20250514      ]  get_db_benchmarks, get_db_benchmarks
turn 2  [gpt-4o                        ]  get_case_studies, get_case_studies ← switched
turn 3  [claude-sonnet-4-20250514      ]  check_feature_support ← switched
turn 4  [gpt-4o                        ]  final answer ← switched

✗ Without affinity: model switched 3 time(s)


Run 2: WITH Model Affinity (X-Model-Affinity: a1b2c3d4…)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
turn 1  [claude-sonnet-4-20250514      ]  get_db_benchmarks, get_db_benchmarks
turn 2  [claude-sonnet-4-20250514      ]  get_case_studies, get_case_studies
turn 3  [claude-sonnet-4-20250514      ]  check_feature_support
turn 4  [claude-sonnet-4-20250514      ]  final answer

✓ With affinity: claude-sonnet-4-20250514 for all 4 turns
```

### How It Works

Model affinity is implemented in brightstaff. When `X-Model-Affinity` is present:

1. **First request** — routing runs normally and the result is cached, keyed by the affinity ID
2. **Subsequent requests** — a cache hit skips routing and returns the cached model instantly

The `X-Model-Affinity` header is forwarded transparently; no changes are needed to
your OpenAI SDK calls beyond adding the header.

```python
from openai import OpenAI
import uuid

client = OpenAI(base_url="http://localhost:12000/v1", api_key="EMPTY")

affinity_id = str(uuid.uuid4())

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
    extra_headers={"X-Model-Affinity": affinity_id},
)
```
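
Reuse the same `affinity_id` for every request in the session; a fresh UUID starts a new affinity session, so routing runs again on its first request.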

---

## Configuration

Model affinity is configurable in `config.yaml`:

```yaml
routing:
  session_ttl_seconds: 600       # How long affinity lasts (default: 10 min)
  session_max_entries: 10000     # Max cached affinity sessions (upper limit: 10000)
```

Without the `X-Model-Affinity` header, routing runs fresh every time — no breaking
change to existing clients.
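
For intuition, these two settings describe a time-bounded, size-capped cache from affinity ID to routing decision. The sketch below is illustrative only; the real logic lives in brightstaff, and the `AffinityCache` class and `route` callback are hypothetical stand-ins:

```python
import time
from collections import OrderedDict


class AffinityCache:
    """Illustrative sketch of the affinity-cache semantics (not the brightstaff code)."""

    def __init__(self, ttl_seconds: int = 600, max_entries: int = 10000):
        self.ttl = ttl_seconds
        self.max_entries = max_entries
        self._entries: OrderedDict[str, tuple[str, float]] = OrderedDict()

    def resolve(self, affinity_id: str, route) -> str:
        """Return the pinned model, calling route() only on a miss or expiry."""
        now = time.monotonic()
        entry = self._entries.get(affinity_id)
        if entry is not None and now - entry[1] < self.ttl:
            self._entries.move_to_end(affinity_id)  # keep active sessions warm
            return entry[0]                         # cache hit: skip the router
        model = route()                             # first request: run routing
        self._entries[affinity_id] = (model, now)
        if len(self._entries) > self.max_entries:   # evict the oldest session
            self._entries.popitem(last=False)
        return model
```

An expired or evicted entry simply means the next request with that affinity ID is treated like a first request: routing runs again and the new decision is cached.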

---

## Advanced: Agent Server Demo

The `agent.py` file is a FastAPI-based agent server that demonstrates a more
complex pattern: an external agent service that forwards `X-Model-Affinity`
on all outbound calls to Plano. Use `start_agents.sh` to run it.
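
With the server running, any HTTP client can drive a full research session through it. A minimal sketch using `requests` (an assumed extra dependency, not part of the demo's declared ones; the port is the `AGENT_PORT` default and `routing_trace` comes from the response `agent.py` builds):

```python
import uuid

import requests  # assumption: installed separately; not a demo dependency

affinity_id = str(uuid.uuid4())
resp = requests.post(
    "http://localhost:8000/v1/chat/completions",  # AGENT_PORT default in agent.py
    headers={"X-Model-Affinity": affinity_id},
    json={"messages": [{"role": "user", "content": "pick a database"}]},
    timeout=300,  # the agent makes several LLM round-trips per request
)
body = resp.json()
for step in body["routing_trace"]:
    print(f"{step['task']}: {step['model']}")  # one model across all tasks when pinned
print(body["choices"][0]["message"]["content"])
```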

## See Also

- [Model Routing Service Demo](../model_routing_service/) — curl-based examples of the routing endpoint
429  demos/llm_routing/model_affinity/agent.py  (Normal file)
@@ -0,0 +1,429 @@
#!/usr/bin/env -S uv run --script
# /// script
# requires-python = ">=3.12"
# dependencies = ["fastapi>=0.115", "uvicorn>=0.30", "openai>=1.0.0"]
# ///
"""
Research Agent — FastAPI service exposing /v1/chat/completions.

For each incoming request the agent runs 3 independent research tasks,
each with its own tool-calling loop. The tasks deliberately alternate between
code_generation and complex_reasoning intents so Plano's preference-based
router selects different models for each task.

If the client sends X-Model-Affinity, the agent forwards it on every outbound
call to Plano. The first task pins the model; all subsequent tasks skip the
router and reuse it — keeping the whole session on one consistent model.

Run standalone:
    uv run agent.py
    PLANO_URL=http://myhost:12000 AGENT_PORT=8000 uv run agent.py
"""

import json
import logging
import os
import uuid

import uvicorn
from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse
from openai import AsyncOpenAI
from openai.types.chat import ChatCompletionMessageParam

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s [AGENT] %(levelname)s %(message)s",
)
log = logging.getLogger(__name__)

PLANO_URL = os.environ.get("PLANO_URL", "http://localhost:12000")
PORT = int(os.environ.get("AGENT_PORT", "8000"))

# ---------------------------------------------------------------------------
# Tasks — each has its own conversation so Plano routes each independently.
# Intent alternates: code_generation → complex_reasoning → code_generation.
# ---------------------------------------------------------------------------

TASKS = [
    {
        "name": "generate_comparison",
        # Triggers code_generation routing preference (write/generate output)
        "prompt": (
            "Use the tools to fetch benchmark data for PostgreSQL and MongoDB "
            "under a mixed workload. Then generate a compact Markdown comparison "
            "table with columns: metric, PostgreSQL, MongoDB. Cover read QPS, "
            "write QPS, p99 latency ms, ACID support, and horizontal scaling."
        ),
    },
    {
        "name": "analyse_tradeoffs",
        # Triggers complex_reasoning routing preference (analyse/reason/evaluate)
        "prompt": (
            "Context from prior research:\n{context}\n\n"
            "Perform a deep analysis: for a high-traffic e-commerce platform that "
            "requires ACID guarantees for order processing but flexible schemas for "
            "product attributes, carefully reason through and evaluate the long-term "
            "architectural trade-offs of each database. Consider consistency "
            "guarantees, operational complexity, and scalability risks."
        ),
    },
    {
        "name": "write_schema",
        # Triggers code_generation routing preference (write SQL / generate code)
        "prompt": (
            "Context from prior research:\n{context}\n\n"
            "Write the CREATE TABLE SQL schema for the database you would recommend "
            "from the analysis above. Include: orders, order_items, products, and "
            "users tables with appropriate primary keys, foreign keys, and indexes."
        ),
    },
]

SYSTEM_PROMPT = (
    "You are a database selection analyst for an e-commerce platform. "
    "Use the available tools when you need data. "
    "Be concise — each response should be a compact table, code block, "
    "or 3–5 clear sentences."
)

# ---------------------------------------------------------------------------
# Tool definitions
# ---------------------------------------------------------------------------

TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "get_db_benchmarks",
            "description": (
                "Fetch performance benchmark data for a database. "
                "Returns read/write throughput, latency, and scaling characteristics."
            ),
            "parameters": {
                "type": "object",
                "properties": {
                    "database": {
                        "type": "string",
                        "enum": ["postgresql", "mongodb"],
                    },
                    "workload": {
                        "type": "string",
                        "enum": ["read_heavy", "write_heavy", "mixed"],
                    },
                },
                "required": ["database", "workload"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "get_case_studies",
            "description": "Retrieve e-commerce case studies for a database.",
            "parameters": {
                "type": "object",
                "properties": {
                    "database": {"type": "string", "enum": ["postgresql", "mongodb"]},
                },
                "required": ["database"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "check_feature_support",
            "description": (
                "Check whether a database supports a specific feature "
                "(e.g. ACID transactions, horizontal sharding, JSON documents)."
            ),
            "parameters": {
                "type": "object",
                "properties": {
                    "database": {"type": "string", "enum": ["postgresql", "mongodb"]},
                    "feature": {"type": "string"},
                },
                "required": ["database", "feature"],
            },
        },
    },
]

# ---------------------------------------------------------------------------
# Tool implementations (simulated — no external calls)
# ---------------------------------------------------------------------------

_BENCHMARKS = {
    ("postgresql", "read_heavy"): {
        "read_qps": 55_000,
        "write_qps": 18_000,
        "p99_ms": 4,
        "notes": "Excellent for complex joins; connection pooling via pgBouncer recommended",
    },
    ("postgresql", "write_heavy"): {
        "read_qps": 30_000,
        "write_qps": 24_000,
        "p99_ms": 8,
        "notes": "WAL overhead increases at very high write volume; partitioning helps",
    },
    ("postgresql", "mixed"): {
        "read_qps": 42_000,
        "write_qps": 21_000,
        "p99_ms": 6,
        "notes": "Solid all-round; MVCC keeps reads non-blocking",
    },
    ("mongodb", "read_heavy"): {
        "read_qps": 85_000,
        "write_qps": 30_000,
        "p99_ms": 2,
        "notes": "Atlas Search built-in; sharding distributes read load well",
    },
    ("mongodb", "write_heavy"): {
        "read_qps": 40_000,
        "write_qps": 65_000,
        "p99_ms": 3,
        "notes": "WiredTiger compression reduces I/O; journal writes are async-safe",
    },
    ("mongodb", "mixed"): {
        "read_qps": 60_000,
        "write_qps": 50_000,
        "p99_ms": 3,
        "notes": "Flexible schema accelerates feature iteration",
    },
}

_CASE_STUDIES = {
    "postgresql": [
        {
            "company": "Shopify",
            "scale": "100 B+ req/day",
            "notes": "Moved critical order tables back to Postgres for ACID guarantees",
        },
        {
            "company": "Zalando",
            "scale": "50 M customers",
            "notes": "Uses Postgres + Citus for sharded order processing",
        },
        {
            "company": "Instacart",
            "scale": "10 M orders/mo",
            "notes": "Postgres for inventory; strict consistency required for stock levels",
        },
    ],
    "mongodb": [
        {
            "company": "eBay",
            "scale": "1.5 B listings",
            "notes": "Product catalogue in MongoDB for flexible attribute schemas",
        },
        {
            "company": "Alibaba",
            "scale": "billions of docs",
            "notes": "Session and cart data in MongoDB; high write throughput",
        },
        {
            "company": "Foursquare",
            "scale": "10 B+ check-ins",
            "notes": "Geospatial queries and flexible location schemas",
        },
    ],
}

_FEATURES = {
    ("postgresql", "acid transactions"): {
        "supported": True,
        "notes": "Full ACID with serialisable isolation",
    },
    ("postgresql", "horizontal sharding"): {
        "supported": True,
        "notes": "Via Citus extension or manual partitioning; not native",
    },
    ("postgresql", "json documents"): {
        "supported": True,
        "notes": "JSONB with indexing; flexible but slower than native doc store",
    },
    ("postgresql", "full-text search"): {
        "supported": True,
        "notes": "Built-in tsvector/tsquery; Elasticsearch for advanced use cases",
    },
    ("postgresql", "multi-document transactions"): {
        "supported": True,
        "notes": "Native cross-table ACID",
    },
    ("mongodb", "acid transactions"): {
        "supported": True,
        "notes": "Multi-document ACID since v4.0; single-doc always atomic",
    },
    ("mongodb", "horizontal sharding"): {
        "supported": True,
        "notes": "Native sharding; auto-balancing across shards",
    },
    ("mongodb", "json documents"): {
        "supported": True,
        "notes": "Native BSON document model; schema-free by default",
    },
    ("mongodb", "full-text search"): {
        "supported": True,
        "notes": "Atlas Search (Lucene-based) for advanced full-text",
    },
    ("mongodb", "multi-document transactions"): {
        "supported": True,
        "notes": "Available but adds latency; best avoided on hot paths",
    },
}


def _dispatch(name: str, args: dict) -> str:
    if name == "get_db_benchmarks":
        key = (args["database"].lower(), args["workload"].lower())
        return json.dumps(_BENCHMARKS.get(key, {"error": f"no data for {key}"}))

    if name == "get_case_studies":
        db = args["database"].lower()
        return json.dumps(_CASE_STUDIES.get(db, {"error": f"unknown db '{db}'"}))

    if name == "check_feature_support":
        key = (args["database"].lower(), args["feature"].lower())
        for k, v in _FEATURES.items():
            if k[0] == key[0] and k[1] in key[1]:
                return json.dumps(v)
        return json.dumps({"error": f"feature '{args['feature']}' not in dataset"})

    return json.dumps({"error": f"unknown tool '{name}'"})


# ---------------------------------------------------------------------------
# Task runner — one independent conversation per task
# ---------------------------------------------------------------------------


async def run_task(
    client: AsyncOpenAI,
    task_name: str,
    prompt: str,
    session_id: str | None,
) -> tuple[str, str]:
    """
    Run a single research task with its own tool-calling loop.

    Each task is an independent conversation so the router sees only
    this task's intent — not the accumulated context of previous tasks.
    Model affinity via X-Model-Affinity pins the model from the first task
    onward, so all tasks stay on the same model.

    Returns (answer, first_model_used).
    """
    headers = {"X-Model-Affinity": session_id} if session_id else {}
    messages: list[ChatCompletionMessageParam] = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": prompt},
    ]
    first_model: str | None = None

    while True:
        resp = await client.chat.completions.create(
            model="gpt-4o-mini",  # Plano's router overrides this via routing_preferences
            messages=messages,
            tools=TOOLS,
            tool_choice="auto",
            max_completion_tokens=600,
            extra_headers=headers or None,
        )
        if first_model is None:
            first_model = resp.model

        log.info(
            "task=%s model=%s finish=%s",
            task_name,
            resp.model,
            resp.choices[0].finish_reason,
        )

        choice = resp.choices[0]
        if choice.finish_reason == "tool_calls" and choice.message.tool_calls:
            messages.append(choice.message)
            for tc in choice.message.tool_calls:
                args = json.loads(tc.function.arguments or "{}")
                result = _dispatch(tc.function.name, args)
                log.info("  tool %s(%s)", tc.function.name, args)
                messages.append(
                    {"role": "tool", "content": result, "tool_call_id": tc.id}
                )
        else:
            return (choice.message.content or "").strip(), first_model or "unknown"


# ---------------------------------------------------------------------------
# Research loop — runs all tasks, threading context forward
# ---------------------------------------------------------------------------


async def run_research_loop(
    client: AsyncOpenAI,
    session_id: str | None,
) -> tuple[str, list[dict]]:
    """
    Run all 3 research tasks in sequence, passing each task's output as
    context to the next. Returns (final_answer, routing_trace).
    """
    context = ""
    trace: list[dict] = []
    final_answer = ""

    for task in TASKS:
        prompt = task["prompt"].format(context=context)
        answer, model = await run_task(client, task["name"], prompt, session_id)
        trace.append({"task": task["name"], "model": model})
        context += f"\n### {task['name']}\n{answer}\n"
        final_answer = answer

    return final_answer, trace


# ---------------------------------------------------------------------------
# FastAPI app
# ---------------------------------------------------------------------------

app = FastAPI(title="Research Agent", version="1.0.0")


@app.post("/v1/chat/completions")
async def chat(request: Request) -> JSONResponse:
    # Body is parsed for OpenAI-API compatibility but not used; the agent
    # runs its fixed research tasks regardless of the incoming messages.
    body = await request.json()
    session_id: str | None = request.headers.get("x-model-affinity")

    log.info("request session_id=%s", session_id or "none")

    client = AsyncOpenAI(base_url=f"{PLANO_URL}/v1", api_key="EMPTY")
    answer, trace = await run_research_loop(client, session_id)

    return JSONResponse(
        {
            "id": f"chatcmpl-{uuid.uuid4().hex[:8]}",
            "object": "chat.completion",
            "choices": [
                {
                    "index": 0,
                    "message": {"role": "assistant", "content": answer},
                    "finish_reason": "stop",
                }
            ],
            "routing_trace": trace,
            "session_id": session_id,
        }
    )


@app.get("/health")
async def health() -> dict:
    return {"status": "ok", "plano_url": PLANO_URL}


# ---------------------------------------------------------------------------
# Entry point
# ---------------------------------------------------------------------------

if __name__ == "__main__":
    log.info("starting on port %d plano=%s", PORT, PLANO_URL)
    uvicorn.run(app, host="0.0.0.0", port=PORT, log_level="warning")
27  demos/llm_routing/model_affinity/config.yaml  (Normal file)
@@ -0,0 +1,27 @@
version: v0.3.0

listeners:
  - type: model
    name: model_listener
    port: 12000

model_providers:

  - model: openai/gpt-4o-mini
    access_key: $OPENAI_API_KEY
    default: true

  - model: openai/gpt-4o
    access_key: $OPENAI_API_KEY
    routing_preferences:
      - name: complex_reasoning
        description: complex reasoning tasks, multi-step analysis, or detailed explanations

  - model: anthropic/claude-sonnet-4-20250514
    access_key: $ANTHROPIC_API_KEY
    routing_preferences:
      - name: code_generation
        description: generating new code, writing functions, or creating boilerplate

tracing:
  random_sampling: 100
307  demos/llm_routing/model_affinity/demo.py  (Normal file)
@@ -0,0 +1,307 @@
#!/usr/bin/env -S uv run --script
# /// script
# requires-python = ">=3.12"
# dependencies = ["openai>=1.0.0"]
# ///
"""
Model Affinity Demo — Agentic Tool-Calling Loop

Runs the same agentic loop twice through Plano:
  1. Without model affinity — the router may pick different models per turn
  2. With model affinity — all turns use the model selected on turn 1

Each loop is a real tool-calling agent: the LLM decides which tools to call,
we provide simulated results, and the LLM continues until it has enough
information to produce a final answer. Each turn is a separate request to
Plano, so the router classifies intent independently every time.

Usage:
    planoai up config.yaml      # start Plano
    uv run demo.py              # run this demo
"""

import asyncio
import json
import os
import uuid

from openai import AsyncOpenAI
from openai.types.chat import ChatCompletionMessageParam

PLANO_URL = os.environ.get("PLANO_URL", "http://localhost:12000")

SYSTEM_PROMPT = (
    "You are a database selection analyst. Use the provided tools to gather "
    "benchmark data and case studies, then recommend PostgreSQL or MongoDB "
    "for a high-traffic e-commerce backend. Be concise."
)

USER_QUERY = (
    "Should we use PostgreSQL or MongoDB for our e-commerce platform? "
    "We need strong consistency for orders but flexible schemas for products. "
    "Use the tools to research both options, then give a recommendation."
)

TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "get_db_benchmarks",
            "description": "Fetch performance benchmarks for a database under a given workload.",
            "parameters": {
                "type": "object",
                "properties": {
                    "database": {
                        "type": "string",
                        "enum": ["postgresql", "mongodb"],
                    },
                    "workload": {
                        "type": "string",
                        "enum": ["read_heavy", "write_heavy", "mixed"],
                    },
                },
                "required": ["database", "workload"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "get_case_studies",
            "description": "Retrieve real-world e-commerce case studies for a database.",
            "parameters": {
                "type": "object",
                "properties": {
                    "database": {
                        "type": "string",
                        "enum": ["postgresql", "mongodb"],
                    },
                },
                "required": ["database"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "check_feature_support",
            "description": "Check if a database supports a specific feature.",
            "parameters": {
                "type": "object",
                "properties": {
                    "database": {
                        "type": "string",
                        "enum": ["postgresql", "mongodb"],
                    },
                    "feature": {"type": "string"},
                },
                "required": ["database", "feature"],
            },
        },
    },
]

# Simulated tool responses
_BENCHMARKS = {
    ("postgresql", "mixed"): {
        "read_qps": 42000,
        "write_qps": 21000,
        "p99_ms": 6,
        "notes": "Solid all-round; MVCC keeps reads non-blocking",
    },
    ("mongodb", "mixed"): {
        "read_qps": 60000,
        "write_qps": 50000,
        "p99_ms": 3,
        "notes": "Flexible schema accelerates feature iteration",
    },
}

_CASE_STUDIES = {
    "postgresql": [
        {"company": "Shopify", "notes": "Moved orders back to Postgres for ACID"},
        {
            "company": "Zalando",
            "notes": "Postgres + Citus for sharded order processing",
        },
    ],
    "mongodb": [
        {"company": "eBay", "notes": "Product catalogue — flexible attribute schemas"},
        {"company": "Alibaba", "notes": "Session/cart data — high write throughput"},
    ],
}

_FEATURES = {
    ("postgresql", "acid transactions"): {"supported": True, "notes": "Full ACID"},
    ("mongodb", "acid transactions"): {
        "supported": True,
        "notes": "Multi-doc ACID since v4.0",
    },
    ("postgresql", "horizontal sharding"): {
        "supported": True,
        "notes": "Via Citus extension",
    },
    ("mongodb", "horizontal sharding"): {
        "supported": True,
        "notes": "Native auto-balancing",
    },
}


def dispatch_tool(name: str, args: dict) -> str:
    if name == "get_db_benchmarks":
        key = (args["database"], args["workload"])
        return json.dumps(_BENCHMARKS.get(key, {"error": f"no data for {key}"}))
    if name == "get_case_studies":
        return json.dumps(_CASE_STUDIES.get(args["database"], {"error": "unknown db"}))
    if name == "check_feature_support":
        key = (args["database"], args["feature"].lower())
        for k, v in _FEATURES.items():
            if k[0] == key[0] and k[1] in key[1]:
                return json.dumps(v)
        return json.dumps({"error": f"no data for {key}"})
    return json.dumps({"error": f"unknown tool {name}"})


# ---------------------------------------------------------------------------
# Agentic loop — runs tool calls until the LLM produces a final answer
# ---------------------------------------------------------------------------


async def run_agent_loop(
    affinity_id: str | None = None,
    max_turns: int = 10,
) -> tuple[str, list[dict]]:
    """
    Run a tool-calling agent loop against Plano.

    Returns (final_answer, trace) where trace is a list of
    {"turn": int, "model": str, "tool_calls": [...]} dicts.
    """
    client = AsyncOpenAI(base_url=f"{PLANO_URL}/v1", api_key="EMPTY")
    headers = {"X-Model-Affinity": affinity_id} if affinity_id else None

    messages: list[ChatCompletionMessageParam] = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": USER_QUERY},
    ]
    trace: list[dict] = []

    for turn in range(1, max_turns + 1):
        resp = await client.chat.completions.create(
            model="gpt-4o-mini",
            messages=messages,
            tools=TOOLS,
            tool_choice="auto",
            max_completion_tokens=800,
            extra_headers=headers,
        )

        choice = resp.choices[0]
        turn_info: dict = {"turn": turn, "model": resp.model}

        if choice.finish_reason == "tool_calls" and choice.message.tool_calls:
            tool_names = [tc.function.name for tc in choice.message.tool_calls]
            turn_info["tool_calls"] = tool_names
            trace.append(turn_info)

            messages.append(choice.message)
            for tc in choice.message.tool_calls:
                args = json.loads(tc.function.arguments or "{}")
                result = dispatch_tool(tc.function.name, args)
                messages.append(
                    {"role": "tool", "content": result, "tool_call_id": tc.id}
                )
        else:
            turn_info["tool_calls"] = []
            trace.append(turn_info)
            return (choice.message.content or "").strip(), trace

    return "(max turns reached)", trace


# ---------------------------------------------------------------------------
# Display helpers
# ---------------------------------------------------------------------------


def short_model(model: str) -> str:
    return model.split("/")[-1] if "/" in model else model


def print_trace(trace: list[dict]) -> None:
    for t in trace:
        model = short_model(t["model"])
        tools = ", ".join(t["tool_calls"]) if t["tool_calls"] else "final answer"
        print(f"  turn {t['turn']}  [{model:<30}]  {tools}")


def print_summary(label: str, trace: list[dict]) -> None:
    models = [t["model"] for t in trace]
    unique = set(models)
    if len(unique) == 1:
        print(
            f"  ✓ {label}: {short_model(next(iter(unique)))} "
            f"for all {len(models)} turns"
        )
    else:
        switches = sum(1 for a, b in zip(models, models[1:]) if a != b)
        names = ", ".join(sorted(short_model(m) for m in unique))
        print(f"  ✗ {label}: model switched {switches} time(s) — {names}")


# ---------------------------------------------------------------------------
# Main
# ---------------------------------------------------------------------------


async def main() -> None:
    print()
    print("  ╔══════════════════════════════════════════════════════════╗")
    print("  ║           Model Affinity Demo — Agentic Loop             ║")
    print("  ╚══════════════════════════════════════════════════════════╝")
    print()
    print(f"  Plano : {PLANO_URL}")
    print(f'  Query : "{USER_QUERY[:65]}…"')
    print()
    print("  The agent calls tools (get_db_benchmarks, get_case_studies,")
    print("  check_feature_support) across multiple turns. Each turn is")
    print("  a separate request to Plano — the router classifies intent")
    print("  independently, so different turns may get different models.")
    print()

    # --- Run 1: without affinity ---
    print("  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━")
    print("  Run 1: WITHOUT Model Affinity")
    print("  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━")
    print()
    answer1, trace1 = await run_agent_loop(affinity_id=None)
    print_trace(trace1)
    print()
    print_summary("Without affinity", trace1)
    print()

    # --- Run 2: with affinity ---
    aid = str(uuid.uuid4())
    print("  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━")
    print(f"  Run 2: WITH Model Affinity (X-Model-Affinity: {aid[:8]}…)")
    print("  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━")
    print()
    answer2, trace2 = await run_agent_loop(affinity_id=aid)
    print_trace(trace2)
    print()
    print_summary("With affinity   ", trace2)
    print()

    # --- Final answer ---
    print("  ══ Agent recommendation (affinity session) ════════════════")
    print()
    for line in answer2.splitlines():
        print(f"  {line}")
    print()
    print("  ═══════════════════════════════════════════════════════════")
    print()


if __name__ == "__main__":
    asyncio.run(main())
7  demos/llm_routing/model_affinity/demo.sh  (Executable file)
@@ -0,0 +1,7 @@
#!/bin/bash
set -e

SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"

# Run the demo directly against Plano (no agent server needed)
uv run "$SCRIPT_DIR/demo.py"
28  demos/llm_routing/model_affinity/start_agents.sh  (Executable file)
@@ -0,0 +1,28 @@
#!/bin/bash
set -e

SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
PIDS=()

log() { echo "$(date '+%F %T') - $*"; }

cleanup() {
    log "Stopping agents..."
    for PID in "${PIDS[@]}"; do
        kill "$PID" 2>/dev/null && log "Stopped process $PID"
    done
    exit 0
}

trap cleanup EXIT INT TERM

export PLANO_URL="${PLANO_URL:-http://localhost:12000}"
export AGENT_PORT="${AGENT_PORT:-8000}"

log "Starting research_agent on port $AGENT_PORT..."
uv run "$SCRIPT_DIR/agent.py" &
PIDS+=($!)

for PID in "${PIDS[@]}"; do
    wait "$PID"
done