23 KiB
Plano (ArchGW) — High-Level Architecture
Overview
Plano is an AI-native gateway built on Envoy Proxy, extended with custom WebAssembly (WASM) filters and a native Rust service called Brightstaff. It acts as an intelligent intermediary between client applications, AI agents, and LLM providers — handling intent-based routing, prompt guardrails, function calling, agent orchestration, rate limiting, and multi-provider LLM translation.
┌─────────────────────────────────────────────────────────────────────────────┐
│ Plano Gateway │
│ │
│ ┌──────────────────────────────────────────────────────────────────────┐ │
│ │ Envoy Proxy (L7) │ │
│ │ │ │
│ │ ┌──────────────────┐ ┌──────────────────┐ │ │
│ │ │ prompt_gateway │──────▶│ llm_gateway │ │ │
│ │ │ (WASM) │ │ (WASM) │ │ │
│ │ │ │ │ │ │ │
│ │ │ • Intent matching│ │ • Provider routing│ │ │
│ │ │ • Guardrails │ │ • Auth injection │ │ │
│ │ │ • Function call │ │ • Rate limiting │ │ │
│ │ │ • Prompt targets │ │ • API translation │ │ │
│ │ └──────────────────┘ └────────┬─────────┘ │ │
│ │ │ │ │
│ └───────────────────────────────────────┼──────────────────────────────┘ │
│ │ │
│ ┌───────────────────────────────────────┼──────────────────────────────┐ │
│ │ Brightstaff (Rust HTTP Server :9091) │ │
│ │ │ │
│ │ • LLM request routing (Arch-Router model) │ │
│ │ • Agent orchestration (Plano-Orchestrator model) │ │
│ │ • Conversation state management (memory / PostgreSQL) │ │
│ │ • Function calling handler (Arch-Function model) │ │
│ │ • Observability & signal analysis │ │
│ └──────────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
│ │ │
▼ ▼ ▼
┌──────────┐ ┌──────────────┐ ┌──────────────┐
│ Agents │ │ Developer │ │ LLM Providers│
│ (MCP/HTTP)│ │ APIs │ │ (OpenAI, etc)│
└──────────┘ └──────────────┘ └──────────────┘
The Role of Envoy
Envoy is the data plane of Plano. All client traffic — both inbound prompts and outbound LLM calls — flows through Envoy. It provides:
- L7 HTTP routing based on paths and custom headers
- WASM filter execution for inline request/response transformation
- Connection pooling and TLS to upstream LLM providers
- Retry policies for resilience
- Compression/decompression for LLM streaming responses
Envoy Listeners
Envoy defines six listener types, each serving a distinct role in the request flow:
| Listener | Port | Direction | Purpose |
|---|---|---|---|
ingress_traffic |
10000 (configurable) | Inbound | Client-facing entry point. Forwards all traffic to the prompt gateway listener. |
ingress_traffic_prompt |
10001 | Inbound | Core processing listener. Runs both WASM filters (prompt_gateway → llm_gateway). Routes to LLM providers by x-arch-llm-provider header. |
outbound_api_traffic |
11000 | Internal | Routes to upstream developer APIs and agents using x-arch-upstream header. No WASM filters. |
| Agent listeners | Per-config | Inbound | One per agent listener in config. Routes to Brightstaff with /agents/ path prefix. |
egress_traffic |
12000 (configurable) | Outbound | LLM gateway entry for agents/services reaching LLMs. Routes to Brightstaff for routing decisions. |
egress_traffic_llm |
12001 | Outbound | Final outbound LLM listener. Runs llm_gateway.wasm for auth injection, provider translation, and rate limiting before reaching the actual LLM provider. |
Envoy Clusters
Envoy manages connections to all upstream services:
LLM Provider Clusters — Pre-configured TLS clusters for: OpenAI, Anthropic (Claude), Groq, Mistral, DeepSeek, Gemini, xAI, MoonshotAI, Zhipu, Together AI, and Katanemo's hosted Arch models. Custom-URL providers (e.g., Azure OpenAI, Ollama) are dynamically added from config.
Internal Clusters:
| Cluster | Target | Purpose |
|---|---|---|
bright_staff |
localhost:9091 | The Brightstaff Rust service |
arch_prompt_gateway_listener |
localhost:10001 | Internal forwarding from ingress |
arch_listener_llm |
localhost:12001 | Internal forwarding for LLM egress |
arch_internal |
localhost:11000 | Outbound API router |
Dynamic Clusters — Generated from endpoints and agents config sections (developer APIs, agent services).
Custom Headers Used for Routing
| Header | Set By | Used By | Purpose |
|---|---|---|---|
x-arch-llm-provider |
WASM filters | Envoy routes | Selects the LLM provider cluster |
x-arch-llm-provider-hint |
Brightstaff | llm_gateway | Hints which provider/model to use |
x-arch-upstream / x-arch-upstream-host |
WASM filters / Brightstaff | Envoy routes | Targets a specific agent or API endpoint |
x-arch-is-streaming |
Brightstaff | llm_gateway | Indicates streaming mode |
x-arch-state |
prompt_gateway | prompt_gateway | Carries multi-turn conversation state |
x-arch-tool-call |
prompt_gateway | prompt_gateway | Carries tool call metadata |
x-arch-api-response |
prompt_gateway | prompt_gateway | Carries developer API response data |
x-arch-agent-listener-name |
Envoy | Brightstaff | Identifies which agent listener a request arrived on |
Request Flows
Flow 1: Direct LLM Chat (POST /v1/chat/completions)
This is the standard path for client-to-LLM requests with optional intent matching and routing.
Client
│
▼
[Envoy :10000 — ingress_traffic]
│ (simple passthrough)
▼
[Envoy :10001 — ingress_traffic_prompt]
│
├── prompt_gateway.wasm
│ 1. Parse ChatCompletions request
│ 2. Convert prompt_targets → tool definitions
│ 3. Dispatch to Arch-Function model at /function_calling
│ 4. If intent matched:
│ → Call developer API endpoint via :11000
│ → Augment prompt with API response context
│ 5. If no intent matched:
│ → Prepend system prompt, forward to LLM
│
├── llm_gateway.wasm
│ 1. Select LLM provider (from header hint or default)
│ 2. Enforce rate limits (token-based via tiktoken)
│ 3. Inject auth credentials (Bearer / x-api-key)
│ 4. Transform request format (OpenAI ↔ Anthropic ↔ Bedrock)
│ 5. Rewrite upstream path for target provider
│
▼
LLM Provider (OpenAI, Anthropic, Gemini, etc.)
│
▼
(Response flows back through llm_gateway for format translation)
│
▼
Client
Flow 2: Brightstaff LLM Routing (POST /v1/chat/completions via egress)
When requests reach Brightstaff (directly or via agent listeners), it performs intelligent model routing.
Client / Agent
│
▼
[Brightstaff :9091]
│
├── Resolve model aliases
├── Validate model exists in configured providers
├── Retrieve conversation state (if using Responses API)
│
├── Call Arch-Router model ──► [Envoy :12001]
│ (determines best model/provider for the request ──► LLM Provider
│ based on routing_preferences in config)
│
├── Forward actual request ──► [Envoy :12001]
│ (with x-arch-llm-provider-hint header) ──► LLM Provider
│
▼
[Stream response back with metrics, signal analysis, state capture]
│
▼
Client / Agent
Flow 3: Agent Orchestration (POST /agents/v1/chat/completions)
The agentic flow where Brightstaff selects and chains agents based on user intent.
Client
│
▼
[Envoy — Agent Listener :configurable]
│ (path rewrite: /agents/...)
▼
[Brightstaff :9091]
│
├── Identify listener from x-arch-agent-listener-name
├── Find configured agents for this listener
│
├── If multiple agents:
│ Call Plano-Orchestrator model ──► [Envoy :12001] ──► LLM
│ (selects which agents to run and in what order)
│
├── For each selected agent:
│ │
│ ├── Run filter chain (pre-processing)
│ │ └── [Envoy :11000] ──► Filter Service (MCP/HTTP)
│ │
│ ├── Invoke agent
│ │ └── [Envoy :11000] ──► Agent Service (MCP/HTTP)
│ │
│ ├── If intermediate agent:
│ │ Collect full response → feed as input to next agent
│ │
│ └── If final agent:
│ Stream response directly to client
│
▼
Client
Brightstaff Service
Brightstaff is a native Rust HTTP server (0.0.0.0:9091) built with Axum. It is the control plane brain of Plano — while Envoy handles the data plane (proxying, filtering), Brightstaff handles the intelligent decision-making.
Endpoints
| Method | Path | Handler | Purpose |
|---|---|---|---|
POST |
/v1/chat/completions |
llm_chat |
LLM passthrough with model routing |
POST |
/v1/messages |
llm_chat |
Anthropic Messages API compat |
POST |
/v1/responses |
llm_chat |
OpenAI Responses API with state |
POST |
/agents/v1/chat/completions |
agent_chat |
Agent orchestration pipeline |
POST |
/agents/v1/messages |
agent_chat |
Agent orchestration (Messages) |
POST |
/agents/v1/responses |
agent_chat |
Agent orchestration (Responses) |
POST |
/function_calling |
function_calling_chat_handler |
Arch-Function tool calling |
GET |
/v1/models |
list_models |
List configured LLM models |
Core Components
RouterService (LLM Routing)
Uses the Arch-Router model — a specialized LLM that determines which provider/model best matches a user's request based on routing_preferences defined in config. Constructs a system prompt describing available routes, sends the conversation, and parses a {"route": "route_name"} response.
OrchestratorService (Agent Selection)
Uses the Plano-Orchestrator model to determine which agent(s) should handle a request when multiple agents are available on a listener. Returns an ordered list of agents: {"route": ["agent1", "agent2"]}.
PipelineProcessor (Agent Execution)
Manages the sequential execution of agent filter chains and agent invocations:
- MCP agents: JSON-RPC 2.0 protocol over SSE transport (
initialize→notifications/initialized→tools/call) - HTTP agents: Direct POST with message array
- Routes through Envoy at
:11000usingx-arch-upstream-hostheader
Function Calling Handler
Specialized handler for the Arch-Function model:
- Converts OpenAI tool definitions into prompts
- Parses structured JSON responses (tool_calls, clarifications)
- Includes hallucination detection using entropy/varentropy/probability thresholds from logprobs
State Management
Manages conversation state for the OpenAI Responses API (v1/responses):
- Memory backend —
HashMapbehindArc<RwLock>for single-instance dev - PostgreSQL backend — Persistent storage with upsert semantics
ResponsesStateProcessorintercepts streaming responses to captureresponse_idand output items, storing them asynchronously for future conversation chaining viaprevious_response_id
Signal Analysis (Observability)
Analyzes conversation patterns for interaction quality:
- Frustration, repetition/looping, escalation requests, positive feedback, repair patterns
- Quality graded as Good / Fair / Poor / Severe
- Concerning signals flag spans with indicators for monitoring
Rust Crate Architecture
┌─────────────────────────────────────────────────────────────┐
│ brightstaff (binary) │
│ │
│ Native Rust HTTP server — routing, orchestration, state │
│ Depends on: hermesllm, common (non-WASM parts) │
└─────────────────────────────────────────────────────────────┘
┌──────────────────────┐ ┌──────────────────────┐
│ prompt_gateway │ │ llm_gateway │
│ (WASM) │ │ (WASM) │
│ │ │ │
│ Intent matching │ │ Provider routing │
│ Prompt guards │ │ Auth injection │
│ Function calling │ │ Rate limiting │
│ API orchestration │ │ Request/Response │
│ │ │ format translation │
├──────────────────────┤ ├───────────────────────┤
│ depends on: common │ │ depends on: common, │
│ │ │ hermesllm │
└──────────┬───────────┘ └──────────┬────────────┘
│ │
▼ ▼
┌──────────────────────────────────────────────────────────────┐
│ common (lib) │
│ │
│ Configuration types, LlmProviders, HTTP client trait, │
│ rate limiting (governor), tokenization (tiktoken), │
│ OpenAI API types, routing, metrics, tracing, constants │
│ Depends on: hermesllm │
└─────────────────────────────┬───────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────┐
│ hermesllm (lib) │
│ │
│ LLM protocol abstraction — cross-provider request/response │
│ translation (OpenAI ↔ Anthropic ↔ Bedrock ↔ Gemini) │
│ SSE stream parsing, provider model catalog, endpoint │
│ mapping. No proxy-wasm dependency (pure Rust). │
└──────────────────────────────────────────────────────────────┘
WASM Compilation
Both prompt_gateway and llm_gateway compile to cdylib targets for wasm32-wasip1 using the proxy-wasm SDK (v0.2.1). Envoy loads them via its V8 WASM runtime. Each filter implements RootContext (for config parsing and per-stream creation) and HttpContext (for per-request processing).
Deployment Architecture
All components run inside a single container managed by Supervisord:
┌─────────────────────────────────────────────────────────────┐
│ Docker Container │
│ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Supervisord │ │
│ │ │ │
│ │ ┌─────────────┐ ┌───────────────┐ ┌───────────┐ │ │
│ │ │ Brightstaff │ │ Envoy Proxy │ │ Log Tail │ │ │
│ │ │ (Rust) │ │ + WASM │ │ │ │ │
│ │ │ :9091 │ │ :10000-12001 │ │ │ │ │
│ │ └─────────────┘ └───────────────┘ └───────────┘ │ │
│ └─────────────────────────────────────────────────────┘ │
│ │
│ Startup sequence: │
│ 1. config_generator.py validates arch_config.yaml │
│ 2. Renders envoy.template.yaml → envoy.yaml (Jinja2) │
│ 3. Starts Brightstaff + Envoy in parallel │
│ │
└─────────────────────────────────────────────────────────────┘
Docker multi-stage build:
deps— Rust 1.93.0 withwasm32-wasip1target, dependency pre-compilationwasm-builder— Buildsprompt_gateway.wasm+llm_gateway.wasm(release)brightstaff-builder— Builds thebrightstaffnative binary (release)envoy— Pullsenvoyproxy/envoy:v1.37.0arch(final) — Python 3.13.6-slim base with Envoy binary, WASM plugins, Brightstaff binary, and theplanoaiCLI
Configuration Pipeline
User-facing configuration flows through a generation pipeline before reaching Envoy and Brightstaff:
arch_config.yaml (user-authored)
│
▼
config_generator.py (Python CLI)
1. Validate against arch_config_schema.yaml (JSON Schema)
2. Normalize legacy formats (llm_providers → model_providers)
3. Parse agents, filters, endpoints → infer Envoy clusters
4. Parse model_providers → validate provider/model format
5. Auto-add internal models (arch-function, arch-router, plano-orchestrator)
6. Validate model aliases, routing preferences, prompt target endpoints
│
├──► envoy.yaml (rendered from envoy.template.yaml via Jinja2)
│ → consumed by Envoy
│
└──► arch_config_rendered.yaml
→ consumed by Brightstaff
→ injected into WASM filter configs
Key Config Sections
| Section | Consumed By | Purpose |
|---|---|---|
model_providers |
llm_gateway, Brightstaff | LLM provider definitions with models, auth, routing preferences |
prompt_targets |
prompt_gateway | Intent-to-API mappings with parameter schemas |
prompt_guards |
prompt_gateway | Input guardrails (jailbreak detection) |
endpoints |
prompt_gateway, Envoy | Named upstream API endpoint definitions |
agents |
Brightstaff, Envoy | Agent service definitions (id, URL, type) |
listeners |
Brightstaff, Envoy | Listener configs binding agents to ports |
ratelimits |
llm_gateway | Per-model rate limits with token-based quotas |
routing |
Brightstaff | LLM routing model/provider config |
model_aliases |
Brightstaff | Friendly name → provider/model mappings |
state_storage |
Brightstaff | Conversation state backend (memory / postgres) |
tracing |
All components | OpenTelemetry config (sampling, OTLP endpoint) |
overrides |
prompt_gateway, Brightstaff | Tuning (intent threshold, agent orchestrator toggle) |
Supported LLM Providers
| Provider | Cluster | Auth Method |
|---|---|---|
| OpenAI | api.openai.com | Bearer token |
| Anthropic (Claude) | api.anthropic.com | x-api-key header |
| Google (Gemini) | generativelanguage.googleapis.com | API key in URL |
| Groq | api.groq.com | Bearer token |
| Mistral | api.mistral.ai | Bearer token |
| DeepSeek | api.deepseek.com | Bearer token |
| xAI | api.x.ai | Bearer token |
| Together AI | api.together.xyz | Bearer token |
| MoonshotAI | api.moonshot.ai | Bearer token |
| Zhipu | open.bigmodel.cn | Bearer token |
| Amazon Bedrock | Custom base_url | AWS Sig v4 |
| Azure OpenAI | Custom base_url | Bearer / API key |
| Ollama | Custom base_url | None |
| Katanemo (Arch) | archfc.katanemo.dev | Bearer token |
The hermesllm crate handles cross-provider request/response translation so clients can use a single API format (typically OpenAI-compatible) regardless of which upstream provider serves the request.