# Plano (ArchGW) — High-Level Architecture

## Overview

Plano is an AI-native gateway built on **Envoy Proxy**, extended with custom **WebAssembly (WASM) filters** and a native Rust service called **Brightstaff**. It acts as an intelligent intermediary between client applications, AI agents, and LLM providers — handling intent-based routing, prompt guardrails, function calling, agent orchestration, rate limiting, and multi-provider LLM translation.

```
┌────────────────────────────────────────────────────────────────────────────┐
│                               Plano Gateway                                │
│                                                                            │
│  ┌──────────────────────────────────────────────────────────────────────┐  │
│  │                           Envoy Proxy (L7)                           │  │
│  │                                                                      │  │
│  │   ┌────────────────────┐       ┌────────────────────┐                │  │
│  │   │   prompt_gateway   │──────▶│    llm_gateway     │                │  │
│  │   │       (WASM)       │       │       (WASM)       │                │  │
│  │   │                    │       │                    │                │  │
│  │   │ • Intent matching  │       │ • Provider routing │                │  │
│  │   │ • Guardrails       │       │ • Auth injection   │                │  │
│  │   │ • Function call    │       │ • Rate limiting    │                │  │
│  │   │ • Prompt targets   │       │ • API translation  │                │  │
│  │   └────────────────────┘       └─────────┬──────────┘                │  │
│  │                                          │                           │  │
│  └──────────────────────────────────────────┼───────────────────────────┘  │
│                                             │                              │
│  ┌──────────────────────────────────────────┼───────────────────────────┐  │
│  │                 Brightstaff (Rust HTTP Server :9091)                 │  │
│  │                                                                      │  │
│  │  • LLM request routing (Arch-Router model)                           │  │
│  │  • Agent orchestration (Plano-Orchestrator model)                    │  │
│  │  • Conversation state management (memory / PostgreSQL)               │  │
│  │  • Function calling handler (Arch-Function model)                    │  │
│  │  • Observability & signal analysis                                   │  │
│  │                                                                      │  │
│  └──────────────────────────────────────────────────────────────────────┘  │
│                                                                            │
└────────────────────────────────────────────────────────────────────────────┘
        │                    │                      │
        ▼                    ▼                      ▼
  ┌────────────┐      ┌──────────────┐      ┌───────────────┐
  │   Agents   │      │  Developer   │      │ LLM Providers │
  │ (MCP/HTTP) │      │     APIs     │      │ (OpenAI, etc) │
  └────────────┘      └──────────────┘      └───────────────┘
```

---

## The Role of Envoy

Envoy is the **data plane** of Plano. All client traffic — both inbound prompts and outbound LLM calls — flows through Envoy. It provides:

- **L7 HTTP routing** based on paths and custom headers
- **WASM filter execution** for inline request/response transformation
- **Connection pooling and TLS** to upstream LLM providers
- **Retry policies** for resilience
- **Compression/decompression** for LLM streaming responses

### Envoy Listeners

Envoy defines **six listener types**, each serving a distinct role in the request flow:

| Listener | Port | Direction | Purpose |
|---|---|---|---|
| `ingress_traffic` | 10000 (configurable) | Inbound | Client-facing entry point. Forwards all traffic to the prompt gateway listener. |
| `ingress_traffic_prompt` | 10001 | Inbound | **Core processing listener.** Runs both WASM filters (`prompt_gateway` → `llm_gateway`). Routes to LLM providers by `x-arch-llm-provider` header. |
| `outbound_api_traffic` | 11000 | Internal | Routes to upstream developer APIs and agents using `x-arch-upstream` header. No WASM filters. |
| Agent listeners | Per-config | Inbound | One per agent listener in config. Routes to Brightstaff with `/agents/` path prefix. |
| `egress_traffic` | 12000 (configurable) | Outbound | LLM gateway entry for agents/services reaching LLMs. Routes to Brightstaff for routing decisions. |
| `egress_traffic_llm` | 12001 | Outbound | **Final outbound LLM listener.** Runs `llm_gateway.wasm` for auth injection, provider translation, and rate limiting before reaching the actual LLM provider. |

### Envoy Clusters

Envoy manages connections to all upstream services:

**LLM Provider Clusters** — Pre-configured TLS clusters for: OpenAI, Anthropic (Claude), Groq, Mistral, DeepSeek, Gemini, xAI, MoonshotAI, Zhipu, Together AI, and Katanemo's hosted Arch models. Custom-URL providers (e.g., Azure OpenAI, Ollama) are dynamically added from config.
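
To make the custom-URL case concrete, here is a minimal Python sketch of how a provider's `base_url` could be turned into the host, port, and TLS settings a dynamically added cluster needs. The function name `cluster_endpoint` and the returned field names are hypothetical and chosen for illustration only; the actual config generator may derive these values differently.

```python
from urllib.parse import urlsplit


def cluster_endpoint(base_url: str) -> dict:
    """Derive cluster address details from a provider base_url (illustrative only)."""
    parts = urlsplit(base_url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        raise ValueError(f"unsupported base_url: {base_url!r}")
    use_tls = parts.scheme == "https"
    # Fall back to the scheme's default port when the URL does not name one.
    port = parts.port or (443 if use_tls else 80)
    return {
        "host": parts.hostname,
        "port": port,
        "tls": use_tls,
        # Any path on the base_url would have to be prepended to upstream requests.
        "path_prefix": parts.path.rstrip("/"),
    }
```

Under these assumptions, a local Ollama-style URL such as `http://localhost:11434` yields a plain-HTTP endpoint on port 11434, while an `https://` URL without an explicit port yields a TLS endpoint on port 443.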

**Internal Clusters:**

| Cluster | Target | Purpose |
|---|---|---|
| `bright_staff` | localhost:9091 | The Brightstaff Rust service |
| `arch_prompt_gateway_listener` | localhost:10001 | Internal forwarding from ingress |
| `arch_listener_llm` | localhost:12001 | Internal forwarding for LLM egress |
| `arch_internal` | localhost:11000 | Outbound API router |

**Dynamic Clusters** — Generated from `endpoints` and `agents` config sections (developer APIs, agent services).

### Custom Headers Used for Routing

| Header | Set By | Used By | Purpose |
|---|---|---|---|
| `x-arch-llm-provider` | WASM filters | Envoy routes | Selects the LLM provider cluster |
| `x-arch-llm-provider-hint` | Brightstaff | llm_gateway | Hints which provider/model to use |
| `x-arch-upstream` / `x-arch-upstream-host` | WASM filters / Brightstaff | Envoy routes | Targets a specific agent or API endpoint |
| `x-arch-is-streaming` | Brightstaff | llm_gateway | Indicates streaming mode |
| `x-arch-state` | prompt_gateway | prompt_gateway | Carries multi-turn conversation state |
| `x-arch-tool-call` | prompt_gateway | prompt_gateway | Carries tool call metadata |
| `x-arch-api-response` | prompt_gateway | prompt_gateway | Carries developer API response data |
| `x-arch-agent-listener-name` | Envoy | Brightstaff | Identifies which agent listener a request arrived on |

---

## Request Flows

### Flow 1: Direct LLM Chat (`POST /v1/chat/completions`)

This is the standard path for client-to-LLM requests with optional intent matching and routing.

```
Client
  │
  ▼
[Envoy :10000 — ingress_traffic]
  │  (simple passthrough)
  ▼
[Envoy :10001 — ingress_traffic_prompt]
  │
  ├── prompt_gateway.wasm
  │     1. Parse ChatCompletions request
  │     2. Convert prompt_targets → tool definitions
  │     3. Dispatch to Arch-Function model at /function_calling
  │     4. If intent matched:
  │          → Call developer API endpoint via :11000
  │          → Augment prompt with API response context
  │     5. If no intent matched:
  │          → Prepend system prompt, forward to LLM
  │
  ├── llm_gateway.wasm
  │     1. Select LLM provider (from header hint or default)
  │     2. Enforce rate limits (token-based via tiktoken)
  │     3. Inject auth credentials (Bearer / x-api-key)
  │     4. Transform request format (OpenAI ↔ Anthropic ↔ Bedrock)
  │     5. Rewrite upstream path for target provider
  │
  ▼
LLM Provider (OpenAI, Anthropic, Gemini, etc.)
  │
  ▼
(Response flows back through llm_gateway for format translation)
  │
  ▼
Client
```

### Flow 2: Brightstaff LLM Routing (`POST /v1/chat/completions` via egress)

When requests reach Brightstaff (directly or via agent listeners), it performs intelligent model routing.

```
Client / Agent
  │
  ▼
[Brightstaff :9091]
  │
  ├── Resolve model aliases
  ├── Validate model exists in configured providers
  ├── Retrieve conversation state (if using Responses API)
  │
  ├── Call Arch-Router model ──► [Envoy :12001]
  │     (determines best model/provider for the request ──► LLM Provider
  │      based on routing_preferences in config)
  │
  ├── Forward actual request ──► [Envoy :12001]
  │     (with x-arch-llm-provider-hint header) ──► LLM Provider
  │
  ▼
[Stream response back with metrics, signal analysis, state capture]
  │
  ▼
Client / Agent
```

### Flow 3: Agent Orchestration (`POST /agents/v1/chat/completions`)

The agentic flow where Brightstaff selects and chains agents based on user intent.

```
Client
  │
  ▼
[Envoy — Agent Listener :configurable]
  │  (path rewrite: /agents/...)
  ▼
[Brightstaff :9091]
  │
  ├── Identify listener from x-arch-agent-listener-name
  ├── Find configured agents for this listener
  │
  ├── If multiple agents:
  │     Call Plano-Orchestrator model ──► [Envoy :12001] ──► LLM
  │     (selects which agents to run and in what order)
  │
  ├── For each selected agent:
  │     │
  │     ├── Run filter chain (pre-processing)
  │     │     └── [Envoy :11000] ──► Filter Service (MCP/HTTP)
  │     │
  │     ├── Invoke agent
  │     │     └── [Envoy :11000] ──► Agent Service (MCP/HTTP)
  │     │
  │     ├── If intermediate agent:
  │     │     Collect full response → feed as input to next agent
  │     │
  │     └── If final agent:
  │           Stream response directly to client
  │
  ▼
Client
```

---

## Brightstaff Service

Brightstaff is a native Rust HTTP server (`0.0.0.0:9091`) built with Axum. It is the **control plane brain** of Plano — while Envoy handles the data plane (proxying, filtering), Brightstaff handles the intelligent decision-making.

### Endpoints

| Method | Path | Handler | Purpose |
|---|---|---|---|
| `POST` | `/v1/chat/completions` | `llm_chat` | LLM passthrough with model routing |
| `POST` | `/v1/messages` | `llm_chat` | Anthropic Messages API compat |
| `POST` | `/v1/responses` | `llm_chat` | OpenAI Responses API with state |
| `POST` | `/agents/v1/chat/completions` | `agent_chat` | Agent orchestration pipeline |
| `POST` | `/agents/v1/messages` | `agent_chat` | Agent orchestration (Messages) |
| `POST` | `/agents/v1/responses` | `agent_chat` | Agent orchestration (Responses) |
| `POST` | `/function_calling` | `function_calling_chat_handler` | Arch-Function tool calling |
| `GET` | `/v1/models` | `list_models` | List configured LLM models |

### Core Components

#### RouterService (LLM Routing)

Uses the **Arch-Router** model — a specialized LLM that determines which provider/model best matches a user's request based on `routing_preferences` defined in config. Constructs a system prompt describing available routes, sends the conversation, and parses a `{"route": "route_name"}` response.
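
The prompt-then-parse loop described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Brightstaff's Rust implementation: the helper names `build_router_prompt` and `parse_route` are hypothetical, and the prompt wording is invented, since the real Arch-Router prompt is not shown here. Only the `{"route": "route_name"}` response shape comes from the description above.

```python
import json
from typing import Dict, Optional


def build_router_prompt(routes: Dict[str, str]) -> str:
    """Render a system prompt listing each route name with its description."""
    lines = [f"- {name}: {description}" for name, description in routes.items()]
    return (
        "Select the route that best matches the conversation.\n"
        "Available routes:\n"
        + "\n".join(lines)
        + '\nReply with JSON of the form {"route": "route_name"}.'
    )


def parse_route(raw: str, routes: Dict[str, str]) -> Optional[str]:
    """Return the selected route name, or None if the reply is unusable."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return None
    route = payload.get("route") if isinstance(payload, dict) else None
    # Reject anything that is not a string naming a configured route.
    if isinstance(route, str) and route in routes:
        return route
    return None
```

Returning `None` for malformed JSON or an unknown route name leaves the caller free to fall back to a default model, which is one plausible way to handle a router reply that cannot be trusted.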

#### OrchestratorService (Agent Selection)

Uses the **Plano-Orchestrator** model to determine which agent(s) should handle a request when multiple agents are available on a listener. Returns an ordered list of agents: `{"route": ["agent1", "agent2"]}`.

#### PipelineProcessor (Agent Execution)

Manages the sequential execution of agent filter chains and agent invocations:

- **MCP agents**: JSON-RPC 2.0 protocol over SSE transport (`initialize` → `notifications/initialized` → `tools/call`)
- **HTTP agents**: Direct POST with message array
- Routes through Envoy at `:11000` using `x-arch-upstream-host` header

#### Function Calling Handler

Specialized handler for the **Arch-Function** model:

- Converts OpenAI tool definitions into prompts
- Parses structured JSON responses (tool_calls, clarifications)
- Includes **hallucination detection** using entropy/varentropy/probability thresholds from logprobs

#### State Management

Manages conversation state for the OpenAI Responses API (`/v1/responses`):

- **Memory backend** — `HashMap` behind `Arc` for single-instance dev
- **PostgreSQL backend** — Persistent storage with upsert semantics
- `ResponsesStateProcessor` intercepts streaming responses to capture `response_id` and output items, storing them asynchronously for future conversation chaining via `previous_response_id`

#### Signal Analysis (Observability)

Analyzes conversation patterns for interaction quality:

- Frustration, repetition/looping, escalation requests, positive feedback, repair patterns
- Quality graded as Good / Fair / Poor / Severe
- Concerning signals flag spans with indicators for monitoring

---

## Rust Crate Architecture

```
┌──────────────────────────────────────────────────────────────┐
│                     brightstaff (binary)                     │
│                                                              │
│  Native Rust HTTP server — routing, orchestration, state     │
│  Depends on: hermesllm, common (non-WASM parts)              │
└──────────────────────────────────────────────────────────────┘

┌────────────────────────┐            ┌────────────────────────┐
│     prompt_gateway     │            │      llm_gateway       │
│         (WASM)         │            │         (WASM)         │
│                        │            │                        │
│  Intent matching       │            │  Provider routing      │
│  Prompt guards         │            │  Auth injection        │
│  Function calling      │            │  Rate limiting         │
│  API orchestration     │            │  Request/Response      │
│                        │            │  format translation    │
├────────────────────────┤            ├────────────────────────┤
│  depends on: common    │            │  depends on: common,   │
│                        │            │  hermesllm             │
└───────────┬────────────┘            └───────────┬────────────┘
            │                                     │
            ▼                                     ▼
┌──────────────────────────────────────────────────────────────┐
│                         common (lib)                         │
│                                                              │
│  Configuration types, LlmProviders, HTTP client trait,       │
│  rate limiting (governor), tokenization (tiktoken),          │
│  OpenAI API types, routing, metrics, tracing, constants      │
│  Depends on: hermesllm                                       │
└──────────────────────────────┬───────────────────────────────┘
                               │
                               ▼
┌──────────────────────────────────────────────────────────────┐
│                       hermesllm (lib)                        │
│                                                              │
│  LLM protocol abstraction — cross-provider request/response  │
│  translation (OpenAI ↔ Anthropic ↔ Bedrock ↔ Gemini)         │
│  SSE stream parsing, provider model catalog, endpoint        │
│  mapping. No proxy-wasm dependency (pure Rust).              │
└──────────────────────────────────────────────────────────────┘
```

### WASM Compilation

Both `prompt_gateway` and `llm_gateway` compile to `cdylib` targets for `wasm32-wasip1` using the `proxy-wasm` SDK (v0.2.1). Envoy loads them via its V8 WASM runtime. Each filter implements `RootContext` (for config parsing and per-stream creation) and `HttpContext` (for per-request processing).

---

## Deployment Architecture

All components run inside a single container managed by **Supervisord**:

```
┌──────────────────────────────────────────────────────────────┐
│                       Docker Container                       │
│                                                              │
│  ┌─────────────────────────────────────────────────────┐     │
│  │                     Supervisord                     │     │
│  │                                                     │     │
│  │  ┌─────────────┐  ┌───────────────┐  ┌───────────┐  │     │
│  │  │ Brightstaff │  │  Envoy Proxy  │  │ Log Tail  │  │     │
│  │  │   (Rust)    │  │    + WASM     │  │           │  │     │
│  │  │    :9091    │  │ :10000-12001  │  │           │  │     │
│  │  └─────────────┘  └───────────────┘  └───────────┘  │     │
│  └─────────────────────────────────────────────────────┘     │
│                                                              │
│  Startup sequence:                                           │
│  1. config_generator.py validates arch_config.yaml           │
│  2. Renders envoy.template.yaml → envoy.yaml (Jinja2)        │
│  3. Starts Brightstaff + Envoy in parallel                   │
│                                                              │
└──────────────────────────────────────────────────────────────┘
```

**Docker multi-stage build:**

1. `deps` — Rust 1.93.0 with `wasm32-wasip1` target, dependency pre-compilation
2. `wasm-builder` — Builds `prompt_gateway.wasm` + `llm_gateway.wasm` (release)
3. `brightstaff-builder` — Builds the `brightstaff` native binary (release)
4. `envoy` — Pulls `envoyproxy/envoy:v1.37.0`
5. `arch` (final) — Python 3.13.6-slim base with Envoy binary, WASM plugins, Brightstaff binary, and the `planoai` CLI

---

## Configuration Pipeline

User-facing configuration flows through a generation pipeline before reaching Envoy and Brightstaff:

```
arch_config.yaml (user-authored)
        │
        ▼
config_generator.py (Python CLI)
    1. Validate against arch_config_schema.yaml (JSON Schema)
    2. Normalize legacy formats (llm_providers → model_providers)
    3. Parse agents, filters, endpoints → infer Envoy clusters
    4. Parse model_providers → validate provider/model format
    5. Auto-add internal models (arch-function, arch-router, plano-orchestrator)
    6. Validate model aliases, routing preferences, prompt target endpoints
        │
        ├──► envoy.yaml (rendered from envoy.template.yaml via Jinja2)
        │       → consumed by Envoy
        │
        └──► arch_config_rendered.yaml
                → consumed by Brightstaff
                → injected into WASM filter configs
```

### Key Config Sections

| Section | Consumed By | Purpose |
|---|---|---|
| `model_providers` | llm_gateway, Brightstaff | LLM provider definitions with models, auth, routing preferences |
| `prompt_targets` | prompt_gateway | Intent-to-API mappings with parameter schemas |
| `prompt_guards` | prompt_gateway | Input guardrails (jailbreak detection) |
| `endpoints` | prompt_gateway, Envoy | Named upstream API endpoint definitions |
| `agents` | Brightstaff, Envoy | Agent service definitions (id, URL, type) |
| `listeners` | Brightstaff, Envoy | Listener configs binding agents to ports |
| `ratelimits` | llm_gateway | Per-model rate limits with token-based quotas |
| `routing` | Brightstaff | LLM routing model/provider config |
| `model_aliases` | Brightstaff | Friendly name → provider/model mappings |
| `state_storage` | Brightstaff | Conversation state backend (memory / postgres) |
| `tracing` | All components | OpenTelemetry config (sampling, OTLP endpoint) |
| `overrides` | prompt_gateway, Brightstaff | Tuning (intent threshold, agent orchestrator toggle) |

---

## Supported LLM Providers

| Provider | Cluster | Auth Method |
|---|---|---|
| OpenAI | api.openai.com | Bearer token |
| Anthropic (Claude) | api.anthropic.com | x-api-key header |
| Google (Gemini) | generativelanguage.googleapis.com | API key in URL |
| Groq | api.groq.com | Bearer token |
| Mistral | api.mistral.ai | Bearer token |
| DeepSeek | api.deepseek.com | Bearer token |
| xAI | api.x.ai | Bearer token |
| Together AI | api.together.xyz | Bearer token |
| MoonshotAI | api.moonshot.ai | Bearer token |
| Zhipu | open.bigmodel.cn | Bearer token |
| Amazon Bedrock | Custom base_url | AWS Sig v4 |
| Azure OpenAI | Custom base_url | Bearer / API key |
| Ollama | Custom base_url | None |
| Katanemo (Arch) | archfc.katanemo.dev | Bearer token |

The `hermesllm` crate handles **cross-provider request/response translation** so clients can use a single API format (typically OpenAI-compatible) regardless of which upstream provider serves the request.
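
As a rough illustration of what cross-provider translation involves, the Python sketch below converts an OpenAI-style chat request into the shape of an Anthropic Messages request. It is not the `hermesllm` code: the function name `openai_to_anthropic` and the `default_max_tokens` parameter are hypothetical, and the sketch assumes plain string message content only. It relies on two properties of the Anthropic Messages API: the system prompt is a top-level `system` field rather than a message, and `max_tokens` is required.

```python
def openai_to_anthropic(request: dict, default_max_tokens: int = 1024) -> dict:
    """Translate an OpenAI-style chat request into an Anthropic-style one (sketch)."""
    system_parts = []
    messages = []
    for message in request.get("messages", []):
        if message["role"] == "system":
            # Anthropic takes the system prompt as a top-level field.
            system_parts.append(message["content"])
        else:
            messages.append({"role": message["role"], "content": message["content"]})

    translated = {
        "model": request["model"],
        "messages": messages,
        # max_tokens is mandatory for Anthropic, so supply a default if absent.
        "max_tokens": request.get("max_tokens", default_max_tokens),
    }
    if system_parts:
        translated["system"] = "\n\n".join(system_parts)
    if "stream" in request:
        translated["stream"] = request["stream"]
    return translated
```

A full translator would also need to handle tool calls, multi-part content, and the reverse mapping of responses and streaming events, all of which this sketch leaves out.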