plano/architecture.md
2026-02-09 23:34:18 -08:00

23 KiB

Plano (ArchGW) — High-Level Architecture

Overview

Plano is an AI-native gateway built on Envoy Proxy, extended with custom WebAssembly (WASM) filters and a native Rust service called Brightstaff. It acts as an intelligent intermediary between client applications, AI agents, and LLM providers — handling intent-based routing, prompt guardrails, function calling, agent orchestration, rate limiting, and multi-provider LLM translation.

┌─────────────────────────────────────────────────────────────────────────────┐
│                              Plano Gateway                                  │
│                                                                             │
│   ┌──────────────────────────────────────────────────────────────────────┐  │
│   │                         Envoy Proxy (L7)                             │  │
│   │                                                                      │  │
│   │   ┌──────────────────┐       ┌──────────────────┐                    │  │
│   │   │  prompt_gateway  │──────▶│   llm_gateway     │                   │  │
│   │   │    (WASM)        │       │     (WASM)        │                   │  │
│   │   │                  │       │                   │                   │  │
│   │   │ • Intent matching│       │ • Provider routing│                   │  │
│   │   │ • Guardrails     │       │ • Auth injection  │                   │  │
│   │   │ • Function call  │       │ • Rate limiting   │                   │  │
│   │   │ • Prompt targets │       │ • API translation │                   │  │
│   │   └──────────────────┘       └────────┬─────────┘                   │  │
│   │                                       │                              │  │
│   └───────────────────────────────────────┼──────────────────────────────┘  │
│                                           │                                 │
│   ┌───────────────────────────────────────┼──────────────────────────────┐  │
│   │                    Brightstaff (Rust HTTP Server :9091)               │  │
│   │                                                                      │  │
│   │   • LLM request routing (Arch-Router model)                          │  │
│   │   • Agent orchestration (Plano-Orchestrator model)                   │  │
│   │   • Conversation state management (memory / PostgreSQL)              │  │
│   │   • Function calling handler (Arch-Function model)                   │  │
│   │   • Observability & signal analysis                                  │  │
│   └──────────────────────────────────────────────────────────────────────┘  │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
         │                    │                         │
         ▼                    ▼                         ▼
   ┌──────────┐      ┌──────────────┐          ┌──────────────┐
   │  Agents  │      │ Developer    │          │ LLM Providers│
   │ (MCP/HTTP)│     │   APIs       │          │ (OpenAI, etc)│
   └──────────┘      └──────────────┘          └──────────────┘

The Role of Envoy

Envoy is the data plane of Plano. All client traffic — both inbound prompts and outbound LLM calls — flows through Envoy. It provides:

  • L7 HTTP routing based on paths and custom headers
  • WASM filter execution for inline request/response transformation
  • Connection pooling and TLS to upstream LLM providers
  • Retry policies for resilience
  • Compression/decompression for LLM streaming responses

Envoy Listeners

Envoy defines six listener types, each serving a distinct role in the request flow:

Listener Port Direction Purpose
ingress_traffic 10000 (configurable) Inbound Client-facing entry point. Forwards all traffic to the prompt gateway listener.
ingress_traffic_prompt 10001 Inbound Core processing listener. Runs both WASM filters (prompt_gatewayllm_gateway). Routes to LLM providers by x-arch-llm-provider header.
outbound_api_traffic 11000 Internal Routes to upstream developer APIs and agents using x-arch-upstream header. No WASM filters.
Agent listeners Per-config Inbound One per agent listener in config. Routes to Brightstaff with /agents/ path prefix.
egress_traffic 12000 (configurable) Outbound LLM gateway entry for agents/services reaching LLMs. Routes to Brightstaff for routing decisions.
egress_traffic_llm 12001 Outbound Final outbound LLM listener. Runs llm_gateway.wasm for auth injection, provider translation, and rate limiting before reaching the actual LLM provider.

Envoy Clusters

Envoy manages connections to all upstream services:

LLM Provider Clusters — Pre-configured TLS clusters for: OpenAI, Anthropic (Claude), Groq, Mistral, DeepSeek, Gemini, xAI, MoonshotAI, Zhipu, Together AI, and Katanemo's hosted Arch models. Custom-URL providers (e.g., Azure OpenAI, Ollama) are dynamically added from config.

Internal Clusters:

Cluster Target Purpose
bright_staff localhost:9091 The Brightstaff Rust service
arch_prompt_gateway_listener localhost:10001 Internal forwarding from ingress
arch_listener_llm localhost:12001 Internal forwarding for LLM egress
arch_internal localhost:11000 Outbound API router

Dynamic Clusters — Generated from endpoints and agents config sections (developer APIs, agent services).

Custom Headers Used for Routing

Header Set By Used By Purpose
x-arch-llm-provider WASM filters Envoy routes Selects the LLM provider cluster
x-arch-llm-provider-hint Brightstaff llm_gateway Hints which provider/model to use
x-arch-upstream / x-arch-upstream-host WASM filters / Brightstaff Envoy routes Targets a specific agent or API endpoint
x-arch-is-streaming Brightstaff llm_gateway Indicates streaming mode
x-arch-state prompt_gateway prompt_gateway Carries multi-turn conversation state
x-arch-tool-call prompt_gateway prompt_gateway Carries tool call metadata
x-arch-api-response prompt_gateway prompt_gateway Carries developer API response data
x-arch-agent-listener-name Envoy Brightstaff Identifies which agent listener a request arrived on

Request Flows

Flow 1: Direct LLM Chat (POST /v1/chat/completions)

This is the standard path for client-to-LLM requests with optional intent matching and routing.

Client
  │
  ▼
[Envoy :10000 — ingress_traffic]
  │  (simple passthrough)
  ▼
[Envoy :10001 — ingress_traffic_prompt]
  │
  ├── prompt_gateway.wasm
  │     1. Parse ChatCompletions request
  │     2. Convert prompt_targets → tool definitions
  │     3. Dispatch to Arch-Function model at /function_calling
  │     4. If intent matched:
  │         → Call developer API endpoint via :11000
  │         → Augment prompt with API response context
  │     5. If no intent matched:
  │         → Prepend system prompt, forward to LLM
  │
  ├── llm_gateway.wasm
  │     1. Select LLM provider (from header hint or default)
  │     2. Enforce rate limits (token-based via tiktoken)
  │     3. Inject auth credentials (Bearer / x-api-key)
  │     4. Transform request format (OpenAI ↔ Anthropic ↔ Bedrock)
  │     5. Rewrite upstream path for target provider
  │
  ▼
LLM Provider (OpenAI, Anthropic, Gemini, etc.)
  │
  ▼
(Response flows back through llm_gateway for format translation)
  │
  ▼
Client

Flow 2: Brightstaff LLM Routing (POST /v1/chat/completions via egress)

When requests reach Brightstaff (directly or via agent listeners), it performs intelligent model routing.

Client / Agent
  │
  ▼
[Brightstaff :9091]
  │
  ├── Resolve model aliases
  ├── Validate model exists in configured providers
  ├── Retrieve conversation state (if using Responses API)
  │
  ├── Call Arch-Router model ──► [Envoy :12001]
  │     (determines best model/provider for the request    ──► LLM Provider
  │      based on routing_preferences in config)
  │
  ├── Forward actual request ──► [Envoy :12001]
  │     (with x-arch-llm-provider-hint header)             ──► LLM Provider
  │
  ▼
[Stream response back with metrics, signal analysis, state capture]
  │
  ▼
Client / Agent

Flow 3: Agent Orchestration (POST /agents/v1/chat/completions)

The agentic flow where Brightstaff selects and chains agents based on user intent.

Client
  │
  ▼
[Envoy — Agent Listener :configurable]
  │  (path rewrite: /agents/...)
  ▼
[Brightstaff :9091]
  │
  ├── Identify listener from x-arch-agent-listener-name
  ├── Find configured agents for this listener
  │
  ├── If multiple agents:
  │     Call Plano-Orchestrator model ──► [Envoy :12001] ──► LLM
  │     (selects which agents to run and in what order)
  │
  ├── For each selected agent:
  │     │
  │     ├── Run filter chain (pre-processing)
  │     │     └── [Envoy :11000] ──► Filter Service (MCP/HTTP)
  │     │
  │     ├── Invoke agent
  │     │     └── [Envoy :11000] ──► Agent Service (MCP/HTTP)
  │     │
  │     ├── If intermediate agent:
  │     │     Collect full response → feed as input to next agent
  │     │
  │     └── If final agent:
  │           Stream response directly to client
  │
  ▼
Client

Brightstaff Service

Brightstaff is a native Rust HTTP server (0.0.0.0:9091) built with Axum. It is the control plane brain of Plano — while Envoy handles the data plane (proxying, filtering), Brightstaff handles the intelligent decision-making.

Endpoints

Method Path Handler Purpose
POST /v1/chat/completions llm_chat LLM passthrough with model routing
POST /v1/messages llm_chat Anthropic Messages API compat
POST /v1/responses llm_chat OpenAI Responses API with state
POST /agents/v1/chat/completions agent_chat Agent orchestration pipeline
POST /agents/v1/messages agent_chat Agent orchestration (Messages)
POST /agents/v1/responses agent_chat Agent orchestration (Responses)
POST /function_calling function_calling_chat_handler Arch-Function tool calling
GET /v1/models list_models List configured LLM models

Core Components

RouterService (LLM Routing)

Uses the Arch-Router model — a specialized LLM that determines which provider/model best matches a user's request based on routing_preferences defined in config. Constructs a system prompt describing available routes, sends the conversation, and parses a {"route": "route_name"} response.

OrchestratorService (Agent Selection)

Uses the Plano-Orchestrator model to determine which agent(s) should handle a request when multiple agents are available on a listener. Returns an ordered list of agents: {"route": ["agent1", "agent2"]}.

PipelineProcessor (Agent Execution)

Manages the sequential execution of agent filter chains and agent invocations:

  • MCP agents: JSON-RPC 2.0 protocol over SSE transport (initializenotifications/initializedtools/call)
  • HTTP agents: Direct POST with message array
  • Routes through Envoy at :11000 using x-arch-upstream-host header

Function Calling Handler

Specialized handler for the Arch-Function model:

  • Converts OpenAI tool definitions into prompts
  • Parses structured JSON responses (tool_calls, clarifications)
  • Includes hallucination detection using entropy/varentropy/probability thresholds from logprobs

State Management

Manages conversation state for the OpenAI Responses API (v1/responses):

  • Memory backendHashMap behind Arc<RwLock> for single-instance dev
  • PostgreSQL backend — Persistent storage with upsert semantics
  • ResponsesStateProcessor intercepts streaming responses to capture response_id and output items, storing them asynchronously for future conversation chaining via previous_response_id

Signal Analysis (Observability)

Analyzes conversation patterns for interaction quality:

  • Frustration, repetition/looping, escalation requests, positive feedback, repair patterns
  • Quality graded as Good / Fair / Poor / Severe
  • Concerning signals flag spans with indicators for monitoring

Rust Crate Architecture

┌─────────────────────────────────────────────────────────────┐
│                     brightstaff (binary)                     │
│                                                             │
│   Native Rust HTTP server — routing, orchestration, state   │
│   Depends on: hermesllm, common (non-WASM parts)           │
└─────────────────────────────────────────────────────────────┘

┌──────────────────────┐    ┌──────────────────────┐
│   prompt_gateway     │    │   llm_gateway         │
│      (WASM)          │    │      (WASM)           │
│                      │    │                       │
│  Intent matching     │    │  Provider routing     │
│  Prompt guards       │    │  Auth injection       │
│  Function calling    │    │  Rate limiting        │
│  API orchestration   │    │  Request/Response     │
│                      │    │  format translation   │
├──────────────────────┤    ├───────────────────────┤
│  depends on: common  │    │  depends on: common,  │
│                      │    │  hermesllm            │
└──────────┬───────────┘    └──────────┬────────────┘
           │                           │
           ▼                           ▼
┌──────────────────────────────────────────────────────────────┐
│                        common (lib)                          │
│                                                             │
│  Configuration types, LlmProviders, HTTP client trait,      │
│  rate limiting (governor), tokenization (tiktoken),         │
│  OpenAI API types, routing, metrics, tracing, constants     │
│  Depends on: hermesllm                                      │
└─────────────────────────────┬───────────────────────────────┘
                              │
                              ▼
┌──────────────────────────────────────────────────────────────┐
│                       hermesllm (lib)                        │
│                                                             │
│  LLM protocol abstraction — cross-provider request/response │
│  translation (OpenAI ↔ Anthropic ↔ Bedrock ↔ Gemini)       │
│  SSE stream parsing, provider model catalog, endpoint       │
│  mapping. No proxy-wasm dependency (pure Rust).             │
└──────────────────────────────────────────────────────────────┘

WASM Compilation

Both prompt_gateway and llm_gateway compile to cdylib targets for wasm32-wasip1 using the proxy-wasm SDK (v0.2.1). Envoy loads them via its V8 WASM runtime. Each filter implements RootContext (for config parsing and per-stream creation) and HttpContext (for per-request processing).


Deployment Architecture

All components run inside a single container managed by Supervisord:

┌─────────────────────────────────────────────────────────────┐
│                     Docker Container                         │
│                                                             │
│  ┌─────────────────────────────────────────────────────┐    │
│  │                   Supervisord                        │    │
│  │                                                     │    │
│  │  ┌─────────────┐  ┌───────────────┐  ┌───────────┐ │    │
│  │  │ Brightstaff  │  │  Envoy Proxy  │  │  Log Tail │ │    │
│  │  │  (Rust)      │  │  + WASM       │  │           │ │    │
│  │  │  :9091       │  │  :10000-12001 │  │           │ │    │
│  │  └─────────────┘  └───────────────┘  └───────────┘ │    │
│  └─────────────────────────────────────────────────────┘    │
│                                                             │
│  Startup sequence:                                          │
│   1. config_generator.py validates arch_config.yaml         │
│   2. Renders envoy.template.yaml → envoy.yaml (Jinja2)     │
│   3. Starts Brightstaff + Envoy in parallel                 │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Docker multi-stage build:

  1. deps — Rust 1.93.0 with wasm32-wasip1 target, dependency pre-compilation
  2. wasm-builder — Builds prompt_gateway.wasm + llm_gateway.wasm (release)
  3. brightstaff-builder — Builds the brightstaff native binary (release)
  4. envoy — Pulls envoyproxy/envoy:v1.37.0
  5. arch (final) — Python 3.13.6-slim base with Envoy binary, WASM plugins, Brightstaff binary, and the planoai CLI

Configuration Pipeline

User-facing configuration flows through a generation pipeline before reaching Envoy and Brightstaff:

arch_config.yaml (user-authored)
        │
        ▼
config_generator.py (Python CLI)
  1. Validate against arch_config_schema.yaml (JSON Schema)
  2. Normalize legacy formats (llm_providers → model_providers)
  3. Parse agents, filters, endpoints → infer Envoy clusters
  4. Parse model_providers → validate provider/model format
  5. Auto-add internal models (arch-function, arch-router, plano-orchestrator)
  6. Validate model aliases, routing preferences, prompt target endpoints
        │
        ├──► envoy.yaml (rendered from envoy.template.yaml via Jinja2)
        │      → consumed by Envoy
        │
        └──► arch_config_rendered.yaml
               → consumed by Brightstaff
               → injected into WASM filter configs

Key Config Sections

Section Consumed By Purpose
model_providers llm_gateway, Brightstaff LLM provider definitions with models, auth, routing preferences
prompt_targets prompt_gateway Intent-to-API mappings with parameter schemas
prompt_guards prompt_gateway Input guardrails (jailbreak detection)
endpoints prompt_gateway, Envoy Named upstream API endpoint definitions
agents Brightstaff, Envoy Agent service definitions (id, URL, type)
listeners Brightstaff, Envoy Listener configs binding agents to ports
ratelimits llm_gateway Per-model rate limits with token-based quotas
routing Brightstaff LLM routing model/provider config
model_aliases Brightstaff Friendly name → provider/model mappings
state_storage Brightstaff Conversation state backend (memory / postgres)
tracing All components OpenTelemetry config (sampling, OTLP endpoint)
overrides prompt_gateway, Brightstaff Tuning (intent threshold, agent orchestrator toggle)

Supported LLM Providers

Provider Cluster Auth Method
OpenAI api.openai.com Bearer token
Anthropic (Claude) api.anthropic.com x-api-key header
Google (Gemini) generativelanguage.googleapis.com API key in URL
Groq api.groq.com Bearer token
Mistral api.mistral.ai Bearer token
DeepSeek api.deepseek.com Bearer token
xAI api.x.ai Bearer token
Together AI api.together.xyz Bearer token
MoonshotAI api.moonshot.ai Bearer token
Zhipu open.bigmodel.cn Bearer token
Amazon Bedrock Custom base_url AWS Sig v4
Azure OpenAI Custom base_url Bearer / API key
Ollama Custom base_url None
Katanemo (Arch) archfc.katanemo.dev Bearer token

The hermesllm crate handles cross-provider request/response translation so clients can use a single API format (typically OpenAI-compatible) regardless of which upstream provider serves the request.