create md files for coding agents and for humans

2026-06-17 15:25:17 +02:00 · 2026-02-09 23:34:18 -08:00 · 3f8aa14e4c
commit 3f8aa14e4c
parent 46de89590b
12 changed files with 1407 additions and 0 deletions
--- a/.github/copilot-instructions.md
+++ b/.github/copilot-instructions.md
@@ -0,0 +1,130 @@
+# Copilot Instructions for Plano (ArchGW)
+
+## System Identity
+
+Plano is an AI-native gateway built on Envoy Proxy. It uses WASM filters for inline request processing and a native Rust service (Brightstaff) for orchestration. All components run in a single container managed by Supervisord.
+
+## Critical Architectural Rules
+
+### 1. Envoy Is the Data Plane — Never Bypass It
+
+All external traffic MUST flow through Envoy. Brightstaff NEVER makes direct outbound HTTP calls to LLM providers or developer APIs. It always routes through Envoy listeners:
+- LLM requests → `localhost:12001` (egress LLM listener with `llm_gateway.wasm`)
+- Agent/API requests → `localhost:11000` (outbound API listener)
+
+**Do not** add direct HTTP calls from Brightstaff to external services. Use Envoy's cluster routing via `x-arch-*` headers instead.
+
+### 2. WASM Crate Constraints
+
+`prompt_gateway` and `llm_gateway` compile to `wasm32-wasip1`. This means:
+- **No `tokio`, no `async/await`, no threads, no filesystem, no network sockets**
+- All I/O goes through the `proxy-wasm` SDK's `dispatch_http_call`, which is callback-based (not `async`/`await`)
+- No crates that rely on `std` networking or threads; for example, use `governor` with its `no_std` feature
+- The `crate-type` is `["cdylib"]` — these are shared libraries, not binaries
+- Test with `cargo test` (native), but build with `--target wasm32-wasip1`
+
+**Do not** add dependencies to WASM crates that require `std::net`, `tokio`, `reqwest`, `hyper`, or any async runtime.
+
+### 3. Crate Dependency Direction
+
+```
+prompt_gateway → common
+llm_gateway    → common, hermesllm
+common         → hermesllm
+brightstaff    → common (non-WASM parts), hermesllm
+hermesllm      → (standalone, no proxy-wasm)
+```
+
+- `hermesllm` must NEVER depend on `proxy-wasm` or `common` — it's a pure Rust library usable outside WASM
+- `common` provides the `proxy-wasm` abstractions — WASM crates use `common`, not raw `proxy-wasm` directly (except for the SDK traits)
+- `brightstaff` uses `hermesllm` directly for LLM types but does NOT use `common`'s WASM-specific code (like `proxy-wasm` Client trait)
+
+### 4. Header-Based Routing Protocol
+
+Envoy routes requests using custom headers. These are the canonical header names defined in `common/src/consts.rs`:
+
+| Header | Purpose | Why it must not change |
+|--------|---------|---------------|
+| `x-arch-llm-provider` | Envoy route matching for LLM provider cluster | Used in envoy.template.yaml |
+| `x-arch-llm-provider-hint` | Brightstaff → llm_gateway provider selection | Both sides must agree |
+| `x-arch-upstream` | Targets a specific agent/API cluster in Envoy | Used in envoy.template.yaml |
+| `x-arch-streaming-request` | Signals streaming mode | llm_gateway reads this |
+| `x-arch-state` | Multi-turn conversation state in prompt_gateway | Serialized JSON |
+| `x-arch-tool-call-message` | Tool call metadata | prompt_gateway internal |
+| `x-arch-api-response-message` | Developer API response | prompt_gateway internal |
+| `x-arch-agent-listener-name` | Identifies agent listener | Set by Envoy, read by Brightstaff |
+| `x-arch-llm-route` | LLM route decision result | Brightstaff ↔ llm_gateway |
+
+Changing header names requires updating: `consts.rs`, `envoy.template.yaml`, and all consumers.
+
+### 5. Build System
+
+```bash
+# WASM filters — must use wasm32-wasip1 target
+cargo build --release --target wasm32-wasip1 -p prompt_gateway -p llm_gateway
+
+# Brightstaff — native binary
+cargo build --release -p brightstaff
+```
+
+The workspace uses Rust edition 2021 and resolver "2". The workspace root is `crates/Cargo.toml`.
+
+### 6. Configuration Flow
+
+User config (`arch_config.yaml`) is validated and rendered by `cli/planoai/config_generator.py`:
+- Schema: `config/arch_config_schema.yaml`
+- Template: `config/envoy.template.yaml` (Jinja2)
+- Output: `envoy.yaml` (for Envoy) + `arch_config_rendered.yaml` (for Brightstaff + WASM filter configs)
+
+When adding new config fields: update the schema, the template (if Envoy-relevant), the Python generator, AND the Rust `Configuration` struct in `common/src/configuration.rs`.
+
+### 7. Internal Model Names
+
+These are reserved model names used internally — do not conflict with them:
+- `Arch-Function` — intent classification / function calling
+- `Arch-Router` — used as a route name prefix, not as a direct model name
+- `Plano-Orchestrator` — agent selection orchestrator
+
+### 8. API Compatibility
+
+Brightstaff exposes the following endpoints, compatible with the OpenAI and Anthropic APIs:
+- `/v1/chat/completions` — Chat Completions API
+- `/v1/messages` — Anthropic Messages API compatible
+- `/v1/responses` — OpenAI Responses API with state management
+- `/function_calling` — Internal Arch-Function endpoint
+
+The `/agents/` prefix variants mirror these for agent orchestration.
+
+Do NOT change these path structures without updating `consts.rs`, Brightstaff router, and `envoy.template.yaml`.
+
+### 9. Streaming
+
+- LLM responses use SSE (Server-Sent Events) format: `data: {json}\n\n`
+- The `llm_gateway` WASM filter handles SSE stream reassembly across chunk boundaries via `SseStreamBuffer`
+- Brightstaff uses `mpsc` channels for streaming responses back to clients
+- Bedrock uses AWS Event Stream binary protocol — decoded by `hermesllm`
+
+### 10. Testing Conventions
+
+- WASM crates: unit tests run natively (`cargo test`), NOT under WASM runtime
+- Brightstaff: unit tests with `mockito` for HTTP mocking
+- E2E tests: separate `tests/` directory, run via GitHub Actions workflows
+- Config validation tests: `cli/test/test_config_generator.py`
+
+## File Layout Reference
+
+```
+crates/
+  Cargo.toml          # Workspace root
+  brightstaff/        # Native Rust HTTP server (Axum)
+  common/             # Shared types, config, HTTP, rate limiting
+  hermesllm/          # LLM protocol translation (pure Rust)
+  llm_gateway/        # WASM filter: provider routing, auth, rate limits
+  prompt_gateway/     # WASM filter: intent matching, guardrails
+config/
+  arch_config_schema.yaml   # User config JSON schema
+  envoy.template.yaml       # Jinja2 template → envoy.yaml
+  docker-compose.dev.yaml   # Dev environment
+cli/
+  planoai/                  # Python CLI (config generator, Docker management)
+```
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -0,0 +1,138 @@
+# AGENTS.md — Coding Agent Reference
+
+> This file is optimized for AI coding agents. It contains hard constraints, ownership rules, and patterns that must not be violated. For human-readable architecture, see `architecture.md`.
+
+---
+
+## System Overview (30-second version)
+
+Plano is an AI gateway. Client traffic enters **Envoy Proxy**, passes through two **WASM filters** (`prompt_gateway` → `llm_gateway`), and reaches **LLM providers**. A native Rust service (**Brightstaff**) handles intelligent routing and agent orchestration, but always communicates with the outside world **through Envoy**, never directly.
+
+---
+
+## Hard Rules — Never Violate These
+
+### Rule 1: All external I/O goes through Envoy
+- Brightstaff sends LLM requests to `localhost:12001` (Envoy egress listener)
+- Brightstaff sends agent/API requests to `localhost:11000` (Envoy outbound listener)
+- **NEVER** add `reqwest`/`hyper` calls from Brightstaff directly to external hosts
+- Routing is controlled by setting `x-arch-llm-provider-hint` or `x-arch-upstream` headers
+
+### Rule 2: WASM crates cannot use async runtimes
+- `prompt_gateway` and `llm_gateway` compile to `wasm32-wasip1`
+- **Forbidden in WASM crates:** `tokio`, `async-std`, `reqwest`, `hyper`, `std::net`, `std::fs`, `std::thread`
+- All I/O uses `proxy-wasm` SDK's `dispatch_http_call` (callback-based, not async/await)
+- `governor` must use `no_std` feature; `rand` is fine
+
+### Rule 3: Dependency direction is one-way
+```
+prompt_gateway ──► common ──► hermesllm
+llm_gateway    ──► common ──► hermesllm
+                   llm_gateway ──► hermesllm (direct)
+brightstaff    ──► hermesllm (direct), common (non-WASM parts only)
+```
+- `hermesllm` has **zero** dependencies on `proxy-wasm` or `common`
+- `common` has **zero** dependencies on `brightstaff`
+- WASM crates have **zero** dependencies on `brightstaff`
+
+### Rule 4: Header names are canonical constants
+All `x-arch-*` headers are defined in `common/src/consts.rs`. Changing a header name requires updating:
+1. `common/src/consts.rs`
+2. `config/envoy.template.yaml`
+3. Every Rust consumer (grep for the old constant name)
+
+### Rule 5: Config changes require a 4-file update
+Adding a new user-facing config field:
+1. `config/arch_config_schema.yaml` — JSON schema
+2. `config/envoy.template.yaml` — Jinja2 template (if Envoy needs it)
+3. `cli/planoai/config_generator.py` — Python validation/rendering
+4. `common/src/configuration.rs` — Rust struct
+
+### Rule 6: API paths are load-bearing
+These paths appear in `consts.rs`, Brightstaff's Axum router, and `envoy.template.yaml`:
+- `/v1/chat/completions`, `/v1/messages`, `/v1/responses`
+- `/agents/v1/chat/completions`, `/agents/v1/messages`, `/agents/v1/responses`
+- `/function_calling`, `/v1/models`, `/healthz`
+
+Changing them breaks routing. Update all three locations simultaneously.
+
+### Rule 7: Reserved model names
+- `Arch-Function` — used for intent classification / function calling
+- `Plano-Orchestrator` — used for agent selection
+- Any model prefixed with `Arch` is treated as internal
+
+---
+
+## Crate Ownership Map
+
+| Crate | Type | Target | Owner of |
+|---|---|---|---|
+| `brightstaff` | Binary (Axum) | Native | LLM routing, agent orchestration, state management, observability |
+| `prompt_gateway` | cdylib (WASM) | wasm32-wasip1 | Intent matching, prompt guards, function calling, API orchestration |
+| `llm_gateway` | cdylib (WASM) | wasm32-wasip1 | Provider routing, auth injection, rate limiting, request/response translation |
+| `common` | Library | Both | Config types, HTTP client trait, constants, rate limiting, tokenization, shared OpenAI types |
+| `hermesllm` | Library | Both | LLM protocol translation (OpenAI ↔ Anthropic ↔ Bedrock ↔ Gemini), SSE parsing, provider model catalog |
+
+---
+
+## Where to Put New Code
+
+| You want to... | Put it in... | Why |
+|---|---|---|
+| Add a new LLM provider | `hermesllm` (protocol), `common/src/configuration.rs` (config type), `config/arch_config_schema.yaml`, `config/envoy.template.yaml` (cluster) | Provider translation is hermesllm's job |
+| Add a new header for inter-component communication | `common/src/consts.rs` + `config/envoy.template.yaml` | Canonical source for all header names |
+| Add rate limiting logic | `common/src/ratelimit.rs` | Shared between WASM filters |
+| Add a new API endpoint to Brightstaff | `brightstaff/src/handlers/` + `brightstaff/src/main.rs` (router) | Axum handler + route registration |
+| Add prompt guardrail logic | `prompt_gateway/src/stream_context.rs` or `prompt_gateway/src/http_context.rs` | Runs inline in Envoy |
+| Add request/response transformation for a provider | `hermesllm/src/transforms/` | Pure Rust, no WASM dependency |
+| Add config validation | `cli/planoai/config_generator.py` + `config/arch_config_schema.yaml` | Python validates before Envoy starts |
+| Add a new metric | `common/src/stats.rs` (WASM) or `brightstaff/src/tracing/` (native) | Different metric systems |
+
+---
+
+## Build & Test Quick Reference
+
+```bash
+# Full build (WASM + native)
+cd crates && ./build.sh
+
+# WASM filters only
+cargo build --release --target wasm32-wasip1 -p prompt_gateway -p llm_gateway
+
+# Brightstaff only
+cargo build --release -p brightstaff
+
+# Run all Rust tests (native)
+cargo test --workspace
+
+# Run config generator tests
+cd cli && python -m pytest test/
+
+# Dev environment (Docker Compose)
+cd config && docker compose -f docker-compose.dev.yaml up
+```
+
+---
+
+## Envoy Listener Map (for routing decisions)
+
+```
+:10000 (ingress)          → passthrough to :10001
+:10001 (prompt+llm)       → prompt_gateway.wasm → llm_gateway.wasm → LLM provider
+:11000 (outbound API)     → developer APIs & agents (by x-arch-upstream header)
+:agent_port (per-config)  → brightstaff :9091 /agents/...
+:12000 (LLM egress)       → brightstaff :9091 (routing decision)
+:12001 (LLM egress final) → llm_gateway.wasm → LLM provider
+```
+
+---
+
+## Common Mistakes to Avoid
+
+1. **Adding `tokio` to a WASM crate's Cargo.toml** — Will fail to compile for wasm32-wasip1
+2. **Making Brightstaff call OpenAI directly** — Must go through Envoy at localhost:12001
+3. **Adding a config field only in Rust** — Schema, Python generator, and template also need updates
+4. **Changing a header name in one place** — Must grep and update consts.rs, envoy.template.yaml, and all consumers
+5. **Adding `hermesllm` dependency on `proxy-wasm`** — hermesllm must stay pure Rust
+6. **Creating a new Envoy cluster without updating the template** — Envoy won't know about it
+7. **Forgetting `no_std` feature flag on `governor` in WASM crates** — std governor uses threads
--- a/architecture.md
+++ b/architecture.md
@@ -0,0 +1,411 @@
+# Plano (ArchGW) — High-Level Architecture
+
+## Overview
+
+Plano is an AI-native gateway built on **Envoy Proxy**, extended with custom **WebAssembly (WASM) filters** and a native Rust service called **Brightstaff**. It acts as an intelligent intermediary between client applications, AI agents, and LLM providers — handling intent-based routing, prompt guardrails, function calling, agent orchestration, rate limiting, and multi-provider LLM translation.
+
+```
+┌─────────────────────────────────────────────────────────────────────────────┐
+│                              Plano Gateway                                  │
+│                                                                             │
+│   ┌──────────────────────────────────────────────────────────────────────┐  │
+│   │                         Envoy Proxy (L7)                             │  │
+│   │                                                                      │  │
+│   │   ┌──────────────────┐       ┌───────────────────┐                   │  │
+│   │   │  prompt_gateway  │──────▶│   llm_gateway     │                   │  │
+│   │   │    (WASM)        │       │     (WASM)        │                   │  │
+│   │   │                  │       │                   │                   │  │
+│   │   │ • Intent matching│       │ • Provider routing│                   │  │
+│   │   │ • Guardrails     │       │ • Auth injection  │                   │  │
+│   │   │ • Function call  │       │ • Rate limiting   │                   │  │
+│   │   │ • Prompt targets │       │ • API translation │                   │  │
+│   │   └──────────────────┘       └────────┬──────────┘                   │  │
+│   │                                       │                              │  │
+│   └───────────────────────────────────────┼──────────────────────────────┘  │
+│                                           │                                 │
+│   ┌───────────────────────────────────────┼──────────────────────────────┐  │
+│   │                    Brightstaff (Rust HTTP Server :9091)               │  │
+│   │                                                                      │  │
+│   │   • LLM request routing (Arch-Router model)                          │  │
+│   │   • Agent orchestration (Plano-Orchestrator model)                   │  │
+│   │   • Conversation state management (memory / PostgreSQL)              │  │
+│   │   • Function calling handler (Arch-Function model)                   │  │
+│   │   • Observability & signal analysis                                  │  │
+│   └──────────────────────────────────────────────────────────────────────┘  │
+│                                                                             │
+└─────────────────────────────────────────────────────────────────────────────┘
+         │                    │                         │
+         ▼                    ▼                         ▼
+   ┌──────────┐      ┌──────────────┐          ┌──────────────┐
+   │  Agents  │      │ Developer    │          │ LLM Providers│
+   │(MCP/HTTP)│      │   APIs       │          │ (OpenAI, etc)│
+   └──────────┘      └──────────────┘          └──────────────┘
+```
+
+---
+
+## The Role of Envoy
+
+Envoy is the **data plane** of Plano. All client traffic — both inbound prompts and outbound LLM calls — flows through Envoy. It provides:
+
+- **L7 HTTP routing** based on paths and custom headers
+- **WASM filter execution** for inline request/response transformation
+- **Connection pooling and TLS** to upstream LLM providers
+- **Retry policies** for resilience
+- **Compression/decompression** for LLM streaming responses
+
+### Envoy Listeners
+
+Envoy defines **six listener types**, each serving a distinct role in the request flow:
+
+| Listener | Port | Direction | Purpose |
+|---|---|---|---|
+| `ingress_traffic` | 10000 (configurable) | Inbound | Client-facing entry point. Forwards all traffic to the prompt gateway listener. |
+| `ingress_traffic_prompt` | 10001 | Inbound | **Core processing listener.** Runs both WASM filters (`prompt_gateway` → `llm_gateway`). Routes to LLM providers by `x-arch-llm-provider` header. |
+| `outbound_api_traffic` | 11000 | Internal | Routes to upstream developer APIs and agents using `x-arch-upstream` header. No WASM filters. |
+| Agent listeners | Per-config | Inbound | One per agent listener in config. Routes to Brightstaff with `/agents/` path prefix. |
+| `egress_traffic` | 12000 (configurable) | Outbound | LLM gateway entry for agents/services reaching LLMs. Routes to Brightstaff for routing decisions. |
+| `egress_traffic_llm` | 12001 | Outbound | **Final outbound LLM listener.** Runs `llm_gateway.wasm` for auth injection, provider translation, and rate limiting before reaching the actual LLM provider. |
+
+### Envoy Clusters
+
+Envoy manages connections to all upstream services:
+
+**LLM Provider Clusters** — Pre-configured TLS clusters for: OpenAI, Anthropic (Claude), Groq, Mistral, DeepSeek, Gemini, xAI, MoonshotAI, Zhipu, Together AI, and Katanemo's hosted Arch models. Custom-URL providers (e.g., Azure OpenAI, Ollama) are dynamically added from config.
+
+**Internal Clusters:**
+
+| Cluster | Target | Purpose |
+|---|---|---|
+| `bright_staff` | localhost:9091 | The Brightstaff Rust service |
+| `arch_prompt_gateway_listener` | localhost:10001 | Internal forwarding from ingress |
+| `arch_listener_llm` | localhost:12001 | Internal forwarding for LLM egress |
+| `arch_internal` | localhost:11000 | Outbound API router |
+
+**Dynamic Clusters** — Generated from `endpoints` and `agents` config sections (developer APIs, agent services).
+
+### Custom Headers Used for Routing
+
+| Header | Set By | Used By | Purpose |
+|---|---|---|---|
+| `x-arch-llm-provider` | WASM filters | Envoy routes | Selects the LLM provider cluster |
+| `x-arch-llm-provider-hint` | Brightstaff | llm_gateway | Hints which provider/model to use |
+| `x-arch-upstream` / `x-arch-upstream-host` | WASM filters / Brightstaff | Envoy routes | Targets a specific agent or API endpoint |
+| `x-arch-streaming-request` | Brightstaff | llm_gateway | Indicates streaming mode |
+| `x-arch-state` | prompt_gateway | prompt_gateway | Carries multi-turn conversation state |
+| `x-arch-tool-call-message` | prompt_gateway | prompt_gateway | Carries tool call metadata |
+| `x-arch-api-response-message` | prompt_gateway | prompt_gateway | Carries developer API response data |
+| `x-arch-agent-listener-name` | Envoy | Brightstaff | Identifies which agent listener a request arrived on |
+
+---
+
+## Request Flows
+
+### Flow 1: Direct LLM Chat (`POST /v1/chat/completions`)
+
+This is the standard path for client-to-LLM requests with optional intent matching and routing.
+
+```
+Client
+  │
+  ▼
+[Envoy :10000 — ingress_traffic]
+  │  (simple passthrough)
+  ▼
+[Envoy :10001 — ingress_traffic_prompt]
+  │
+  ├── prompt_gateway.wasm
+  │     1. Parse ChatCompletions request
+  │     2. Convert prompt_targets → tool definitions
+  │     3. Dispatch to Arch-Function model at /function_calling
+  │     4. If intent matched:
+  │         → Call developer API endpoint via :11000
+  │         → Augment prompt with API response context
+  │     5. If no intent matched:
+  │         → Prepend system prompt, forward to LLM
+  │
+  ├── llm_gateway.wasm
+  │     1. Select LLM provider (from header hint or default)
+  │     2. Enforce rate limits (token-based via tiktoken)
+  │     3. Inject auth credentials (Bearer / x-api-key)
+  │     4. Transform request format (OpenAI ↔ Anthropic ↔ Bedrock)
+  │     5. Rewrite upstream path for target provider
+  │
+  ▼
+LLM Provider (OpenAI, Anthropic, Gemini, etc.)
+  │
+  ▼
+(Response flows back through llm_gateway for format translation)
+  │
+  ▼
+Client
+```
+
+### Flow 2: Brightstaff LLM Routing (`POST /v1/chat/completions` via egress)
+
+When requests reach Brightstaff (directly or via the `egress_traffic` listener on `:12000`), it performs intelligent model routing.
+
+```
+Client / Agent
+  │
+  ▼
+[Brightstaff :9091]
+  │
+  ├── Resolve model aliases
+  ├── Validate model exists in configured providers
+  ├── Retrieve conversation state (if using Responses API)
+  │
+  ├── Call Arch-Router model ──► [Envoy :12001]
+  │     (determines best model/provider for the request    ──► LLM Provider
+  │      based on routing_preferences in config)
+  │
+  ├── Forward actual request ──► [Envoy :12001]
+  │     (with x-arch-llm-provider-hint header)             ──► LLM Provider
+  │
+  ▼
+[Stream response back with metrics, signal analysis, state capture]
+  │
+  ▼
+Client / Agent
+```
+
+### Flow 3: Agent Orchestration (`POST /agents/v1/chat/completions`)
+
+The agentic flow where Brightstaff selects and chains agents based on user intent.
+
+```
+Client
+  │
+  ▼
+[Envoy — Agent Listener :configurable]
+  │  (path rewrite: /agents/...)
+  ▼
+[Brightstaff :9091]
+  │
+  ├── Identify listener from x-arch-agent-listener-name
+  ├── Find configured agents for this listener
+  │
+  ├── If multiple agents:
+  │     Call Plano-Orchestrator model ──► [Envoy :12001] ──► LLM
+  │     (selects which agents to run and in what order)
+  │
+  ├── For each selected agent:
+  │     │
+  │     ├── Run filter chain (pre-processing)
+  │     │     └── [Envoy :11000] ──► Filter Service (MCP/HTTP)
+  │     │
+  │     ├── Invoke agent
+  │     │     └── [Envoy :11000] ──► Agent Service (MCP/HTTP)
+  │     │
+  │     ├── If intermediate agent:
+  │     │     Collect full response → feed as input to next agent
+  │     │
+  │     └── If final agent:
+  │           Stream response directly to client
+  │
+  ▼
+Client
+```
+
+---
+
+## Brightstaff Service
+
+Brightstaff is a native Rust HTTP server (`0.0.0.0:9091`) built with Axum. It is the **control plane brain** of Plano — while Envoy handles the data plane (proxying, filtering), Brightstaff handles the intelligent decision-making.
+
+### Endpoints
+
+| Method | Path | Handler | Purpose |
+|---|---|---|---|
+| `POST` | `/v1/chat/completions` | `llm_chat` | LLM passthrough with model routing |
+| `POST` | `/v1/messages` | `llm_chat` | Anthropic Messages API compat |
+| `POST` | `/v1/responses` | `llm_chat` | OpenAI Responses API with state |
+| `POST` | `/agents/v1/chat/completions` | `agent_chat` | Agent orchestration pipeline |
+| `POST` | `/agents/v1/messages` | `agent_chat` | Agent orchestration (Messages) |
+| `POST` | `/agents/v1/responses` | `agent_chat` | Agent orchestration (Responses) |
+| `POST` | `/function_calling` | `function_calling_chat_handler` | Arch-Function tool calling |
+| `GET` | `/v1/models` | `list_models` | List configured LLM models |
+
+### Core Components
+
+#### RouterService (LLM Routing)
+Uses the **Arch-Router** model — a specialized LLM that determines which provider/model best matches a user's request based on `routing_preferences` defined in config. Constructs a system prompt describing available routes, sends the conversation, and parses a `{"route": "route_name"}` response.
+
+#### OrchestratorService (Agent Selection)
+Uses the **Plano-Orchestrator** model to determine which agent(s) should handle a request when multiple agents are available on a listener. Returns an ordered list of agents: `{"route": ["agent1", "agent2"]}`.
+
+#### PipelineProcessor (Agent Execution)
+Manages the sequential execution of agent filter chains and agent invocations:
+- **MCP agents**: JSON-RPC 2.0 protocol over SSE transport (`initialize` → `notifications/initialized` → `tools/call`)
+- **HTTP agents**: Direct POST with message array
+- Routes through Envoy at `:11000` using the `x-arch-upstream` header
+
+#### Function Calling Handler
+Specialized handler for the **Arch-Function** model:
+- Converts OpenAI tool definitions into prompts
+- Parses structured JSON responses (tool_calls, clarifications)
+- Includes **hallucination detection** using entropy/varentropy/probability thresholds from logprobs
+
+#### State Management
+Manages conversation state for the OpenAI Responses API (`/v1/responses`):
+- **Memory backend** — `HashMap` behind `Arc<RwLock>` for single-instance dev
+- **PostgreSQL backend** — Persistent storage with upsert semantics
+- `ResponsesStateProcessor` intercepts streaming responses to capture `response_id` and output items, storing them asynchronously for future conversation chaining via `previous_response_id`
+
+#### Signal Analysis (Observability)
+Analyzes conversation patterns for interaction quality:
+- Frustration, repetition/looping, escalation requests, positive feedback, repair patterns
+- Quality graded as Good / Fair / Poor / Severe
+- Spans with concerning signals are flagged with indicators for monitoring
+
+---
+
+## Rust Crate Architecture
+
+```
+┌─────────────────────────────────────────────────────────────┐
+│                     brightstaff (binary)                     │
+│                                                             │
+│   Native Rust HTTP server — routing, orchestration, state   │
+│   Depends on: hermesllm, common (non-WASM parts)           │
+└─────────────────────────────────────────────────────────────┘
+
+┌──────────────────────┐    ┌───────────────────────┐
+│   prompt_gateway     │    │   llm_gateway         │
+│      (WASM)          │    │      (WASM)           │
+│                      │    │                       │
+│  Intent matching     │    │  Provider routing     │
+│  Prompt guards       │    │  Auth injection       │
+│  Function calling    │    │  Rate limiting        │
+│  API orchestration   │    │  Request/Response     │
+│                      │    │  format translation   │
+├──────────────────────┤    ├───────────────────────┤
+│  depends on: common  │    │  depends on: common,  │
+│                      │    │  hermesllm            │
+└──────────┬───────────┘    └──────────┬────────────┘
+           │                           │
+           ▼                           ▼
+┌──────────────────────────────────────────────────────────────┐
+│                        common (lib)                          │
+│                                                             │
+│  Configuration types, LlmProviders, HTTP client trait,      │
+│  rate limiting (governor), tokenization (tiktoken),         │
+│  OpenAI API types, routing, metrics, tracing, constants     │
+│  Depends on: hermesllm                                      │
+└─────────────────────────────┬───────────────────────────────┘
+                              │
+                              ▼
+┌──────────────────────────────────────────────────────────────┐
+│                       hermesllm (lib)                        │
+│                                                             │
+│  LLM protocol abstraction — cross-provider request/response │
+│  translation (OpenAI ↔ Anthropic ↔ Bedrock ↔ Gemini)       │
+│  SSE stream parsing, provider model catalog, endpoint       │
+│  mapping. No proxy-wasm dependency (pure Rust).             │
+└──────────────────────────────────────────────────────────────┘
+```
+
+### WASM Compilation
+
+Both `prompt_gateway` and `llm_gateway` compile to `cdylib` targets for `wasm32-wasip1` using the `proxy-wasm` SDK (v0.2.1). Envoy loads them via its V8 WASM runtime. Each filter implements `RootContext` (for config parsing and per-stream creation) and `HttpContext` (for per-request processing).
+
+---
+
+## Deployment Architecture
+
+All components run inside a single container managed by **Supervisord**:
+
+```
+┌─────────────────────────────────────────────────────────────┐
+│                     Docker Container                         │
+│                                                             │
+│  ┌─────────────────────────────────────────────────────┐    │
+│  │                   Supervisord                        │    │
+│  │                                                     │    │
+│  │  ┌─────────────┐  ┌───────────────┐  ┌───────────┐ │    │
+│  │  │ Brightstaff  │  │  Envoy Proxy  │  │  Log Tail │ │    │
+│  │  │  (Rust)      │  │  + WASM       │  │           │ │    │
+│  │  │  :9091       │  │  :10000-12001 │  │           │ │    │
+│  │  └─────────────┘  └───────────────┘  └───────────┘ │    │
+│  └─────────────────────────────────────────────────────┘    │
+│                                                             │
+│  Startup sequence:                                          │
+│   1. config_generator.py validates arch_config.yaml         │
+│   2. Renders envoy.template.yaml → envoy.yaml (Jinja2)     │
+│   3. Starts Brightstaff + Envoy in parallel                 │
+│                                                             │
+└─────────────────────────────────────────────────────────────┘
+```
+
+**Docker multi-stage build:**
+1. `deps` — Rust 1.93.0 with `wasm32-wasip1` target, dependency pre-compilation
+2. `wasm-builder` — Builds `prompt_gateway.wasm` + `llm_gateway.wasm` (release)
+3. `brightstaff-builder` — Builds the `brightstaff` native binary (release)
+4. `envoy` — Pulls `envoyproxy/envoy:v1.37.0`
+5. `arch` (final) — Python 3.13.6-slim base with Envoy binary, WASM plugins, Brightstaff binary, and the `planoai` CLI
+
+---
+
+## Configuration Pipeline
+
+User-facing configuration flows through a generation pipeline before reaching Envoy and Brightstaff:
+
+```
+arch_config.yaml (user-authored)
+        │
+        ▼
+config_generator.py (Python CLI)
+  1. Validate against arch_config_schema.yaml (JSON Schema)
+  2. Normalize legacy formats (llm_providers → model_providers)
+  3. Parse agents, filters, endpoints → infer Envoy clusters
+  4. Parse model_providers → validate provider/model format
+  5. Auto-add internal models (arch-function, arch-router, plano-orchestrator)
+  6. Validate model aliases, routing preferences, prompt target endpoints
+        │
+        ├──► envoy.yaml (rendered from envoy.template.yaml via Jinja2)
+        │      → consumed by Envoy
+        │
+        └──► arch_config_rendered.yaml
+               → consumed by Brightstaff
+               → injected into WASM filter configs
+```
+
+### Key Config Sections
+
+| Section | Consumed By | Purpose |
+|---|---|---|
+| `model_providers` | llm_gateway, Brightstaff | LLM provider definitions with models, auth, routing preferences |
+| `prompt_targets` | prompt_gateway | Intent-to-API mappings with parameter schemas |
+| `prompt_guards` | prompt_gateway | Input guardrails (jailbreak detection) |
+| `endpoints` | prompt_gateway, Envoy | Named upstream API endpoint definitions |
+| `agents` | Brightstaff, Envoy | Agent service definitions (id, URL, type) |
+| `listeners` | Brightstaff, Envoy | Listener configs binding agents to ports |
+| `ratelimits` | llm_gateway | Per-model rate limits with token-based quotas |
+| `routing` | Brightstaff | LLM routing model/provider config |
+| `model_aliases` | Brightstaff | Friendly name → provider/model mappings |
+| `state_storage` | Brightstaff | Conversation state backend (memory / postgres) |
+| `tracing` | All components | OpenTelemetry config (sampling, OTLP endpoint) |
+| `overrides` | prompt_gateway, Brightstaff | Tuning (intent threshold, agent orchestrator toggle) |
+
+---
+
+## Supported LLM Providers
+
+| Provider | Cluster | Auth Method |
+|---|---|---|
+| OpenAI | api.openai.com | Bearer token |
+| Anthropic (Claude) | api.anthropic.com | x-api-key header |
+| Google (Gemini) | generativelanguage.googleapis.com | API key in URL |
+| Groq | api.groq.com | Bearer token |
+| Mistral | api.mistral.ai | Bearer token |
+| DeepSeek | api.deepseek.com | Bearer token |
+| xAI | api.x.ai | Bearer token |
+| Together AI | api.together.xyz | Bearer token |
+| MoonshotAI | api.moonshot.ai | Bearer token |
+| Zhipu | open.bigmodel.cn | Bearer token |
+| Amazon Bedrock | Custom base_url | AWS Sig v4 |
+| Azure OpenAI | Custom base_url | Bearer / API key |
+| Ollama | Custom base_url | None |
+| Katanemo (Arch) | archfc.katanemo.dev | Bearer token |
+
+The `hermesllm` crate handles **cross-provider request/response translation** so clients can use a single API format (typically OpenAI-compatible) regardless of which upstream provider serves the request.
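
The auth methods in the provider table can be modelled as a small enum. What follows is a minimal sketch, not code from the repository: `AuthMethod` and `apply_auth` are hypothetical names, the `key` query parameter is an assumption for the "API key in URL" case, and AWS Sig v4 is left out because it requires request signing rather than a single header.

```rust
/// How a credential is attached to an upstream request (mirrors the table above).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum AuthMethod {
    Bearer,
    XApiKey,
    ApiKeyInUrl,
    None,
}

/// Hypothetical helper: returns the header to add (if any) and the final path.
pub fn apply_auth(
    method: AuthMethod,
    key: &str,
    path: &str,
) -> (Option<(String, String)>, String) {
    match method {
        AuthMethod::Bearer => (
            Some(("authorization".to_string(), format!("Bearer {key}"))),
            path.to_string(),
        ),
        AuthMethod::XApiKey => (
            Some(("x-api-key".to_string(), key.to_string())),
            path.to_string(),
        ),
        AuthMethod::ApiKeyInUrl => {
            // Append to an existing query string if there is one.
            let sep = if path.contains('?') { '&' } else { '?' };
            (None, format!("{path}{sep}key={key}"))
        }
        AuthMethod::None => (None, path.to_string()),
    }
}
```

Keeping the credential logic in one `match` means a new provider only needs a row in the table and, at most, one new variant.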
--- a/crates/README.md
+++ b/crates/README.md
@ -0,0 +1,233 @@
+# Plano Rust Crates
+
+This workspace contains 5 Rust crates that form the core of the Plano AI gateway. They are organized by compilation target and responsibility.
+
+## Workspace Layout
+
+```
+crates/
+├── Cargo.toml          # Workspace root (resolver = "2")
+├── build.sh            # Builds WASM filters + native binary
+├── brightstaff/        # Native Rust HTTP server (Axum)
+├── common/             # Shared library (WASM-compatible)
+├── hermesllm/          # LLM protocol translation (pure Rust)
+├── llm_gateway/        # WASM filter: LLM routing & auth
+└── prompt_gateway/     # WASM filter: intent matching & guardrails
+```
+
+---
+
+## Crate Details
+
+### `prompt_gateway` — Inbound Prompt Processing
+
+| | |
+|---|---|
+| **Type** | `cdylib` (WASM filter) |
+| **Target** | `wasm32-wasip1` |
+| **Envoy listener** | `ingress_traffic_prompt` (:10001) |
+| **Root ID** | `prompt_gateway` |
+| **Depends on** | `common`, `proxy-wasm` |
+
+**Responsibilities:**
+- Intercepts incoming chat completion requests
+- Converts `prompt_targets` into OpenAI tool definitions
+- Dispatches to `Arch-Function` model for intent classification
+- If intent matches: calls developer API endpoints, augments prompt with response context
+- If no match: prepends system prompt, forwards to upstream LLM
+- Manages multi-turn state via `x-arch-state` header
+- Applies `prompt_guards` (jailbreak detection)
+
+**Key modules:**
+- `filter_context.rs` — RootContext, config parsing
+- `http_context.rs` — Request interception, tool definition construction
+- `stream_context.rs` — Core orchestration (intent matching, API calls, response handling)
+- `tools.rs` — URL path/query parameter substitution for API calls
+
+**Constraints:**
+- No `tokio`, `async/await`, threads, or network sockets
+- All HTTP calls via `proxy-wasm` `dispatch_http_call`
+
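
The URL path/query substitution attributed to `tools.rs` could look roughly like the sketch below. It is an illustration under assumptions, not the crate's code: the `{name}` placeholder syntax and the `substitute_path` function are hypothetical, and values are inserted without percent-encoding.

```rust
use std::collections::BTreeMap;

/// Replaces `{name}` placeholders in `path` with values from `params`.
/// Parameters that have no placeholder are appended as a query string.
/// Values are inserted as-is; a real implementation would percent-encode them.
pub fn substitute_path(path: &str, params: &BTreeMap<String, String>) -> String {
    let mut out = path.to_string();
    let mut query = Vec::new();
    for (name, value) in params {
        let placeholder = format!("{{{name}}}");
        if out.contains(&placeholder) {
            out = out.replace(&placeholder, value);
        } else {
            query.push(format!("{name}={value}"));
        }
    }
    if !query.is_empty() {
        out.push(if out.contains('?') { '&' } else { '?' });
        out.push_str(&query.join("&"));
    }
    out
}
```

A `BTreeMap` is used so that leftover parameters are emitted in a stable order, which keeps generated URLs reproducible in tests.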
+---
+
+### `llm_gateway` — LLM Provider Routing & Translation
+
+| | |
+|---|---|
+| **Type** | `cdylib` (WASM filter) |
+| **Target** | `wasm32-wasip1` |
+| **Envoy listeners** | `ingress_traffic_prompt` (:10001), `egress_traffic_llm` (:12001) |
+| **Root ID** | `llm_gateway` |
+| **Depends on** | `common`, `hermesllm`, `proxy-wasm` |
+
+**Responsibilities:**
+- Selects LLM provider based on `x-arch-llm-provider-hint` header or default
+- Injects authentication credentials (Bearer token, x-api-key, passthrough)
+- Rewrites request path for target provider API
+- Transforms request/response formats between providers (OpenAI ↔ Anthropic ↔ Bedrock) via `hermesllm`
+- Enforces token-based rate limits (`governor` with `no_std`)
+- Handles SSE stream reassembly across chunk boundaries (`SseStreamBuffer`)
+- Records metrics: TTFT, tokens/sec, request latency, rate-limited count
+
+**Key modules:**
+- `filter_context.rs` — RootContext, provider & rate limit initialization
+- `stream_context.rs` — Request/response transformation, auth, rate limiting, streaming
+- `metrics.rs` — Gauge, counter, histogram definitions
+
+**Constraints:**
+- Same WASM constraints as `prompt_gateway`
+- Uses `hermesllm` for protocol translation — do NOT duplicate translation logic here
+
+---
+
+### `common` — Shared Types & Utilities
+
+| | |
+|---|---|
+| **Type** | `lib` |
+| **Target** | Both native and `wasm32-wasip1` |
+| **Depends on** | `hermesllm`, `proxy-wasm`, `governor` (no_std), `tiktoken-rs` |
+
+**Responsibilities:**
+- Central configuration schema (`Configuration`, `LlmProvider`, `PromptTarget`, `PromptGuards`, etc.)
+- `LlmProviders` collection — provider lookup with slug matching and wildcard expansion
+- HTTP client trait wrapping `proxy-wasm` `dispatch_http_call`
+- All `x-arch-*` header constants and path constants (`consts.rs`)
+- Token-based rate limiting (`governor`, keyed by model + header selector)
+- Token counting via `tiktoken-rs`
+- OpenAI-compatible API types (`ChatCompletionsRequest`, `Message`, `ToolCall`, etc.)
+- Error types (`ClientError`, `ServerError`)
+- Metrics primitives (`Gauge`, `Counter`, `Histogram`)
+- URL path parameter substitution
+- PII obfuscation for logging
+
+**Key modules:**
+- `configuration.rs` — All config structs, deserialization, validation
+- `consts.rs` — Canonical header names, paths, timeouts, cluster names
+- `llm_providers.rs` — Provider collection with lookup logic
+- `ratelimit.rs` — Token-based rate limiter (global `OnceLock`)
+- `http.rs` — `Client` trait for WASM HTTP dispatch
+- `tokenizer.rs` — Token counting (tiktoken, GPT-4 fallback)
+
+**Constraints:**
+- Must compile for `wasm32-wasip1` — no std networking, no threads
+- Must NOT depend on `brightstaff`
+
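
Token-based rate limiting keyed by model plus a header selector can be pictured with a plain token bucket. This sketch is a stand-in under assumptions: it does not use `governor` or reproduce its algorithm, `TokenLimiter` is a hypothetical type, and the caller passes the clock value so the example needs no timer.

```rust
use std::collections::HashMap;

/// One bucket per (model, selector) pair.
struct Bucket {
    tokens: f64,
    last: f64,
}

pub struct TokenLimiter {
    capacity: f64,
    refill_per_sec: f64,
    buckets: HashMap<(String, String), Bucket>,
}

impl TokenLimiter {
    pub fn new(capacity: f64, refill_per_sec: f64) -> Self {
        Self { capacity, refill_per_sec, buckets: HashMap::new() }
    }

    /// Tries to spend `cost` tokens for (model, selector) at time `now` (seconds).
    /// Returns false when the bucket does not hold enough tokens.
    pub fn check(&mut self, model: &str, selector: &str, cost: f64, now: f64) -> bool {
        let (capacity, rate) = (self.capacity, self.refill_per_sec);
        let bucket = self
            .buckets
            .entry((model.to_string(), selector.to_string()))
            .or_insert(Bucket { tokens: capacity, last: now });
        // Refill for the elapsed time, capped at the bucket capacity.
        let elapsed = (now - bucket.last).max(0.0);
        bucket.tokens = (bucket.tokens + elapsed * rate).min(capacity);
        bucket.last = now;
        if bucket.tokens >= cost {
            bucket.tokens -= cost;
            true
        } else {
            false
        }
    }
}
```

Here `cost` would be the token count of the request, and a `false` result corresponds to the filter rejecting the call as rate limited.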
+---
+
+### `hermesllm` — LLM Protocol Translation
+
+| | |
+|---|---|
+| **Type** | `lib` |
+| **Target** | Native and `wasm32-wasip1` (linked into `common` and `llm_gateway`, so no WASM-incompatible deps) |
+| **Depends on** | `serde`, `serde_json`, `aws-smithy-eventstream`, `uuid` |
+
+**Responsibilities:**
+- Cross-provider request/response translation (OpenAI ↔ Anthropic ↔ Amazon Bedrock ↔ Gemini)
+- `ProviderRequest` / `ProviderResponse` / `ProviderStreamResponse` traits
+- SSE stream parsing (`SseStreamIter`, `SseStreamBuffer`, `SseChunkProcessor`)
+- AWS Event Stream binary frame decoding (Bedrock)
+- Provider identification (`ProviderId` enum with model catalog from `provider_models.yaml`)
+- Target endpoint path rewriting (`/v1/chat/completions` → provider-specific paths)
+
+**Key modules:**
+- `apis/` — Format definitions: `openai.rs`, `anthropic.rs`, `amazon_bedrock.rs`, `openai_responses.rs`
+- `apis/streaming_shapes/` — SSE and binary stream parsing
+- `providers/` — `id.rs` (ProviderId), `request.rs`, `response.rs`, `streaming_response.rs`
+- `clients/endpoints.rs` — API path mapping
+- `transforms/` — Request/response transformations organized by direction
+
+**Constraints:**
+- **MUST NOT depend on `proxy-wasm` or `common`** — this is a pure Rust library
+- Must remain usable outside of the WASM/Envoy context
+- Optional `model-fetch` feature gates network dependencies (`ureq`)
+
+---
+
+### `brightstaff` — Native HTTP Server
+
+| | |
+|---|---|
+| **Type** | Binary (Axum) |
+| **Target** | Native only |
+| **Port** | `0.0.0.0:9091` |
+| **Depends on** | `hermesllm`, `common` (non-WASM parts), `tokio`, `axum`, `reqwest`, `opentelemetry` |
+
+**Responsibilities:**
+- LLM request routing via `Arch-Router` model (selects best provider/model)
+- Agent orchestration via `Plano-Orchestrator` model (selects and chains agents)
+- Agent execution pipeline: filter chains → agent invocation (MCP JSON-RPC or HTTP)
+- `Arch-Function` handler: tool calling with hallucination detection
+- Conversation state management for Responses API (memory or PostgreSQL)
+- Model alias resolution
+- OpenTelemetry tracing with per-component service names
+- Interaction signal analysis (frustration, repetition, escalation detection)
+
+**Key modules:**
+- `handlers/llm.rs` — LLM passthrough with routing
+- `handlers/agent_chat_completions.rs` — Agent orchestration entry point
+- `handlers/agent_selector.rs` — Agent selection logic
+- `handlers/pipeline_processor.rs` — Sequential agent/filter execution
+- `handlers/function_calling.rs` — Arch-Function tool calling
+- `router/llm_router.rs` — `RouterService` (Arch-Router model)
+- `router/plano_orchestrator.rs` — `OrchestratorService` (Plano-Orchestrator model)
+- `state/` — `StateStorage` trait, memory & PostgreSQL backends
+- `signals/` — Conversation quality analysis
+- `tracing/` — OpenTelemetry setup with custom service name routing
+
+**Constraints:**
+- All external calls go through Envoy (localhost:12001 for LLMs, localhost:11000 for agents)
+- Does NOT use `common`'s `proxy-wasm` Client trait — uses `reqwest` instead
+
+---
+
+## Dependency Graph
+
+```
+prompt_gateway ──► common ──► hermesllm
+llm_gateway ───┬► common ──► hermesllm
+               └► hermesllm
+brightstaff ───┬► hermesllm
+               └► common (config types only, not WASM code)
+
+hermesllm ────► (standalone — no proxy-wasm, no common)
+```
+
+**Direction is strictly enforced:**
+- Arrows point toward dependencies
+- No cycles allowed
+- `hermesllm` is the leaf node — it must never depend on any other workspace crate
+
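
The "no cycles" rule can be checked mechanically. The sketch below encodes the edges from the graph above and runs a depth-first search; `workspace_deps` and `has_cycle` are hypothetical helpers, not part of the workspace or of Cargo (which rejects cyclic crate dependencies on its own).

```rust
use std::collections::HashMap;

/// Workspace dependency edges as listed in the graph above.
pub fn workspace_deps() -> HashMap<&'static str, Vec<&'static str>> {
    HashMap::from([
        ("prompt_gateway", vec!["common"]),
        ("llm_gateway", vec!["common", "hermesllm"]),
        ("brightstaff", vec!["hermesllm", "common"]),
        ("common", vec!["hermesllm"]),
        ("hermesllm", vec![]),
    ])
}

/// Depth-first search with visit states; returns true if any cycle exists.
pub fn has_cycle(deps: &HashMap<&str, Vec<&str>>) -> bool {
    fn visit<'a>(
        node: &'a str,
        deps: &HashMap<&'a str, Vec<&'a str>>,
        state: &mut HashMap<&'a str, u8>,
    ) -> bool {
        match state.get(node).copied() {
            Some(1) => return true,  // node is on the current path: back edge
            Some(2) => return false, // node was fully explored earlier
            _ => {}
        }
        state.insert(node, 1);
        for &dep in deps.get(node).map(|v| v.as_slice()).unwrap_or(&[]) {
            if visit(dep, deps, state) {
                return true;
            }
        }
        state.insert(node, 2);
        false
    }

    let mut state = HashMap::new();
    deps.keys().any(|&node| visit(node, deps, &mut state))
}
```

Adding an edge from `hermesllm` back to `common` makes `has_cycle` return true, which is exactly the change the leaf-node rule forbids.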
+---
+
+## Build Commands
+
+```bash
+# Everything (recommended)
+./build.sh
+
+# Equivalent to:
+cargo build --release --target wasm32-wasip1 -p prompt_gateway -p llm_gateway
+cargo build --release -p brightstaff
+
+# Tests (all crates, native target)
+cargo test --workspace
+
+# Single crate test
+cargo test -p common
+cargo test -p hermesllm
+cargo test -p prompt_gateway
+cargo test -p llm_gateway
+cargo test -p brightstaff
+```
+
+## WASM Output Location
+
+After building, WASM filter binaries are at:
+```
+target/wasm32-wasip1/release/prompt_gateway.wasm
+target/wasm32-wasip1/release/llm_gateway.wasm
+```
+
+These are loaded by Envoy at startup from `/etc/envoy/proxy-wasm-plugins/` in the Docker image.
--- a/docs/ADR/001-envoy-as-data-plane.md
+++ b/docs/ADR/001-envoy-as-data-plane.md
@ -0,0 +1,35 @@
+# ADR 001: Envoy as the Data Plane
+
+**Status:** Accepted
+
+## Context
+
+Plano needs to proxy all traffic between clients, LLM providers, and developer APIs. The options were:
+1. Build a custom proxy from scratch in Rust (e.g., using `hyper`/`axum` directly)
+2. Use an existing L7 proxy (Envoy, NGINX, HAProxy) and extend it
+3. Use a service mesh sidecar approach
+
+We need: TLS termination, connection pooling, retry policies, load balancing, header-based routing, streaming support (SSE), compression, and observability — all at production quality.
+
+## Decision
+
+Use **Envoy Proxy** as the data plane. All external traffic — both inbound client requests and outbound LLM/API calls — flows through Envoy. The native Rust service (Brightstaff) never makes direct outbound connections to external hosts.
+
+## Consequences
+
+**Enables:**
+- Production-grade L7 proxying (TLS, HTTP/2, connection pooling, retries) without building it ourselves
+- WASM filter extension model for inline request/response processing
+- Standard observability (access logs, stats, tracing) out of the box
+- Header-based routing via Envoy's route configuration — no custom routing code needed for cluster selection
+- Hot-restart and graceful draining for zero-downtime updates
+
+**Requires:**
+- All Brightstaff external calls must go through Envoy listeners (localhost:12001 for LLMs, localhost:11000 for APIs)
+- Custom headers (`x-arch-*`) for routing decisions — Envoy matches on these in its route config
+- Envoy configuration must be generated from user config (Jinja2 template → envoy.yaml)
+- Team must understand Envoy's configuration model (listeners, clusters, filter chains)
+
+**Prevents:**
+- Direct HTTP calls from Brightstaff to external services (this is intentional — it ensures all traffic gets WASM filter processing, auth injection, rate limiting, and observability)
+- Simple single-binary deployment (we need Envoy + Brightstaff, managed by Supervisord)
--- a/docs/ADR/002-wasm-filters-over-native.md
+++ b/docs/ADR/002-wasm-filters-over-native.md
@ -0,0 +1,42 @@
+# ADR 002: WASM Filters Over Native Envoy Filters
+
+**Status:** Accepted
+
+## Context
+
+Envoy supports four extension mechanisms:
+1. **Native C++ filters** — compiled into the Envoy binary, highest performance
+2. **WASM filters** — compiled to WebAssembly, loaded at runtime via Envoy's WASM VM
+3. **Lua filters** — scripted, limited functionality
+4. **External processing (ext_proc)** — gRPC callout to an external service
+
+We need filters that: parse and transform LLM request/response bodies, perform intent matching, inject authentication headers, enforce rate limits, and handle SSE stream reassembly.
+
+## Decision
+
+Use **WASM filters** written in Rust, compiled to `wasm32-wasip1`, loaded by Envoy's V8 runtime. We have two filters:
+- `prompt_gateway.wasm` — inbound prompt processing (intent matching, guardrails, function calling)
+- `llm_gateway.wasm` — outbound LLM processing (provider routing, auth, rate limiting, format translation)
+
+## Consequences
+
+**Enables:**
+- Filters written in Rust with strong type safety and shared crates (`common`, `hermesllm`)
+- Runtime-loadable: no need to rebuild Envoy itself
+- Sandboxed execution: a filter crash doesn't bring down Envoy
+- Same language (Rust) for WASM filters and Brightstaff — shared types and logic via workspace crates
+
+**Requires:**
+- No `tokio`, `async/await`, threads, filesystem, or network sockets in WASM crates
+- All I/O must use `proxy-wasm` SDK's `dispatch_http_call` (callback-based)
+- Dependencies must be WASM-compatible: `governor` needs `no_std` feature, no crates using `std::net`
+- `crate-type = ["cdylib"]` — these build as shared libraries, not binaries
+- Testing runs natively (`cargo test`), but building requires `--target wasm32-wasip1`
+
+**Prevents:**
+- Using async Rust patterns in filter code (callback-based `on_http_call_response` instead)
+- Using popular HTTP client crates (`reqwest`, `hyper`) in filters
+- Easy debugging — WASM filters run inside Envoy's V8 VM with limited introspection
+
+**Trade-off vs. ext_proc:**
+External processing would allow using Brightstaff (native Rust with full async) for all processing, but would add network round-trips for every request. WASM filters run inline in Envoy's filter chain — zero additional network hops for common operations like auth injection and rate limiting.
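
The callback-based I/O model listed under "Requires" can be illustrated without the `proxy-wasm` SDK. In this sketch `FakeHost` stands in for the host side of `dispatch_http_call`, and `Filter`, `Pending`, and the `weather_api` cluster are hypothetical; only the token-then-callback shape is meant to carry over.

```rust
use std::collections::HashMap;

/// What the filter was waiting for when it issued a callout.
#[derive(Debug, PartialEq)]
pub enum Pending {
    IntentClassification,
    DeveloperApi,
}

/// Stand-in for the host: hands out a token ID per dispatched call.
#[derive(Default)]
pub struct FakeHost {
    next_token: u32,
}

impl FakeHost {
    pub fn dispatch(&mut self, _cluster: &str, _body: &str) -> u32 {
        self.next_token += 1;
        self.next_token
    }
}

#[derive(Default)]
pub struct Filter {
    pending: HashMap<u32, Pending>,
    pub log: Vec<String>,
}

impl Filter {
    /// Request phase: start a callout, remember why, then return.
    pub fn on_request(&mut self, host: &mut FakeHost, body: &str) {
        let token = host.dispatch("arch", body);
        self.pending.insert(token, Pending::IntentClassification);
    }

    /// Called later by the host with the token of the finished callout.
    pub fn on_http_call_response(&mut self, host: &mut FakeHost, token: u32, body: &str) {
        match self.pending.remove(&token) {
            Some(Pending::IntentClassification) => {
                let next = host.dispatch("weather_api", body);
                self.pending.insert(next, Pending::DeveloperApi);
                self.log.push("intent matched, calling API".to_string());
            }
            Some(Pending::DeveloperApi) => {
                self.log.push(format!("resume request with {body}"));
            }
            None => self.log.push("unknown token ignored".to_string()),
        }
    }
}
```

Because there is no `await`, every piece of state the continuation needs has to be stored against the token before the dispatching function returns.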
--- a/docs/ADR/003-single-container-supervisord.md
+++ b/docs/ADR/003-single-container-supervisord.md
@ -0,0 +1,42 @@
+# ADR 003: Single Container with Supervisord
+
+**Status:** Accepted
+
+## Context
+
+Plano has three runtime processes:
+1. **Envoy Proxy** — the data plane with WASM filters
+2. **Brightstaff** — the Rust HTTP service for routing and orchestration
+3. **Config generator** — Python script that validates config and renders Envoy's YAML (runs at startup)
+
+The options for deployment were:
+1. **Separate containers** — each process in its own container, orchestrated by Docker Compose / K8s
+2. **Single container with process manager** — all processes in one container, managed by Supervisord
+3. **Single binary** — embed Envoy or reimplement its core functionality
+
+## Decision
+
+Run all processes in a **single container** managed by **Supervisord**. The startup sequence:
+1. Config generator validates `arch_config.yaml` and renders `envoy.yaml`
+2. Supervisord starts Brightstaff and Envoy in parallel
+3. A log tail process unifies access log output
+
+## Consequences
+
+**Enables:**
+- Simple deployment: one container, one image, `docker run` just works
+- No network latency between Envoy and Brightstaff (localhost communication)
+- Config generation happens at container startup — no external config rendering step
+- Easy development: `docker compose up` with volume mounts for hot-reload
+
+**Requires:**
+- Supervisord configuration (`config/supervisord.conf`) to manage process lifecycle
+- Health checks must account for both Envoy and Brightstaff readiness
+- Logs from all processes need unified output (handled by the tail process)
+
+**Prevents:**
+- Independent scaling of Envoy vs. Brightstaff (they scale together as one unit)
+- Kubernetes sidecar pattern (though this could be reconsidered)
+- Process-level fault isolation (though Supervisord restarts failed processes)
+
+**Trade-off:** Simplicity of deployment over horizontal scaling flexibility. For a gateway that needs to be deployed at the edge or as a sidecar, single-container simplicity is more valuable than the ability to scale components independently.
--- a/docs/ADR/004-hermesllm-pure-rust.md
+++ b/docs/ADR/004-hermesllm-pure-rust.md
@ -0,0 +1,45 @@
+# ADR 004: hermesllm as a Pure Rust Library
+
+**Status:** Accepted
+
+## Context
+
+LLM providers use different API formats (OpenAI Chat Completions, Anthropic Messages, Amazon Bedrock Converse, Gemini). The gateway needs to translate between these formats in two places:
+1. In the `llm_gateway` WASM filter (inline in Envoy)
+2. In Brightstaff (for routing decisions and response processing)
+
+The options were:
+1. Duplicate translation logic in both places
+2. Put translation logic in `common` (shared crate, but WASM-constrained)
+3. Create a separate pure Rust library with no WASM dependencies
+
+## Decision
+
+Create **`hermesllm`** as a standalone Rust library that handles all LLM protocol translation. It must never depend on `proxy-wasm` or `common`. `prompt_gateway` reaches it through `common`, while `llm_gateway` and Brightstaff also depend on it directly.
+
+## Consequences
+
+**Enables:**
+- Single source of truth for LLM protocol translation
+- Reusable outside the gateway context (could be published as an independent crate)
+- No `proxy-wasm` coupling in the library itself, although its default feature set must still compile for `wasm32-wasip1` because `common` and `llm_gateway` link it
+- Clean separation: protocol knowledge lives in `hermesllm`, gateway logic lives in filters
+
+**Requires:**
+- `hermesllm` must not import `proxy-wasm`, `common`, or any WASM-specific crate
+- Adding a new provider requires changes only in `hermesllm` (plus config in `common/configuration.rs` and `envoy.template.yaml`)
+- Types shared between `hermesllm` and the filters go through `common`'s re-exports
+
+**Prevents:**
+- Circular dependencies (hermesllm is always a leaf in the dependency graph)
+- Accidentally coupling protocol translation to WASM runtime specifics
+- Needing to maintain two separate translation implementations
+
+**Dependency direction:**
+```
+prompt_gateway → common → hermesllm
+llm_gateway    → common → hermesllm
+llm_gateway    → hermesllm (direct)
+brightstaff    → hermesllm (direct)
+brightstaff    → common (config types only, not WASM code)
+hermesllm      → (no workspace deps)
+```
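
To make "protocol translation" concrete, here is a minimal sketch of one direction, OpenAI Chat Completions to Anthropic Messages. The structs are simplified stand-ins, not `hermesllm`'s types, and `DEFAULT_MAX_TOKENS` is an illustrative value. The two differences it relies on are that the Anthropic Messages API takes the system prompt as a top-level `system` field and requires `max_tokens`.

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct Message {
    pub role: String,
    pub content: String,
}

pub struct OpenAiChatRequest {
    pub model: String,
    pub messages: Vec<Message>,
    pub max_tokens: Option<u32>,
}

#[derive(Debug, PartialEq)]
pub struct AnthropicMessagesRequest {
    pub model: String,
    pub system: Option<String>,
    pub messages: Vec<Message>,
    pub max_tokens: u32,
}

/// Assumed fallback when the client sets no limit (illustrative value).
const DEFAULT_MAX_TOKENS: u32 = 1024;

pub fn to_anthropic(req: OpenAiChatRequest) -> AnthropicMessagesRequest {
    // System messages move out of the message list into the `system` field.
    let (system, messages): (Vec<Message>, Vec<Message>) =
        req.messages.into_iter().partition(|m| m.role == "system");
    let system_text = system
        .into_iter()
        .map(|m| m.content)
        .collect::<Vec<_>>()
        .join("\n");
    AnthropicMessagesRequest {
        model: req.model,
        system: if system_text.is_empty() { None } else { Some(system_text) },
        messages,
        max_tokens: req.max_tokens.unwrap_or(DEFAULT_MAX_TOKENS),
    }
}
```

Nothing in this function touches Envoy or `proxy-wasm`, which is the property the ADR is protecting.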
--- a/docs/ADR/005-header-based-routing.md
+++ b/docs/ADR/005-header-based-routing.md
@ -0,0 +1,40 @@
+# ADR 005: Header-Based Routing Protocol
+
+**Status:** Accepted
+
+## Context
+
+Envoy needs to route requests to different upstream clusters (LLM providers, developer APIs, agents) based on runtime decisions made by WASM filters and Brightstaff. The options were:
+1. **Path-based routing** — different URL paths for different upstreams
+2. **Header-based routing** — custom headers to signal routing decisions
+3. **Dynamic cluster selection** — programmatic cluster selection in filters
+
+## Decision
+
+Use **custom `x-arch-*` headers** for all routing decisions. WASM filters and Brightstaff set headers like `x-arch-llm-provider` and `x-arch-upstream`, and Envoy's route configuration matches on these headers to select the upstream cluster.
+
+All header names are defined as constants in `common/src/consts.rs` — this is the single source of truth.
+
+## Consequences
+
+**Enables:**
+- Decoupled routing: WASM filters decide *where* to route, Envoy handles *how* to connect
+- Transparent to the client — custom headers are internal, clients see standard HTTP
+- Easy to debug: inspect headers to understand routing decisions
+- Composable: multiple filters can add/modify routing headers in the filter chain
+
+**Requires:**
+- Header names must be consistent between `consts.rs` and `envoy.template.yaml`
+- Any new routing dimension needs a new header constant + Envoy route match rule
+- Developers must grep all consumers when changing a header name
+
+**Prevents:**
+- Routing logic in Envoy's configuration alone (routing decisions are made by Rust code, not Envoy config)
+- Using Envoy's native routing features (like weighted clusters) independently — they must be combined with header matching
+
+**Key headers:**
+- `x-arch-llm-provider` — LLM provider cluster selection (Envoy route matching)
+- `x-arch-llm-provider-hint` — Provider hint from Brightstaff to llm_gateway
+- `x-arch-upstream` — Agent/API endpoint cluster selection
+- `x-arch-streaming-request` — Streaming mode signal
+- `x-arch-state` — Multi-turn conversation state (prompt_gateway internal)
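
Envoy performs the header match in its route configuration, but the decision can be modelled in a few lines. This sketch is a conceptual model only: `Route` and `select_cluster` are hypothetical, and the route list is an assumption about what the generated config contains. It relies on routes being evaluated in order with the first match winning.

```rust
use std::collections::HashMap;

/// Header name for LLM provider cluster selection (see "Key headers").
pub const ARCH_ROUTING_HEADER: &str = "x-arch-llm-provider";

/// One route rule: if `header` equals `value`, send the request to `cluster`.
pub struct Route {
    pub header: &'static str,
    pub value: &'static str,
    pub cluster: &'static str,
}

/// Returns the cluster of the first route whose header match succeeds.
pub fn select_cluster(
    routes: &[Route],
    headers: &HashMap<String, String>,
) -> Option<&'static str> {
    routes
        .iter()
        .find(|r| headers.get(r.header).map(String::as_str) == Some(r.value))
        .map(|r| r.cluster)
}
```

A request that carries no routing header matches nothing here, which corresponds to Envoy having no route for it unless a catch-all is configured.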
--- a/docs/ADR/006-config-generation-pipeline.md
+++ b/docs/ADR/006-config-generation-pipeline.md
@ -0,0 +1,48 @@
+# ADR 006: Config Generation Pipeline (Python + Jinja2)
+
+**Status:** Accepted
+
+## Context
+
+Envoy's configuration is a large YAML file that must describe all listeners, clusters, filter chains, TLS contexts, and WASM filter configs. This configuration depends on user-provided settings (which LLM providers to use, which agents to connect, which endpoints to expose).
+
+The options were:
+1. **Static Envoy config** — users edit Envoy YAML directly
+2. **Rust-based config generator** — generate Envoy config from a Rust binary
+3. **Python + Jinja2 template** — validate user config against a schema, then render Envoy config from a template
+
+## Decision
+
+Use a **Python config generator** (`cli/planoai/config_generator.py`) that:
+1. Validates user's `arch_config.yaml` against a JSON Schema (`config/arch_config_schema.yaml`)
+2. Applies transformations (legacy format conversion, cluster inference, internal model injection)
+3. Renders `config/envoy.template.yaml` (Jinja2) into the final `envoy.yaml`
+4. Produces `arch_config_rendered.yaml` for Brightstaff and WASM filter consumption
+
+This runs at container startup, before Envoy starts.
+
+## Consequences
+
+**Enables:**
+- Simple user-facing config format (`arch_config.yaml`) — users don't need to understand Envoy internals
+- JSON Schema validation catches errors before Envoy starts
+- Jinja2 templating is mature, well-understood, and powerful for generating complex YAML
+- Python CLI (`planoai`) can also handle Docker management and other tooling
+- Config validation is independently testable (`cli/test/test_config_generator.py`)
+
+**Requires:**
+- Python runtime in the Docker image (adds image size)
+- Config changes need updates in 4 places: schema, template, Python validator, Rust struct
+- Understanding of Jinja2 templating for Envoy config modifications
+- `arch_config_rendered.yaml` must be kept in sync between Python generator and Rust deserialization
+
+**Prevents:**
+- Dynamic config reloading without container restart (config is generated at startup)
+- Using Envoy's xDS protocol for dynamic configuration (could be added later)
+- Rust-only development workflow — Python is required for config generation
+
+**4-file update rule:** Every new user-facing config field requires changes to:
+1. `config/arch_config_schema.yaml` — JSON Schema definition
+2. `config/envoy.template.yaml` — Jinja2 template (if Envoy needs the value)
+3. `cli/planoai/config_generator.py` — Python validation and rendering logic
+4. `common/src/configuration.rs` — Rust `Configuration` struct (for runtime consumption)
--- a/docs/ADR/README.md
+++ b/docs/ADR/README.md
@ -0,0 +1,22 @@
+# Architecture Decision Records
+
+This directory contains Architecture Decision Records (ADRs) for the Plano project. ADRs document key architectural decisions, their context, and rationale — preventing future contributors (human or AI) from unknowingly reversing deliberate choices.
+
+## Index
+
+| ADR | Title | Status |
+|-----|-------|--------|
+| [001](001-envoy-as-data-plane.md) | Envoy as the Data Plane | Accepted |
+| [002](002-wasm-filters-over-native.md) | WASM Filters Over Native Envoy Filters | Accepted |
+| [003](003-single-container-supervisord.md) | Single Container with Supervisord | Accepted |
+| [004](004-hermesllm-pure-rust.md) | hermesllm as a Pure Rust Library | Accepted |
+| [005](005-header-based-routing.md) | Header-Based Routing Protocol | Accepted |
+| [006](006-config-generation-pipeline.md) | Config Generation Pipeline (Python + Jinja2) | Accepted |
+
+## ADR Format
+
+Each ADR follows this structure:
+- **Status**: Proposed / Accepted / Deprecated / Superseded
+- **Context**: What problem or question prompted this decision
+- **Decision**: What was decided
+- **Consequences**: Trade-offs, implications, and what this enables or prevents
--- a/docs/DATA_CONTRACTS.md
+++ b/docs/DATA_CONTRACTS.md
@ -0,0 +1,221 @@
+# Data Contracts — Inter-Component Communication
+
+This document defines the contracts between Plano's components: custom HTTP headers, internal API formats, streaming protocols, and Envoy routing conventions. Breaking any of these contracts can cause routing failures that surface no error.
+
+---
+
+## 1. Custom Header Protocol
+
+All custom headers are defined in `common/src/consts.rs`. This is the **single source of truth** — if a header name appears in `envoy.template.yaml` or Brightstaff code, it must match the constant in `consts.rs`.
+
+### Routing Headers (Envoy-critical)
+
+The first two headers are matched in Envoy's `route_config` for cluster selection; the other two carry routing context between components. Changing any of them requires updating `envoy.template.yaml`.
+
+| Header | Constant | Set By | Read By | Value Format | Purpose |
+|---|---|---|---|---|---|
+| `x-arch-llm-provider` | `ARCH_ROUTING_HEADER` | WASM filters | Envoy routes | Provider slug (e.g., `openai`, `anthropic`) | Selects the LLM provider cluster in Envoy |
+| `x-arch-upstream` | `ARCH_UPSTREAM_HOST_HEADER` | WASM filters, Brightstaff | Envoy routes | Cluster name (e.g., agent endpoint name) | Routes to a specific upstream cluster |
+| `x-arch-llm-provider-hint` | `ARCH_PROVIDER_HINT_HEADER` | Brightstaff | llm_gateway | `provider/model` (e.g., `openai/gpt-4`) | Hints which provider+model to use |
+| `x-arch-agent-listener-name` | — | Envoy (set in route config) | Brightstaff | Listener name string | Identifies which agent listener a request arrived on |
+
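
The `provider/model` value format of `x-arch-llm-provider-hint` can be parsed by splitting at the first slash. `parse_provider_hint` is a hypothetical helper written for illustration; splitting at the first slash (rather than the last) is an assumption that lets model names contain slashes themselves.

```rust
/// Splits a `provider/model` hint into its two parts.
/// Returns None when the slash is missing or either side is empty.
pub fn parse_provider_hint(hint: &str) -> Option<(&str, &str)> {
    let (provider, model) = hint.split_once('/')?;
    if provider.is_empty() || model.is_empty() {
        return None;
    }
    Some((provider, model))
}
```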
+### Internal State Headers (WASM filter internal)
+
+These headers pass state between the prompt_gateway filter's request/response phases or between prompt_gateway and the function calling service. The exception is `x-arch-llm-route`, which carries Brightstaff's route decision to llm_gateway.
+
+| Header | Constant | Set By | Read By | Value Format | Purpose |
+|---|---|---|---|---|---|
+| `x-arch-state` | `X_ARCH_STATE_HEADER` | prompt_gateway | prompt_gateway | Base64-encoded JSON (`ArchState`) | Multi-turn conversation state across filter invocations |
+| `x-arch-tool-call-message` | `X_ARCH_TOOL_CALL` | prompt_gateway | prompt_gateway | JSON string | Tool call metadata for API orchestration |
+| `x-arch-api-response-message` | `X_ARCH_API_RESPONSE` | prompt_gateway | prompt_gateway | JSON string | Developer API response data |
+| `x-arch-fc-model-response` | `X_ARCH_FC_MODEL_RESPONSE` | prompt_gateway | prompt_gateway | JSON string | Raw Arch-Function model response |
+| `x-arch-llm-route` | `LLM_ROUTE_HEADER` | Brightstaff | llm_gateway | Route name string | LLM route decision result |
+
+### Signaling Headers
+
+| Header | Constant | Set By | Read By | Purpose |
+|---|---|---|---|---|
+| `x-arch-streaming-request` | `ARCH_IS_STREAMING_HEADER` | Brightstaff | llm_gateway | Indicates the request is streaming mode |
+| `x-arch-ratelimit-selector` | `RATELIMIT_SELECTOR_HEADER_KEY` | Client / Envoy | llm_gateway | Key for per-tenant rate limit partitioning |
+
+### Standard Headers Used
+
+| Header | Constant | Purpose |
+|---|---|---|
+| `x-request-id` | `REQUEST_ID_HEADER` | Request tracing (set by Envoy or caller) |
+| `x-envoy-original-path` | `ENVOY_ORIGINAL_PATH_HEADER` | Original path before Envoy rewrites |
+| `x-envoy-max-retries` | `ENVOY_RETRY_HEADER` | Retry count for Envoy's retry policy |
+| `traceparent` | `TRACE_PARENT_HEADER` | W3C Trace Context for OpenTelemetry |
+
+---
+
+## 2. Internal Cluster Names
+
+Defined in `consts.rs` and referenced in `envoy.template.yaml`:
+
+| Constant | Value | Target | Purpose |
+|---|---|---|---|
+| `MODEL_SERVER_NAME` | `"bright_staff"` | localhost:9091 | Brightstaff service |
+| `ARCH_INTERNAL_CLUSTER_NAME` | `"arch_internal"` | localhost:11000 | Outbound API router |
+| `ARCH_FC_CLUSTER` | `"arch"` | archfc.katanemo.dev:443 | Katanemo Arch-Function model |
+
+Additional clusters generated from config:
+- `arch_prompt_gateway_listener` → localhost:10001
+- `arch_listener_llm` → localhost:12001
+- Per-provider clusters (e.g., `openai`, `anthropic`, `gemini`) from `envoy.template.yaml`
+- Per-agent/endpoint clusters from user config
+
+---
+
+## 3. Internal API Formats
+
+### Brightstaff → Envoy (LLM requests via :12001)
+
+Brightstaff sends OpenAI-compatible `ChatCompletionsRequest` JSON to `localhost:12001` with:
+- `x-arch-llm-provider-hint: <provider>/<model>` to select the provider
+- `x-arch-streaming-request: true/false` to indicate streaming
+- Standard `Content-Type: application/json`
+- `traceparent` for distributed tracing
+
+The `llm_gateway` WASM filter at :12001 transforms the request to the target provider's format.
+
+### Brightstaff → Envoy (Agent/API requests via :11000)
+
+Brightstaff sends requests to `localhost:11000` with:
+- `x-arch-upstream: <cluster_name>` to route to the target agent/API
+- `x-envoy-max-retries: 3` for resilience
+
+**MCP Agent Protocol:**
+```
+POST /  (with x-arch-upstream)
+Content-Type: application/json
+
+# Step 1: Initialize
+{"jsonrpc":"2.0","method":"initialize","id":"<uuid>","params":{...}}
+
+# Step 2: Initialized notification
+{"jsonrpc":"2.0","method":"notifications/initialized"}
+
+# Step 3: Tool call
+{"jsonrpc":"2.0","method":"tools/call","id":"<uuid>","params":{"name":"<tool>","arguments":{...}}}
+```
+
+**HTTP Agent Protocol:**
+```
+POST /  (with x-arch-upstream)
+Content-Type: application/json
+
+[{"role":"user","content":"..."},{"role":"assistant","content":"..."}]
+```
+Response: Array of messages.
+
+### prompt_gateway → Arch-Function (/function_calling)
+
+```
+POST /function_calling
+Content-Type: application/json
+
+{
+  "messages": [...],
+  "tools": [...],
+  "model": "Arch-Function",
+  "stream": false,
+  "metadata": {"raw_response": true, "logprobs": true}
+}
+```
+
+Response contains `tool_calls`, `response`, or `clarification` in the assistant message content (JSON string).
+
+---
+
+## 4. Streaming Protocol
+
+### SSE (Server-Sent Events) — Standard LLM Streaming
+
+All streaming LLM responses use SSE format:
+```
+data: {"id":"...","choices":[...]}\n\n
+data: {"id":"...","choices":[...]}\n\n
+data: [DONE]\n\n
+```
+
+**Important:** SSE events can be split across HTTP chunks. The `llm_gateway` uses `SseStreamBuffer` and `SseChunkProcessor` (from `hermesllm`) to reassemble events across chunk boundaries before processing.
+
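
Reassembly across chunk boundaries comes down to buffering until a blank line terminates an event. The sketch below shows the idea with a hypothetical `SseBuffer`; it is not the `SseStreamBuffer` from `hermesllm`. It assumes chunks arrive as valid UTF-8 strings and treats each `data:` line as its own payload.

```rust
#[derive(Default)]
pub struct SseBuffer {
    pending: String,
}

impl SseBuffer {
    /// Appends a chunk and returns the `data:` payloads of every event
    /// that is now complete. Incomplete trailing data stays buffered.
    pub fn push(&mut self, chunk: &str) -> Vec<String> {
        self.pending.push_str(chunk);
        let mut events = Vec::new();
        // An event ends at the first blank line ("\n\n").
        while let Some(end) = self.pending.find("\n\n") {
            let raw: String = self.pending.drain(..end + 2).collect();
            for line in raw.lines() {
                if let Some(data) = line.strip_prefix("data:") {
                    events.push(data.trim_start().to_string());
                }
            }
        }
        events
    }
}
```

A production buffer would operate on bytes, since a network chunk can end in the middle of a multi-byte UTF-8 sequence.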
+### Bedrock Binary Streaming
+
+Amazon Bedrock uses AWS Event Stream binary protocol instead of SSE. The `BedrockBinaryFrameDecoder` in `hermesllm` handles decoding.
+
+### Brightstaff Streaming
+
+Brightstaff uses `tokio::sync::mpsc` channels to stream responses:
+1. Spawns a background task to read from upstream (via `reqwest`)
+2. Parses SSE events, optionally transforms them
+3. Sends chunks through the mpsc channel
+4. Axum's `StreamBody` delivers to the client
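The producer/consumer shape of these four steps can be sketched with Python's `asyncio.Queue` standing in for `tokio::sync::mpsc`. This is an analogy in a different runtime, not Brightstaff's implementation: the function names, the queue size of 16, and the `None` end-of-stream sentinel are all assumptions.

```python
import asyncio


async def read_upstream(upstream_chunks, queue):
    """Background task: forward each upstream chunk into the queue."""
    for chunk in upstream_chunks:
        # put() waits while the queue is full, which gives backpressure.
        await queue.put(chunk)
    await queue.put(None)  # sentinel: upstream finished


async def stream_to_client(upstream_chunks):
    """Consume chunks as they arrive, as a streaming response body would."""
    queue = asyncio.Queue(maxsize=16)  # bounded size is an assumption
    producer = asyncio.create_task(read_upstream(upstream_chunks, queue))
    delivered = []
    while (chunk := await queue.get()) is not None:
        delivered.append(chunk)
    await producer
    return delivered


if __name__ == "__main__":
    chunks = ['data: {"id":"1"}\n\n', "data: [DONE]\n\n"]
    print(asyncio.run(stream_to_client(chunks)))
```

Decoupling the reader from the writer this way lets the client start receiving data before the upstream response has finished.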
+
+---
+
+## 5. Configuration Injection
+
+### WASM Filter Configuration
+
+Envoy injects config into WASM filters via the `configuration` field in the filter definition:
+
+- **prompt_gateway** receives: `prompt_targets`, `prompt_guards`, `system_prompt`, `endpoints`, `overrides`, `tracing`
+- **llm_gateway** receives: `model_providers`, `ratelimits`, `overrides`
+
+Both filters receive their configuration as a YAML string, which each parses with `serde_yaml` in its `RootContext::on_configure()`.
+
+### Brightstaff Configuration
+
+Brightstaff reads `arch_config_rendered.yaml` (path from `ARCH_CONFIG_PATH_RENDERED` env var), which contains the full rendered config including `model_providers`, `agents`, `filters`, `listeners`, `routing`, `model_aliases`, `state_storage`, `tracing`, and `overrides`.
+
+---
+
+## 6. Timeouts
+
+Application-level request timeouts are defined in `common/src/consts.rs`:
+
+| Constant | Value | Used For |
+|---|---|---|
+| `ARCH_FC_REQUEST_TIMEOUT_MS` | 30,000 ms | Arch-Function model calls from prompt_gateway |
+| `DEFAULT_TARGET_REQUEST_TIMEOUT_MS` | 30,000 ms | Default prompt target endpoint calls |
+| `API_REQUEST_TIMEOUT_MS` | 30,000 ms | Developer API calls from prompt_gateway |
+| `MODEL_SERVER_REQUEST_TIMEOUT_MS` | 30,000 ms | Model server calls |
+
+Envoy also enforces its own route-level timeouts configured in `envoy.template.yaml` (default 300s for LLM routes).
+
+---
+
+## 7. Error Response Format
+
+All error responses from Brightstaff follow this format:
+
+```json
+{
+  "error": {
+    "message": "Human-readable error description",
+    "type": "error_type",
+    "code": 400
+  }
+}
+```
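A small sketch of producing and consuming this envelope follows. The helper names and the `invalid_request` type value are hypothetical; only the three field names come from the format above.

```python
import json


def error_body(message, error_type, code):
    """Serialize an error in the envelope shown above."""
    return json.dumps(
        {"error": {"message": message, "type": error_type, "code": code}}
    )


def describe_error(raw_body):
    """Client side: turn an error body into a one-line summary."""
    error = json.loads(raw_body)["error"]
    return f'{error["code"]} {error["type"]}: {error["message"]}'


if __name__ == "__main__":
    # "invalid_request" is an illustrative type, not a documented value.
    body = error_body("Missing messages field", "invalid_request", 400)
    print(describe_error(body))  # prints "400 invalid_request: Missing messages field"
```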
+
+The `llm_gateway` WASM filter returns errors as:
+- HTTP 429 for rate limit exceeded
+- HTTP 503 for provider unavailable
+- The original upstream error status code for pass-through errors
+
+---
+
+## 8. Contract Change Checklist
+
+When modifying any data contract:
+
+- [ ] Update the constant in `common/src/consts.rs`
+- [ ] Grep the entire codebase for the old value (`grep -r "old_value" crates/`)
+- [ ] Update `config/envoy.template.yaml` if the contract is a header used in routing
+- [ ] Update `cli/planoai/config_generator.py` if the config schema changed
+- [ ] Update `config/arch_config_schema.yaml` if user-facing config changed
+- [ ] Run `cargo test --workspace` to catch compile/test failures
+- [ ] Run `cd cli && python -m pytest test/` for config generation tests