From 3f8aa14e4c521a545d156c247faba2d7a8c866b6 Mon Sep 17 00:00:00 2001 From: Adil Hafeez Date: Mon, 9 Feb 2026 23:34:18 -0800 Subject: [PATCH] create md files for coding agents and for humans --- .github/copilot-instructions.md | 130 ++++++ AGENTS.md | 138 +++++++ architecture.md | 411 +++++++++++++++++++ crates/README.md | 233 +++++++++++ docs/ADR/001-envoy-as-data-plane.md | 35 ++ docs/ADR/002-wasm-filters-over-native.md | 42 ++ docs/ADR/003-single-container-supervisord.md | 42 ++ docs/ADR/004-hermesllm-pure-rust.md | 45 ++ docs/ADR/005-header-based-routing.md | 40 ++ docs/ADR/006-config-generation-pipeline.md | 48 +++ docs/ADR/README.md | 22 + docs/DATA_CONTRACTS.md | 221 ++++++++++ 12 files changed, 1407 insertions(+) create mode 100644 .github/copilot-instructions.md create mode 100644 AGENTS.md create mode 100644 architecture.md create mode 100644 crates/README.md create mode 100644 docs/ADR/001-envoy-as-data-plane.md create mode 100644 docs/ADR/002-wasm-filters-over-native.md create mode 100644 docs/ADR/003-single-container-supervisord.md create mode 100644 docs/ADR/004-hermesllm-pure-rust.md create mode 100644 docs/ADR/005-header-based-routing.md create mode 100644 docs/ADR/006-config-generation-pipeline.md create mode 100644 docs/ADR/README.md create mode 100644 docs/DATA_CONTRACTS.md diff --git a/.github/copilot-instructions.md b/.github/copilot-instructions.md new file mode 100644 index 00000000..b7a6b98d --- /dev/null +++ b/.github/copilot-instructions.md @@ -0,0 +1,130 @@ +# Copilot Instructions for Plano (ArchGW) + +## System Identity + +Plano is an AI-native gateway built on Envoy Proxy. It uses WASM filters for inline request processing and a native Rust service (Brightstaff) for orchestration. All components run in a single container managed by Supervisord. + +## Critical Architectural Rules + +### 1. Envoy Is the Data Plane — Never Bypass It + +All external traffic MUST flow through Envoy. 
Brightstaff NEVER makes direct outbound HTTP calls to LLM providers or developer APIs. It always routes through Envoy listeners: +- LLM requests → `localhost:12001` (egress LLM listener with `llm_gateway.wasm`) +- Agent/API requests → `localhost:11000` (outbound API listener) + +**Do not** add direct HTTP calls from Brightstaff to external services. Use Envoy's cluster routing via `x-arch-*` headers instead. + +### 2. WASM Crate Constraints + +`prompt_gateway` and `llm_gateway` compile to `wasm32-wasip1`. This means: +- **No `tokio`, no `async/await`, no threads, no filesystem, no network sockets** +- All I/O goes through `proxy-wasm` SDK's `dispatch_http_call` (async callback-based) +- No crate with `std` networking features — use `governor` with `no_std`, etc. +- The `crate-type` is `["cdylib"]` — these are shared libraries, not binaries +- Test with `cargo test` (native), but build with `--target wasm32-wasip1` + +**Do not** add dependencies to WASM crates that require `std::net`, `tokio`, `reqwest`, `hyper`, or any async runtime. + +### 3. Crate Dependency Direction + +``` +prompt_gateway → common +llm_gateway → common, hermesllm +common → hermesllm +brightstaff → common (non-WASM parts), hermesllm +hermesllm → (standalone, no proxy-wasm) +``` + +- `hermesllm` must NEVER depend on `proxy-wasm` or `common` — it's a pure Rust library usable outside WASM +- `common` provides the `proxy-wasm` abstractions — WASM crates use `common`, not raw `proxy-wasm` directly (except for the SDK traits) +- `brightstaff` uses `hermesllm` directly for LLM types but does NOT use `common`'s WASM-specific code (like `proxy-wasm` Client trait) + +### 4. Header-Based Routing Protocol + +Envoy routes requests using custom headers. 
These are the canonical header names defined in `common/src/consts.rs`: + +| Header | Purpose | Do NOT change | +|--------|---------|---------------| +| `x-arch-llm-provider` | Envoy route matching for LLM provider cluster | Used in envoy.template.yaml | +| `x-arch-llm-provider-hint` | Brightstaff → llm_gateway provider selection | Both sides must agree | +| `x-arch-upstream` | Targets a specific agent/API cluster in Envoy | Used in envoy.template.yaml | +| `x-arch-streaming-request` | Signals streaming mode | llm_gateway reads this | +| `x-arch-state` | Multi-turn conversation state in prompt_gateway | Serialized JSON | +| `x-arch-tool-call-message` | Tool call metadata | prompt_gateway internal | +| `x-arch-api-response-message` | Developer API response | prompt_gateway internal | +| `x-arch-agent-listener-name` | Identifies agent listener | Set by Envoy, read by Brightstaff | +| `x-arch-llm-route` | LLM route decision result | Brightstaff ↔ llm_gateway | + +Changing header names requires updating: `consts.rs`, `envoy.template.yaml`, and all consumers. + +### 5. Build System + +```bash +# WASM filters — must use wasm32-wasip1 target +cargo build --release --target wasm32-wasip1 -p prompt_gateway -p llm_gateway + +# Brightstaff — native binary +cargo build --release -p brightstaff +``` + +The workspace uses Rust edition 2021 and resolver "2". The workspace root is `crates/Cargo.toml`. + +### 6. Configuration Flow + +User config (`arch_config.yaml`) is validated and rendered by `cli/planoai/config_generator.py`: +- Schema: `config/arch_config_schema.yaml` +- Template: `config/envoy.template.yaml` (Jinja2) +- Output: `envoy.yaml` (for Envoy) + `arch_config_rendered.yaml` (for Brightstaff + WASM filter configs) + +When adding new config fields: update the schema, the template (if Envoy-relevant), the Python generator, AND the Rust `Configuration` struct in `common/src/configuration.rs`. + +### 7. 
Internal Model Names + +These are reserved model names used internally — do not conflict with them: +- `Arch-Function` — intent classification / function calling +- `Arch-Router` — (used as route name prefix, not direct model name) +- `Plano-Orchestrator` — agent selection orchestrator + +### 8. API Compatibility + +Brightstaff exposes OpenAI-compatible endpoints: +- `/v1/chat/completions` — Chat Completions API +- `/v1/messages` — Anthropic Messages API compatible +- `/v1/responses` — OpenAI Responses API with state management +- `/function_calling` — Internal Arch-Function endpoint + +The `/agents/` prefix variants mirror these for agent orchestration. + +Do NOT change these path structures without updating `consts.rs`, Brightstaff router, and `envoy.template.yaml`. + +### 9. Streaming + +- LLM responses use SSE (Server-Sent Events) format: `data: {json}\n\n` +- The `llm_gateway` WASM filter handles SSE stream reassembly across chunk boundaries via `SseStreamBuffer` +- Brightstaff uses `mpsc` channels for streaming responses back to clients +- Bedrock uses AWS Event Stream binary protocol — decoded by `hermesllm` + +### 10. 
Testing Conventions + +- WASM crates: unit tests run natively (`cargo test`), NOT under WASM runtime +- Brightstaff: unit tests with `mockito` for HTTP mocking +- E2E tests: separate `tests/` directory, run via GitHub Actions workflows +- Config validation tests: `cli/test/test_config_generator.py` + +## File Layout Reference + +``` +crates/ + Cargo.toml # Workspace root + brightstaff/ # Native Rust HTTP server (Axum) + common/ # Shared types, config, HTTP, rate limiting + hermesllm/ # LLM protocol translation (pure Rust) + llm_gateway/ # WASM filter: provider routing, auth, rate limits + prompt_gateway/ # WASM filter: intent matching, guardrails +config/ + arch_config_schema.yaml # User config JSON schema + envoy.template.yaml # Jinja2 template → envoy.yaml + docker-compose.dev.yaml # Dev environment +cli/ + planoai/ # Python CLI (config generator, Docker management) +``` diff --git a/AGENTS.md b/AGENTS.md new file mode 100644 index 00000000..de50b4dc --- /dev/null +++ b/AGENTS.md @@ -0,0 +1,138 @@ +# AGENTS.md — Coding Agent Reference + +> This file is optimized for AI coding agents. It contains hard constraints, ownership rules, and patterns that must not be violated. For human-readable architecture, see `architecture.md`. + +--- + +## System Overview (30-second version) + +Plano is an AI gateway. Client traffic enters **Envoy Proxy**, passes through two **WASM filters** (`prompt_gateway` → `llm_gateway`), and reaches **LLM providers**. A native Rust service (**Brightstaff**) handles intelligent routing and agent orchestration, but always communicates with the outside world **through Envoy**, never directly. 
+ +--- + +## Hard Rules — Never Violate These + +### Rule 1: All external I/O goes through Envoy +- Brightstaff sends LLM requests to `localhost:12001` (Envoy egress listener) +- Brightstaff sends agent/API requests to `localhost:11000` (Envoy outbound listener) +- **NEVER** add `reqwest`/`hyper` calls from Brightstaff directly to external hosts +- Routing is controlled by setting `x-arch-llm-provider-hint` or `x-arch-upstream` headers + +### Rule 2: WASM crates cannot use async runtimes +- `prompt_gateway` and `llm_gateway` compile to `wasm32-wasip1` +- **Forbidden in WASM crates:** `tokio`, `async-std`, `reqwest`, `hyper`, `std::net`, `std::fs`, `std::thread` +- All I/O uses `proxy-wasm` SDK's `dispatch_http_call` (callback-based, not async/await) +- `governor` must use `no_std` feature; `rand` is fine + +### Rule 3: Dependency direction is one-way +``` +prompt_gateway ──► common ──► hermesllm +llm_gateway ──► common ──► hermesllm + llm_gateway ──► hermesllm (direct) +brightstaff ──► hermesllm (direct, no common WASM code) +``` +- `hermesllm` has **zero** dependencies on `proxy-wasm` or `common` +- `common` has **zero** dependencies on `brightstaff` +- WASM crates have **zero** dependencies on `brightstaff` + +### Rule 4: Header names are canonical constants +All `x-arch-*` headers are defined in `common/src/consts.rs`. Changing a header name requires updating: +1. `common/src/consts.rs` +2. `config/envoy.template.yaml` +3. Every Rust consumer (grep for the old constant name) + +### Rule 5: Config changes require a 4-file update +Adding a new user-facing config field: +1. `config/arch_config_schema.yaml` — JSON schema +2. `config/envoy.template.yaml` — Jinja2 template (if Envoy needs it) +3. `cli/planoai/config_generator.py` — Python validation/rendering +4. 
`common/src/configuration.rs` — Rust struct + +### Rule 6: API paths are load-bearing +These paths appear in `consts.rs`, Brightstaff's Axum router, and `envoy.template.yaml`: +- `/v1/chat/completions`, `/v1/messages`, `/v1/responses` +- `/agents/v1/chat/completions`, `/agents/v1/messages`, `/agents/v1/responses` +- `/function_calling`, `/v1/models`, `/healthz` + +Changing them breaks routing. Update all three locations simultaneously. + +### Rule 7: Reserved model names +- `Arch-Function` — used for intent classification / function calling +- `Plano-Orchestrator` — used for agent selection +- Any model prefixed with `Arch` is treated as internal + +--- + +## Crate Ownership Map + +| Crate | Type | Target | Owner of | +|---|---|---|---| +| `brightstaff` | Binary (Axum) | Native | LLM routing, agent orchestration, state management, observability | +| `prompt_gateway` | cdylib (WASM) | wasm32-wasip1 | Intent matching, prompt guards, function calling, API orchestration | +| `llm_gateway` | cdylib (WASM) | wasm32-wasip1 | Provider routing, auth injection, rate limiting, request/response translation | +| `common` | Library | Both | Config types, HTTP client trait, constants, rate limiting, tokenization, shared OpenAI types | +| `hermesllm` | Library | Native | LLM protocol translation (OpenAI ↔ Anthropic ↔ Bedrock ↔ Gemini), SSE parsing, provider model catalog | + +--- + +## Where to Put New Code + +| You want to... | Put it in... 
| Why | +|---|---|---| +| Add a new LLM provider | `hermesllm` (protocol), `common/configuration.rs` (config type), `config/arch_config_schema.yaml`, `config/envoy.template.yaml` (cluster) | Provider translation is hermesllm's job | +| Add a new header for inter-component communication | `common/src/consts.rs` + `config/envoy.template.yaml` | Canonical source for all header names | +| Add rate limiting logic | `common/src/ratelimit.rs` | Shared between WASM filters | +| Add a new API endpoint to Brightstaff | `brightstaff/src/handlers/` + `brightstaff/src/main.rs` (router) | Axum handler + route registration | +| Add prompt guardrail logic | `prompt_gateway/src/stream_context.rs` or `prompt_gateway/src/http_context.rs` | Runs inline in Envoy | +| Add request/response transformation for a provider | `hermesllm/src/transforms/` | Pure Rust, no WASM dependency | +| Add config validation | `cli/planoai/config_generator.py` + `config/arch_config_schema.yaml` | Python validates before Envoy starts | +| Add a new metric | `common/src/stats.rs` (WASM) or `brightstaff/src/tracing/` (native) | Different metric systems | + +--- + +## Build & Test Quick Reference + +```bash +# Full build (WASM + native) +cd crates && ./build.sh + +# WASM filters only +cargo build --release --target wasm32-wasip1 -p prompt_gateway -p llm_gateway + +# Brightstaff only +cargo build --release -p brightstaff + +# Run all Rust tests (native) +cargo test --workspace + +# Run config generator tests +cd cli && python -m pytest test/ + +# Dev environment (Docker Compose) +cd config && docker compose -f docker-compose.dev.yaml up +``` + +--- + +## Envoy Listener Map (for routing decisions) + +``` +:10000 (ingress) → passthrough to :10001 +:10001 (prompt+llm) → prompt_gateway.wasm → llm_gateway.wasm → LLM provider +:11000 (outbound API) → developer APIs & agents (by x-arch-upstream header) +:agent_port (per-config) → brightstaff :9091 /agents/... 
+:12000 (LLM egress) → brightstaff :9091 (routing decision) +:12001 (LLM egress final) → llm_gateway.wasm → LLM provider +``` + +--- + +## Common Mistakes to Avoid + +1. **Adding `tokio` to a WASM crate's Cargo.toml** — Will fail to compile for wasm32-wasip1 +2. **Making Brightstaff call OpenAI directly** — Must go through Envoy at localhost:12001 +3. **Adding a config field only in Rust** — Schema, Python generator, and template also need updates +4. **Changing a header name in one place** — Must grep and update consts.rs, envoy.template.yaml, and all consumers +5. **Adding `hermesllm` dependency on `proxy-wasm`** — hermesllm must stay pure Rust +6. **Creating a new Envoy cluster without updating the template** — Envoy won't know about it +7. **Forgetting `no_std` feature flag on `governor` in WASM crates** — std governor uses threads diff --git a/architecture.md b/architecture.md new file mode 100644 index 00000000..5f0c2807 --- /dev/null +++ b/architecture.md @@ -0,0 +1,411 @@ +# Plano (ArchGW) — High-Level Architecture + +## Overview + +Plano is an AI-native gateway built on **Envoy Proxy**, extended with custom **WebAssembly (WASM) filters** and a native Rust service called **Brightstaff**. It acts as an intelligent intermediary between client applications, AI agents, and LLM providers — handling intent-based routing, prompt guardrails, function calling, agent orchestration, rate limiting, and multi-provider LLM translation. 
+ +``` +┌─────────────────────────────────────────────────────────────────────────────┐ +│ Plano Gateway │ +│ │ +│ ┌──────────────────────────────────────────────────────────────────────┐ │ +│ │ Envoy Proxy (L7) │ │ +│ │ │ │ +│ │ ┌──────────────────┐ ┌──────────────────┐ │ │ +│ │ │ prompt_gateway │──────▶│ llm_gateway │ │ │ +│ │ │ (WASM) │ │ (WASM) │ │ │ +│ │ │ │ │ │ │ │ +│ │ │ • Intent matching│ │ • Provider routing│ │ │ +│ │ │ • Guardrails │ │ • Auth injection │ │ │ +│ │ │ • Function call │ │ • Rate limiting │ │ │ +│ │ │ • Prompt targets │ │ • API translation │ │ │ +│ │ └──────────────────┘ └────────┬─────────┘ │ │ +│ │ │ │ │ +│ └───────────────────────────────────────┼──────────────────────────────┘ │ +│ │ │ +│ ┌───────────────────────────────────────┼──────────────────────────────┐ │ +│ │ Brightstaff (Rust HTTP Server :9091) │ │ +│ │ │ │ +│ │ • LLM request routing (Arch-Router model) │ │ +│ │ • Agent orchestration (Plano-Orchestrator model) │ │ +│ │ • Conversation state management (memory / PostgreSQL) │ │ +│ │ • Function calling handler (Arch-Function model) │ │ +│ │ • Observability & signal analysis │ │ +│ └──────────────────────────────────────────────────────────────────────┘ │ +│ │ +└─────────────────────────────────────────────────────────────────────────────┘ + │ │ │ + ▼ ▼ ▼ + ┌──────────┐ ┌──────────────┐ ┌──────────────┐ + │ Agents │ │ Developer │ │ LLM Providers│ + │ (MCP/HTTP)│ │ APIs │ │ (OpenAI, etc)│ + └──────────┘ └──────────────┘ └──────────────┘ +``` + +--- + +## The Role of Envoy + +Envoy is the **data plane** of Plano. All client traffic — both inbound prompts and outbound LLM calls — flows through Envoy. 
It provides: + +- **L7 HTTP routing** based on paths and custom headers +- **WASM filter execution** for inline request/response transformation +- **Connection pooling and TLS** to upstream LLM providers +- **Retry policies** for resilience +- **Compression/decompression** for LLM streaming responses + +### Envoy Listeners + +Envoy defines **six listener types**, each serving a distinct role in the request flow: + +| Listener | Port | Direction | Purpose | +|---|---|---|---| +| `ingress_traffic` | 10000 (configurable) | Inbound | Client-facing entry point. Forwards all traffic to the prompt gateway listener. | +| `ingress_traffic_prompt` | 10001 | Inbound | **Core processing listener.** Runs both WASM filters (`prompt_gateway` → `llm_gateway`). Routes to LLM providers by `x-arch-llm-provider` header. | +| `outbound_api_traffic` | 11000 | Internal | Routes to upstream developer APIs and agents using `x-arch-upstream` header. No WASM filters. | +| Agent listeners | Per-config | Inbound | One per agent listener in config. Routes to Brightstaff with `/agents/` path prefix. | +| `egress_traffic` | 12000 (configurable) | Outbound | LLM gateway entry for agents/services reaching LLMs. Routes to Brightstaff for routing decisions. | +| `egress_traffic_llm` | 12001 | Outbound | **Final outbound LLM listener.** Runs `llm_gateway.wasm` for auth injection, provider translation, and rate limiting before reaching the actual LLM provider. | + +### Envoy Clusters + +Envoy manages connections to all upstream services: + +**LLM Provider Clusters** — Pre-configured TLS clusters for: OpenAI, Anthropic (Claude), Groq, Mistral, DeepSeek, Gemini, xAI, MoonshotAI, Zhipu, Together AI, and Katanemo's hosted Arch models. Custom-URL providers (e.g., Azure OpenAI, Ollama) are dynamically added from config. 
+ +**Internal Clusters:** + +| Cluster | Target | Purpose | +|---|---|---| +| `bright_staff` | localhost:9091 | The Brightstaff Rust service | +| `arch_prompt_gateway_listener` | localhost:10001 | Internal forwarding from ingress | +| `arch_listener_llm` | localhost:12001 | Internal forwarding for LLM egress | +| `arch_internal` | localhost:11000 | Outbound API router | + +**Dynamic Clusters** — Generated from `endpoints` and `agents` config sections (developer APIs, agent services). + +### Custom Headers Used for Routing + +| Header | Set By | Used By | Purpose | +|---|---|---|---| +| `x-arch-llm-provider` | WASM filters | Envoy routes | Selects the LLM provider cluster | +| `x-arch-llm-provider-hint` | Brightstaff | llm_gateway | Hints which provider/model to use | +| `x-arch-upstream` / `x-arch-upstream-host` | WASM filters / Brightstaff | Envoy routes | Targets a specific agent or API endpoint | +| `x-arch-is-streaming` | Brightstaff | llm_gateway | Indicates streaming mode | +| `x-arch-state` | prompt_gateway | prompt_gateway | Carries multi-turn conversation state | +| `x-arch-tool-call` | prompt_gateway | prompt_gateway | Carries tool call metadata | +| `x-arch-api-response` | prompt_gateway | prompt_gateway | Carries developer API response data | +| `x-arch-agent-listener-name` | Envoy | Brightstaff | Identifies which agent listener a request arrived on | + +--- + +## Request Flows + +### Flow 1: Direct LLM Chat (`POST /v1/chat/completions`) + +This is the standard path for client-to-LLM requests with optional intent matching and routing. + +``` +Client + │ + ▼ +[Envoy :10000 — ingress_traffic] + │ (simple passthrough) + ▼ +[Envoy :10001 — ingress_traffic_prompt] + │ + ├── prompt_gateway.wasm + │ 1. Parse ChatCompletions request + │ 2. Convert prompt_targets → tool definitions + │ 3. Dispatch to Arch-Function model at /function_calling + │ 4. 
If intent matched: + │ → Call developer API endpoint via :11000 + │ → Augment prompt with API response context + │ 5. If no intent matched: + │ → Prepend system prompt, forward to LLM + │ + ├── llm_gateway.wasm + │ 1. Select LLM provider (from header hint or default) + │ 2. Enforce rate limits (token-based via tiktoken) + │ 3. Inject auth credentials (Bearer / x-api-key) + │ 4. Transform request format (OpenAI ↔ Anthropic ↔ Bedrock) + │ 5. Rewrite upstream path for target provider + │ + ▼ +LLM Provider (OpenAI, Anthropic, Gemini, etc.) + │ + ▼ +(Response flows back through llm_gateway for format translation) + │ + ▼ +Client +``` + +### Flow 2: Brightstaff LLM Routing (`POST /v1/chat/completions` via egress) + +When requests reach Brightstaff (directly or via agent listeners), it performs intelligent model routing. + +``` +Client / Agent + │ + ▼ +[Brightstaff :9091] + │ + ├── Resolve model aliases + ├── Validate model exists in configured providers + ├── Retrieve conversation state (if using Responses API) + │ + ├── Call Arch-Router model ──► [Envoy :12001] + │ (determines best model/provider for the request ──► LLM Provider + │ based on routing_preferences in config) + │ + ├── Forward actual request ──► [Envoy :12001] + │ (with x-arch-llm-provider-hint header) ──► LLM Provider + │ + ▼ +[Stream response back with metrics, signal analysis, state capture] + │ + ▼ +Client / Agent +``` + +### Flow 3: Agent Orchestration (`POST /agents/v1/chat/completions`) + +The agentic flow where Brightstaff selects and chains agents based on user intent. + +``` +Client + │ + ▼ +[Envoy — Agent Listener :configurable] + │ (path rewrite: /agents/...) 
+ ▼ +[Brightstaff :9091] + │ + ├── Identify listener from x-arch-agent-listener-name + ├── Find configured agents for this listener + │ + ├── If multiple agents: + │ Call Plano-Orchestrator model ──► [Envoy :12001] ──► LLM + │ (selects which agents to run and in what order) + │ + ├── For each selected agent: + │ │ + │ ├── Run filter chain (pre-processing) + │ │ └── [Envoy :11000] ──► Filter Service (MCP/HTTP) + │ │ + │ ├── Invoke agent + │ │ └── [Envoy :11000] ──► Agent Service (MCP/HTTP) + │ │ + │ ├── If intermediate agent: + │ │ Collect full response → feed as input to next agent + │ │ + │ └── If final agent: + │ Stream response directly to client + │ + ▼ +Client +``` + +--- + +## Brightstaff Service + +Brightstaff is a native Rust HTTP server (`0.0.0.0:9091`) built with Axum. It is the **control plane brain** of Plano — while Envoy handles the data plane (proxying, filtering), Brightstaff handles the intelligent decision-making. + +### Endpoints + +| Method | Path | Handler | Purpose | +|---|---|---|---| +| `POST` | `/v1/chat/completions` | `llm_chat` | LLM passthrough with model routing | +| `POST` | `/v1/messages` | `llm_chat` | Anthropic Messages API compat | +| `POST` | `/v1/responses` | `llm_chat` | OpenAI Responses API with state | +| `POST` | `/agents/v1/chat/completions` | `agent_chat` | Agent orchestration pipeline | +| `POST` | `/agents/v1/messages` | `agent_chat` | Agent orchestration (Messages) | +| `POST` | `/agents/v1/responses` | `agent_chat` | Agent orchestration (Responses) | +| `POST` | `/function_calling` | `function_calling_chat_handler` | Arch-Function tool calling | +| `GET` | `/v1/models` | `list_models` | List configured LLM models | + +### Core Components + +#### RouterService (LLM Routing) +Uses the **Arch-Router** model — a specialized LLM that determines which provider/model best matches a user's request based on `routing_preferences` defined in config. 
Constructs a system prompt describing available routes, sends the conversation, and parses a `{"route": "route_name"}` response. + +#### OrchestratorService (Agent Selection) +Uses the **Plano-Orchestrator** model to determine which agent(s) should handle a request when multiple agents are available on a listener. Returns an ordered list of agents: `{"route": ["agent1", "agent2"]}`. + +#### PipelineProcessor (Agent Execution) +Manages the sequential execution of agent filter chains and agent invocations: +- **MCP agents**: JSON-RPC 2.0 protocol over SSE transport (`initialize` → `notifications/initialized` → `tools/call`) +- **HTTP agents**: Direct POST with message array +- Routes through Envoy at `:11000` using `x-arch-upstream-host` header + +#### Function Calling Handler +Specialized handler for the **Arch-Function** model: +- Converts OpenAI tool definitions into prompts +- Parses structured JSON responses (tool_calls, clarifications) +- Includes **hallucination detection** using entropy/varentropy/probability thresholds from logprobs + +#### State Management +Manages conversation state for the OpenAI Responses API (`v1/responses`): +- **Memory backend** — `HashMap` behind `Arc` for single-instance dev +- **PostgreSQL backend** — Persistent storage with upsert semantics +- `ResponsesStateProcessor` intercepts streaming responses to capture `response_id` and output items, storing them asynchronously for future conversation chaining via `previous_response_id` + +#### Signal Analysis (Observability) +Analyzes conversation patterns for interaction quality: +- Frustration, repetition/looping, escalation requests, positive feedback, repair patterns +- Quality graded as Good / Fair / Poor / Severe +- Concerning signals flag spans with indicators for monitoring + +--- + +## Rust Crate Architecture + +``` +┌─────────────────────────────────────────────────────────────┐ +│ brightstaff (binary) │ +│ │ +│ Native Rust HTTP server — routing, orchestration, state │ +│ 
Depends on: hermesllm, common (non-WASM parts) │ +└─────────────────────────────────────────────────────────────┘ + +┌──────────────────────┐ ┌──────────────────────┐ +│ prompt_gateway │ │ llm_gateway │ +│ (WASM) │ │ (WASM) │ +│ │ │ │ +│ Intent matching │ │ Provider routing │ +│ Prompt guards │ │ Auth injection │ +│ Function calling │ │ Rate limiting │ +│ API orchestration │ │ Request/Response │ +│ │ │ format translation │ +├──────────────────────┤ ├───────────────────────┤ +│ depends on: common │ │ depends on: common, │ +│ │ │ hermesllm │ +└──────────┬───────────┘ └──────────┬────────────┘ + │ │ + ▼ ▼ +┌──────────────────────────────────────────────────────────────┐ +│ common (lib) │ +│ │ +│ Configuration types, LlmProviders, HTTP client trait, │ +│ rate limiting (governor), tokenization (tiktoken), │ +│ OpenAI API types, routing, metrics, tracing, constants │ +│ Depends on: hermesllm │ +└─────────────────────────────┬───────────────────────────────┘ + │ + ▼ +┌──────────────────────────────────────────────────────────────┐ +│ hermesllm (lib) │ +│ │ +│ LLM protocol abstraction — cross-provider request/response │ +│ translation (OpenAI ↔ Anthropic ↔ Bedrock ↔ Gemini) │ +│ SSE stream parsing, provider model catalog, endpoint │ +│ mapping. No proxy-wasm dependency (pure Rust). │ +└──────────────────────────────────────────────────────────────┘ +``` + +### WASM Compilation + +Both `prompt_gateway` and `llm_gateway` compile to `cdylib` targets for `wasm32-wasip1` using the `proxy-wasm` SDK (v0.2.1). Envoy loads them via its V8 WASM runtime. Each filter implements `RootContext` (for config parsing and per-stream creation) and `HttpContext` (for per-request processing). 
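The split between a root context and per-request contexts can be sketched without the SDK. The traits below are simplified stand-ins written for this example; they are hypothetical and do not match the real `proxy-wasm` signatures. The toy `default_provider=` config format and the `FilterConfig` field are likewise assumptions. The sketch only shows the lifecycle: configuration is parsed once, then shared with a fresh context for each request, which picks a provider from the hint header or the default.

```rust
// Simplified stand-ins for the roles of the SDK's RootContext and HttpContext.
// These traits are hypothetical and do NOT mirror the proxy-wasm signatures.
use std::rc::Rc;

/// Parsed once when the host hands the filter its plugin configuration.
#[derive(Debug, Default)]
struct FilterConfig {
    default_provider: String, // hypothetical field for illustration
}

trait RootLike {
    fn on_configure(&mut self, raw: &str) -> bool;
    fn create_http_context(&self, context_id: u32) -> Box<dyn HttpLike>;
}

trait HttpLike {
    fn on_request_headers(&mut self, headers: &mut Vec<(String, String)>);
}

#[derive(Default)]
struct Root {
    config: Rc<FilterConfig>,
}

struct Stream {
    config: Rc<FilterConfig>,
}

impl RootLike for Root {
    fn on_configure(&mut self, raw: &str) -> bool {
        // Toy "key=value" parsing; a real filter would deserialize YAML/JSON.
        match raw.strip_prefix("default_provider=") {
            Some(p) if !p.is_empty() => {
                self.config = Rc::new(FilterConfig { default_provider: p.to_string() });
                true
            }
            _ => false, // reject configuration the filter cannot understand
        }
    }

    fn create_http_context(&self, _context_id: u32) -> Box<dyn HttpLike> {
        // Each request gets its own context sharing the parsed config.
        Box::new(Stream { config: Rc::clone(&self.config) })
    }
}

impl HttpLike for Stream {
    fn on_request_headers(&mut self, headers: &mut Vec<(String, String)>) {
        // Use the hint header if present, otherwise fall back to the default.
        let provider = headers
            .iter()
            .find(|(k, _)| k == "x-arch-llm-provider-hint")
            .map(|(_, v)| v.clone())
            .unwrap_or_else(|| self.config.default_provider.clone());
        headers.push(("x-arch-llm-provider".to_string(), provider));
    }
}

fn main() {
    let mut root = Root::default();
    assert!(root.on_configure("default_provider=openai"));
    let mut headers = Vec::new();
    root.create_http_context(1).on_request_headers(&mut headers);
    println!("{headers:?}");
}
```

In the real filters the same shape would hold, but the host (Envoy) drives the callbacks and header access goes through SDK functions rather than a `Vec`.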
+ +--- + +## Deployment Architecture + +All components run inside a single container managed by **Supervisord**: + +``` +┌─────────────────────────────────────────────────────────────┐ +│ Docker Container │ +│ │ +│ ┌─────────────────────────────────────────────────────┐ │ +│ │ Supervisord │ │ +│ │ │ │ +│ │ ┌─────────────┐ ┌───────────────┐ ┌───────────┐ │ │ +│ │ │ Brightstaff │ │ Envoy Proxy │ │ Log Tail │ │ │ +│ │ │ (Rust) │ │ + WASM │ │ │ │ │ +│ │ │ :9091 │ │ :10000-12001 │ │ │ │ │ +│ │ └─────────────┘ └───────────────┘ └───────────┘ │ │ +│ └─────────────────────────────────────────────────────┘ │ +│ │ +│ Startup sequence: │ +│ 1. config_generator.py validates arch_config.yaml │ +│ 2. Renders envoy.template.yaml → envoy.yaml (Jinja2) │ +│ 3. Starts Brightstaff + Envoy in parallel │ +│ │ +└─────────────────────────────────────────────────────────────┘ +``` + +**Docker multi-stage build:** +1. `deps` — Rust 1.93.0 with `wasm32-wasip1` target, dependency pre-compilation +2. `wasm-builder` — Builds `prompt_gateway.wasm` + `llm_gateway.wasm` (release) +3. `brightstaff-builder` — Builds the `brightstaff` native binary (release) +4. `envoy` — Pulls `envoyproxy/envoy:v1.37.0` +5. `arch` (final) — Python 3.13.6-slim base with Envoy binary, WASM plugins, Brightstaff binary, and the `planoai` CLI + +--- + +## Configuration Pipeline + +User-facing configuration flows through a generation pipeline before reaching Envoy and Brightstaff: + +``` +arch_config.yaml (user-authored) + │ + ▼ +config_generator.py (Python CLI) + 1. Validate against arch_config_schema.yaml (JSON Schema) + 2. Normalize legacy formats (llm_providers → model_providers) + 3. Parse agents, filters, endpoints → infer Envoy clusters + 4. Parse model_providers → validate provider/model format + 5. Auto-add internal models (arch-function, arch-router, plano-orchestrator) + 6. 
Validate model aliases, routing preferences, prompt target endpoints + │ + ├──► envoy.yaml (rendered from envoy.template.yaml via Jinja2) + │ → consumed by Envoy + │ + └──► arch_config_rendered.yaml + → consumed by Brightstaff + → injected into WASM filter configs +``` + +### Key Config Sections + +| Section | Consumed By | Purpose | +|---|---|---| +| `model_providers` | llm_gateway, Brightstaff | LLM provider definitions with models, auth, routing preferences | +| `prompt_targets` | prompt_gateway | Intent-to-API mappings with parameter schemas | +| `prompt_guards` | prompt_gateway | Input guardrails (jailbreak detection) | +| `endpoints` | prompt_gateway, Envoy | Named upstream API endpoint definitions | +| `agents` | Brightstaff, Envoy | Agent service definitions (id, URL, type) | +| `listeners` | Brightstaff, Envoy | Listener configs binding agents to ports | +| `ratelimits` | llm_gateway | Per-model rate limits with token-based quotas | +| `routing` | Brightstaff | LLM routing model/provider config | +| `model_aliases` | Brightstaff | Friendly name → provider/model mappings | +| `state_storage` | Brightstaff | Conversation state backend (memory / postgres) | +| `tracing` | All components | OpenTelemetry config (sampling, OTLP endpoint) | +| `overrides` | prompt_gateway, Brightstaff | Tuning (intent threshold, agent orchestrator toggle) | + +--- + +## Supported LLM Providers + +| Provider | Cluster | Auth Method | +|---|---|---| +| OpenAI | api.openai.com | Bearer token | +| Anthropic (Claude) | api.anthropic.com | x-api-key header | +| Google (Gemini) | generativelanguage.googleapis.com | API key in URL | +| Groq | api.groq.com | Bearer token | +| Mistral | api.mistral.ai | Bearer token | +| DeepSeek | api.deepseek.com | Bearer token | +| xAI | api.x.ai | Bearer token | +| Together AI | api.together.xyz | Bearer token | +| MoonshotAI | api.moonshot.ai | Bearer token | +| Zhipu | open.bigmodel.cn | Bearer token | +| Amazon Bedrock | Custom base_url | AWS Sig v4 
| +| Azure OpenAI | Custom base_url | Bearer / API key | +| Ollama | Custom base_url | None | +| Katanemo (Arch) | archfc.katanemo.dev | Bearer token | + +The `hermesllm` crate handles **cross-provider request/response translation** so clients can use a single API format (typically OpenAI-compatible) regardless of which upstream provider serves the request. diff --git a/crates/README.md b/crates/README.md new file mode 100644 index 00000000..8839f325 --- /dev/null +++ b/crates/README.md @@ -0,0 +1,233 @@ +# Plano Rust Crates + +This workspace contains 5 Rust crates that form the core of the Plano AI gateway. They are organized by compilation target and responsibility. + +## Workspace Layout + +``` +crates/ +├── Cargo.toml # Workspace root (resolver = "2") +├── build.sh # Builds WASM filters + native binary +├── brightstaff/ # Native Rust HTTP server (Axum) +├── common/ # Shared library (WASM-compatible) +├── hermesllm/ # LLM protocol translation (pure Rust) +├── llm_gateway/ # WASM filter: LLM routing & auth +└── prompt_gateway/ # WASM filter: intent matching & guardrails +``` + +--- + +## Crate Details + +### `prompt_gateway` — Inbound Prompt Processing + +| | | +|---|---| +| **Type** | `cdylib` (WASM filter) | +| **Target** | `wasm32-wasip1` | +| **Envoy listener** | `ingress_traffic_prompt` (:10001) | +| **Root ID** | `prompt_gateway` | +| **Depends on** | `common`, `proxy-wasm` | + +**Responsibilities:** +- Intercepts incoming chat completion requests +- Converts `prompt_targets` into OpenAI tool definitions +- Dispatches to `Arch-Function` model for intent classification +- If intent matches: calls developer API endpoints, augments prompt with response context +- If no match: prepends system prompt, forwards to upstream LLM +- Manages multi-turn state via `x-arch-state` header +- Applies `prompt_guards` (jailbreak detection) + +**Key modules:** +- `filter_context.rs` — RootContext, config parsing +- `http_context.rs` — Request interception, tool definition 
construction +- `stream_context.rs` — Core orchestration (intent matching, API calls, response handling) +- `tools.rs` — URL path/query parameter substitution for API calls + +**Constraints:** +- No `tokio`, `async/await`, threads, or network sockets +- All HTTP calls via `proxy-wasm` `dispatch_http_call` + +--- + +### `llm_gateway` — LLM Provider Routing & Translation + +| | | +|---|---| +| **Type** | `cdylib` (WASM filter) | +| **Target** | `wasm32-wasip1` | +| **Envoy listeners** | `ingress_traffic_prompt` (:10001), `egress_traffic_llm` (:12001) | +| **Root ID** | `llm_gateway` | +| **Depends on** | `common`, `hermesllm`, `proxy-wasm` | + +**Responsibilities:** +- Selects LLM provider based on `x-arch-llm-provider-hint` header or default +- Injects authentication credentials (Bearer token, x-api-key, passthrough) +- Rewrites request path for target provider API +- Transforms request/response formats between providers (OpenAI ↔ Anthropic ↔ Bedrock) via `hermesllm` +- Enforces token-based rate limits (`governor` with `no_std`) +- Handles SSE stream reassembly across chunk boundaries (`SseStreamBuffer`) +- Records metrics: TTFT, tokens/sec, request latency, rate-limited count + +**Key modules:** +- `filter_context.rs` — RootContext, provider & rate limit initialization +- `stream_context.rs` — Request/response transformation, auth, rate limiting, streaming +- `metrics.rs` — Gauge, counter, histogram definitions + +**Constraints:** +- Same WASM constraints as `prompt_gateway` +- Uses `hermesllm` for protocol translation — do NOT duplicate translation logic here + +--- + +### `common` — Shared Types & Utilities + +| | | +|---|---| +| **Type** | `lib` | +| **Target** | Both native and `wasm32-wasip1` | +| **Depends on** | `hermesllm`, `proxy-wasm`, `governor` (no_std), `tiktoken-rs` | + +**Responsibilities:** +- Central configuration schema (`Configuration`, `LlmProvider`, `PromptTarget`, `PromptGuards`, etc.) 
+- `LlmProviders` collection — provider lookup with slug matching and wildcard expansion +- HTTP client trait wrapping `proxy-wasm` `dispatch_http_call` +- All `x-arch-*` header constants and path constants (`consts.rs`) +- Token-based rate limiting (`governor`, keyed by model + header selector) +- Token counting via `tiktoken-rs` +- OpenAI-compatible API types (`ChatCompletionsRequest`, `Message`, `ToolCall`, etc.) +- Error types (`ClientError`, `ServerError`) +- Metrics primitives (`Gauge`, `Counter`, `Histogram`) +- URL path parameter substitution +- PII obfuscation for logging + +**Key modules:** +- `configuration.rs` — All config structs, deserialization, validation +- `consts.rs` — Canonical header names, paths, timeouts, cluster names +- `llm_providers.rs` — Provider collection with lookup logic +- `ratelimit.rs` — Token-based rate limiter (global `OnceLock`) +- `http.rs` — `Client` trait for WASM HTTP dispatch +- `tokenizer.rs` — Token counting (tiktoken, GPT-4 fallback) + +**Constraints:** +- Must compile for `wasm32-wasip1` — no std networking, no threads +- Must NOT depend on `brightstaff` + +--- + +### `hermesllm` — LLM Protocol Translation + +| | | +|---|---| +| **Type** | `lib` | +| **Target** | Native only (but no WASM-incompatible deps) | +| **Depends on** | `serde`, `serde_json`, `aws-smithy-eventstream`, `uuid` | + +**Responsibilities:** +- Cross-provider request/response translation (OpenAI ↔ Anthropic ↔ Amazon Bedrock ↔ Gemini) +- `ProviderRequest` / `ProviderResponse` / `ProviderStreamResponse` traits +- SSE stream parsing (`SseStreamIter`, `SseStreamBuffer`, `SseChunkProcessor`) +- AWS Event Stream binary frame decoding (Bedrock) +- Provider identification (`ProviderId` enum with model catalog from `provider_models.yaml`) +- Target endpoint path rewriting (`/v1/chat/completions` → provider-specific paths) + +**Key modules:** +- `apis/` — Format definitions: `openai.rs`, `anthropic.rs`, `amazon_bedrock.rs`, `openai_responses.rs` +- 
`apis/streaming_shapes/` — SSE and binary stream parsing +- `providers/` — `id.rs` (ProviderId), `request.rs`, `response.rs`, `streaming_response.rs` +- `clients/endpoints.rs` — API path mapping +- `transforms/` — Request/response transformations organized by direction + +**Constraints:** +- **MUST NOT depend on `proxy-wasm` or `common`** — this is a pure Rust library +- Must remain usable outside of the WASM/Envoy context +- Optional `model-fetch` feature gates network dependencies (`ureq`) + +--- + +### `brightstaff` — Native HTTP Server + +| | | +|---|---| +| **Type** | Binary (Axum) | +| **Target** | Native only | +| **Port** | `0.0.0.0:9091` | +| **Depends on** | `hermesllm`, `common` (non-WASM parts), `tokio`, `axum`, `reqwest`, `opentelemetry` | + +**Responsibilities:** +- LLM request routing via `Arch-Router` model (selects best provider/model) +- Agent orchestration via `Plano-Orchestrator` model (selects and chains agents) +- Agent execution pipeline: filter chains → agent invocation (MCP JSON-RPC or HTTP) +- `Arch-Function` handler: tool calling with hallucination detection +- Conversation state management for Responses API (memory or PostgreSQL) +- Model alias resolution +- OpenTelemetry tracing with per-component service names +- Interaction signal analysis (frustration, repetition, escalation detection) + +**Key modules:** +- `handlers/llm.rs` — LLM passthrough with routing +- `handlers/agent_chat_completions.rs` — Agent orchestration entry point +- `handlers/agent_selector.rs` — Agent selection logic +- `handlers/pipeline_processor.rs` — Sequential agent/filter execution +- `handlers/function_calling.rs` — Arch-Function tool calling +- `router/llm_router.rs` — `RouterService` (Arch-Router model) +- `router/plano_orchestrator.rs` — `OrchestratorService` (Plano-Orchestrator model) +- `state/` — `StateStorage` trait, memory & PostgreSQL backends +- `signals/` — Conversation quality analysis +- `tracing/` — OpenTelemetry setup with custom service name 
routing + +**Constraints:** +- All external calls go through Envoy (localhost:12001 for LLMs, localhost:11000 for agents) +- Does NOT use `common`'s `proxy-wasm` Client trait — uses `reqwest` instead + +--- + +## Dependency Graph + +``` +prompt_gateway ──► common ──► hermesllm +llm_gateway ───┬► common ──► hermesllm + └► hermesllm +brightstaff ───┬► hermesllm + └► common (config types only, not WASM code) + +hermesllm ────► (standalone — no proxy-wasm, no common) +``` + +**Direction is strictly enforced:** +- Arrows point toward dependencies +- No cycles allowed +- `hermesllm` is the leaf node — it must never depend on any other workspace crate + +--- + +## Build Commands + +```bash +# Everything (recommended) +./build.sh + +# Equivalent to: +cargo build --release --target wasm32-wasip1 -p prompt_gateway -p llm_gateway +cargo build --release -p brightstaff + +# Tests (all crates, native target) +cargo test --workspace + +# Single crate test +cargo test -p common +cargo test -p hermesllm +cargo test -p prompt_gateway +cargo test -p llm_gateway +cargo test -p brightstaff +``` + +## WASM Output Location + +After building, WASM filter binaries are at: +``` +target/wasm32-wasip1/release/prompt_gateway.wasm +target/wasm32-wasip1/release/llm_gateway.wasm +``` + +These are loaded by Envoy at startup from `/etc/envoy/proxy-wasm-plugins/` in the Docker image. diff --git a/docs/ADR/001-envoy-as-data-plane.md b/docs/ADR/001-envoy-as-data-plane.md new file mode 100644 index 00000000..97c20183 --- /dev/null +++ b/docs/ADR/001-envoy-as-data-plane.md @@ -0,0 +1,35 @@ +# ADR 001: Envoy as the Data Plane + +**Status:** Accepted + +## Context + +Plano needs to proxy all traffic between clients, LLM providers, and developer APIs. The options were: +1. Build a custom proxy from scratch in Rust (e.g., using `hyper`/`axum` directly) +2. Use an existing L7 proxy (Envoy, NGINX, HAProxy) and extend it +3. 
Use a service mesh sidecar approach + +We need: TLS termination, connection pooling, retry policies, load balancing, header-based routing, streaming support (SSE), compression, and observability — all at production quality. + +## Decision + +Use **Envoy Proxy** as the data plane. All external traffic — both inbound client requests and outbound LLM/API calls — flows through Envoy. The native Rust service (Brightstaff) never makes direct outbound connections to external hosts. + +## Consequences + +**Enables:** +- Production-grade L7 proxying (TLS, HTTP/2, connection pooling, retries) without building it ourselves +- WASM filter extension model for inline request/response processing +- Standard observability (access logs, stats, tracing) out of the box +- Header-based routing via Envoy's route configuration — no custom routing code needed for cluster selection +- Hot-restart and graceful draining for zero-downtime updates + +**Requires:** +- All Brightstaff external calls must go through Envoy listeners (localhost:12001 for LLMs, localhost:11000 for APIs) +- Custom headers (`x-arch-*`) for routing decisions — Envoy matches on these in its route config +- Envoy configuration must be generated from user config (Jinja2 template → envoy.yaml) +- Team must understand Envoy's configuration model (listeners, clusters, filter chains) + +**Prevents:** +- Direct HTTP calls from Brightstaff to external services (this is intentional — it ensures all traffic gets WASM filter processing, auth injection, rate limiting, and observability) +- Simple single-binary deployment (we need Envoy + Brightstaff, managed by Supervisord) diff --git a/docs/ADR/002-wasm-filters-over-native.md b/docs/ADR/002-wasm-filters-over-native.md new file mode 100644 index 00000000..cc121c54 --- /dev/null +++ b/docs/ADR/002-wasm-filters-over-native.md @@ -0,0 +1,42 @@ +# ADR 002: WASM Filters Over Native Envoy Filters + +**Status:** Accepted + +## Context + +Envoy supports four extension mechanisms: +1.
**Native C++ filters** — compiled into the Envoy binary, highest performance +2. **WASM filters** — compiled to WebAssembly, loaded at runtime via Envoy's WASM VM +3. **Lua filters** — scripted, limited functionality +4. **External processing (ext_proc)** — gRPC callout to an external service + +We need filters that: parse and transform LLM request/response bodies, perform intent matching, inject authentication headers, enforce rate limits, and handle SSE stream reassembly. + +## Decision + +Use **WASM filters** written in Rust, compiled to `wasm32-wasip1`, loaded by Envoy's V8 runtime. We have two filters: +- `prompt_gateway.wasm` — inbound prompt processing (intent matching, guardrails, function calling) +- `llm_gateway.wasm` — outbound LLM processing (provider routing, auth, rate limiting, format translation) + +## Consequences + +**Enables:** +- Filters written in Rust with strong type safety and shared crates (`common`, `hermesllm`) +- Runtime-loadable: no need to rebuild Envoy itself +- Sandboxed execution: a filter crash doesn't bring down Envoy +- Same language (Rust) for WASM filters and Brightstaff — shared types and logic via workspace crates + +**Requires:** +- No `tokio`, `async/await`, threads, filesystem, or network sockets in WASM crates +- All I/O must use `proxy-wasm` SDK's `dispatch_http_call` (callback-based) +- Dependencies must be WASM-compatible: `governor` needs `no_std` feature, no crates using `std::net` +- `crate-type = ["cdylib"]` — these build as shared libraries, not binaries +- Testing runs natively (`cargo test`), but building requires `--target wasm32-wasip1` + +**Prevents:** +- Using async Rust patterns in filter code (callback-based `on_http_call_response` instead) +- Using popular HTTP client crates (`reqwest`, `hyper`) in filters +- Easy debugging — WASM filters run inside Envoy's V8 VM with limited introspection + +**Trade-off vs. 
ext_proc:** +External processing would allow using Brightstaff (native Rust with full async) for all processing, but would add network round-trips for every request. WASM filters run inline in Envoy's filter chain — zero additional network hops for common operations like auth injection and rate limiting. diff --git a/docs/ADR/003-single-container-supervisord.md b/docs/ADR/003-single-container-supervisord.md new file mode 100644 index 00000000..23028f6a --- /dev/null +++ b/docs/ADR/003-single-container-supervisord.md @@ -0,0 +1,42 @@ +# ADR 003: Single Container with Supervisord + +**Status:** Accepted + +## Context + +Plano has three runtime processes: +1. **Envoy Proxy** — the data plane with WASM filters +2. **Brightstaff** — the Rust HTTP service for routing and orchestration +3. **Config generator** — Python script that validates config and renders Envoy's YAML (runs at startup) + +The options for deployment were: +1. **Separate containers** — each process in its own container, orchestrated by Docker Compose / K8s +2. **Single container with process manager** — all processes in one container, managed by Supervisord +3. **Single binary** — embed Envoy or reimplement its core functionality + +## Decision + +Run all processes in a **single container** managed by **Supervisord**. The startup sequence: +1. Config generator validates `arch_config.yaml` and renders `envoy.yaml` +2. Supervisord starts Brightstaff and Envoy in parallel +3. 
A log tail process unifies access log output + +## Consequences + +**Enables:** +- Simple deployment: one container, one image, `docker run` just works +- No network latency between Envoy and Brightstaff (localhost communication) +- Config generation happens at container startup — no external config rendering step +- Easy development: `docker compose up` with volume mounts for hot-reload + +**Requires:** +- Supervisord configuration (`config/supervisord.conf`) to manage process lifecycle +- Health checks must account for both Envoy and Brightstaff readiness +- Logs from all processes need unified output (handled by the tail process) + +**Prevents:** +- Independent scaling of Envoy vs. Brightstaff (they scale together as one unit) +- Kubernetes sidecar pattern (though this could be reconsidered) +- Process-level fault isolation (though Supervisord restarts failed processes) + +**Trade-off:** Simplicity of deployment over horizontal scaling flexibility. For a gateway that needs to be deployed at the edge or as a sidecar, single-container simplicity is more valuable than the ability to scale components independently. diff --git a/docs/ADR/004-hermesllm-pure-rust.md b/docs/ADR/004-hermesllm-pure-rust.md new file mode 100644 index 00000000..c2d52d3c --- /dev/null +++ b/docs/ADR/004-hermesllm-pure-rust.md @@ -0,0 +1,45 @@ +# ADR 004: hermesllm as a Pure Rust Library + +**Status:** Accepted + +## Context + +LLM providers use different API formats (OpenAI Chat Completions, Anthropic Messages, Amazon Bedrock Converse, Gemini). The gateway needs to translate between these formats in two places: +1. In the `llm_gateway` WASM filter (inline in Envoy) +2. In Brightstaff (for routing decisions and response processing) + +The options were: +1. Duplicate translation logic in both places +2. Put translation logic in `common` (shared crate, but WASM-constrained) +3. 
Create a separate pure Rust library with no WASM dependencies + +## Decision + +Create **`hermesllm`** as a standalone Rust library that handles all LLM protocol translation. It must never depend on `proxy-wasm` or `common`. Both WASM crates (via `common`) and Brightstaff use `hermesllm` directly. + +## Consequences + +**Enables:** +- Single source of truth for LLM protocol translation +- Reusable outside the gateway context (could be published as an independent crate) +- Full Rust standard library available (no WASM constraints on the library itself) +- Clean separation: protocol knowledge lives in `hermesllm`, gateway logic lives in filters + +**Requires:** +- `hermesllm` must not import `proxy-wasm`, `common`, or any WASM-specific crate +- Adding a new provider requires changes only in `hermesllm` (plus config in `common/configuration.rs` and `envoy.template.yaml`) +- Types shared between `hermesllm` and the filters go through `common`'s re-exports + +**Prevents:** +- Circular dependencies (hermesllm is always a leaf in the dependency graph) +- Accidentally coupling protocol translation to WASM runtime specifics +- Needing to maintain two separate translation implementations + +**Dependency direction:** +``` +prompt_gateway → common → hermesllm +llm_gateway → common → hermesllm +llm_gateway → hermesllm (direct) +brightstaff → hermesllm (direct) +hermesllm → (no workspace deps) +``` diff --git a/docs/ADR/005-header-based-routing.md b/docs/ADR/005-header-based-routing.md new file mode 100644 index 00000000..0adab6ac --- /dev/null +++ b/docs/ADR/005-header-based-routing.md @@ -0,0 +1,40 @@ +# ADR 005: Header-Based Routing Protocol + +**Status:** Accepted + +## Context + +Envoy needs to route requests to different upstream clusters (LLM providers, developer APIs, agents) based on runtime decisions made by WASM filters and Brightstaff. The options were: +1. **Path-based routing** — different URL paths for different upstreams +2. 
**Header-based routing** — custom headers to signal routing decisions +3. **Dynamic cluster selection** — programmatic cluster selection in filters + +## Decision + +Use **custom `x-arch-*` headers** for all routing decisions. WASM filters and Brightstaff set headers like `x-arch-llm-provider` and `x-arch-upstream`, and Envoy's route configuration matches on these headers to select the upstream cluster. + +All header names are defined as constants in `common/src/consts.rs` — this is the single source of truth. + +## Consequences + +**Enables:** +- Decoupled routing: WASM filters decide *where* to route, Envoy handles *how* to connect +- Transparent to the client — custom headers are internal, clients see standard HTTP +- Easy to debug: inspect headers to understand routing decisions +- Composable: multiple filters can add/modify routing headers in the filter chain + +**Requires:** +- Header names must be consistent between `consts.rs` and `envoy.template.yaml` +- Any new routing dimension needs a new header constant + Envoy route match rule +- Developers must grep all consumers when changing a header name + +**Prevents:** +- Routing logic in Envoy's configuration alone (routing decisions are made by Rust code, not Envoy config) +- Using Envoy's native routing features (like weighted clusters) independently — they must be combined with header matching + +**Key headers:** +- `x-arch-llm-provider` — LLM provider cluster selection (Envoy route matching) +- `x-arch-llm-provider-hint` — Provider hint from Brightstaff to llm_gateway +- `x-arch-upstream` — Agent/API endpoint cluster selection +- `x-arch-streaming-request` — Streaming mode signal +- `x-arch-state` — Multi-turn conversation state (prompt_gateway internal) diff --git a/docs/ADR/006-config-generation-pipeline.md b/docs/ADR/006-config-generation-pipeline.md new file mode 100644 index 00000000..c8e9ac51 --- /dev/null +++ b/docs/ADR/006-config-generation-pipeline.md @@ -0,0 +1,48 @@ +# ADR 006: Config Generation 
Pipeline (Python + Jinja2) + +**Status:** Accepted + +## Context + +Envoy's configuration is a large YAML file that must describe all listeners, clusters, filter chains, TLS contexts, and WASM filter configs. This configuration depends on user-provided settings (which LLM providers to use, which agents to connect, which endpoints to expose). + +The options were: +1. **Static Envoy config** — users edit Envoy YAML directly +2. **Rust-based config generator** — generate Envoy config from a Rust binary +3. **Python + Jinja2 template** — validate user config against a schema, then render Envoy config from a template + +## Decision + +Use a **Python config generator** (`cli/planoai/config_generator.py`) that: +1. Validates user's `arch_config.yaml` against a JSON Schema (`config/arch_config_schema.yaml`) +2. Applies transformations (legacy format conversion, cluster inference, internal model injection) +3. Renders `config/envoy.template.yaml` (Jinja2) into the final `envoy.yaml` +4. Produces `arch_config_rendered.yaml` for Brightstaff and WASM filter consumption + +This runs at container startup, before Envoy starts. 
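The four steps above can be sketched as a tiny pipeline. This is a minimal illustration, not the code in `cli/planoai/config_generator.py`: `string.Template` stands in for Jinja2, a hand-written check stands in for JSON Schema validation, and the `model` and `cluster` field names, the `provider/model` string shape, and every function name are assumptions made for the example.

```python
import json
from string import Template

# Hypothetical, heavily simplified stand-in for config/envoy.template.yaml.
# The real pipeline renders a Jinja2 template; string.Template is used here
# only to keep the sketch dependency-free.
ENVOY_TEMPLATE = Template("clusters:\n$clusters")


def validate(config: dict) -> None:
    """Stand-in for JSON Schema validation: fail before Envoy would start."""
    if "model_providers" not in config:
        raise ValueError("missing required section: model_providers")
    for provider in config["model_providers"]:
        if "model" not in provider:
            raise ValueError("each model provider needs a 'model' field")


def transform(config: dict) -> dict:
    """Stand-in for the transformation step: infer a cluster per provider."""
    rendered = json.loads(json.dumps(config))  # cheap deep copy
    for provider in rendered["model_providers"]:
        # Assumption: models are written as 'provider/model' strings and the
        # cluster is named after the provider slug.
        provider.setdefault("cluster", provider["model"].split("/", 1)[0])
    return rendered


def render_envoy(config: dict) -> str:
    """Render the (toy) Envoy config from the transformed user config."""
    clusters = sorted({p["cluster"] for p in config["model_providers"]})
    lines = "\n".join(f"  - name: {name}" for name in clusters)
    return ENVOY_TEMPLATE.substitute(clusters=lines)


def generate(config: dict) -> tuple[str, str]:
    """Validate, transform, then emit both artifacts of the pipeline."""
    validate(config)
    rendered = transform(config)
    return render_envoy(rendered), json.dumps(rendered, indent=2)
```

Calling `generate({"model_providers": [{"model": "openai/gpt-4"}]})` returns the rendered Envoy text and the rendered config as a pair. Validation runs first, so a malformed config raises before any output is produced.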
+ +## Consequences + +**Enables:** +- Simple user-facing config format (`arch_config.yaml`) — users don't need to understand Envoy internals +- JSON Schema validation catches errors before Envoy starts +- Jinja2 templating is mature, well-understood, and powerful for generating complex YAML +- Python CLI (`planoai`) can also handle Docker management and other tooling +- Config validation is independently testable (`cli/test/test_config_generator.py`) + +**Requires:** +- Python runtime in the Docker image (adds image size) +- Config changes need updates in 4 places: schema, template, Python validator, Rust struct +- Understanding of Jinja2 templating for Envoy config modifications +- `arch_config_rendered.yaml` must be kept in sync between Python generator and Rust deserialization + +**Prevents:** +- Dynamic config reloading without container restart (config is generated at startup) +- Using Envoy's xDS protocol for dynamic configuration (could be added later) +- Rust-only development workflow — Python is required for config generation + +**4-file update rule:** Every new user-facing config field requires changes to: +1. `config/arch_config_schema.yaml` — JSON Schema definition +2. `config/envoy.template.yaml` — Jinja2 template (if Envoy needs the value) +3. `cli/planoai/config_generator.py` — Python validation and rendering logic +4. `common/src/configuration.rs` — Rust `Configuration` struct (for runtime consumption) diff --git a/docs/ADR/README.md b/docs/ADR/README.md new file mode 100644 index 00000000..749e209f --- /dev/null +++ b/docs/ADR/README.md @@ -0,0 +1,22 @@ +# Architecture Decision Records + +This directory contains Architecture Decision Records (ADRs) for the Plano project. ADRs document key architectural decisions, their context, and rationale — preventing future contributors (human or AI) from unknowingly reversing deliberate choices. 
+ +## Index + +| ADR | Title | Status | +|-----|-------|--------| +| [001](001-envoy-as-data-plane.md) | Envoy as the Data Plane | Accepted | +| [002](002-wasm-filters-over-native.md) | WASM Filters Over Native Envoy Filters | Accepted | +| [003](003-single-container-supervisord.md) | Single Container with Supervisord | Accepted | +| [004](004-hermesllm-pure-rust.md) | hermesllm as a Pure Rust Library | Accepted | +| [005](005-header-based-routing.md) | Header-Based Routing Protocol | Accepted | +| [006](006-config-generation-pipeline.md) | Config Generation Pipeline (Python + Jinja2) | Accepted | + +## ADR Format + +Each ADR follows this structure: +- **Status**: Proposed / Accepted / Deprecated / Superseded +- **Context**: What problem or question prompted this decision +- **Decision**: What was decided +- **Consequences**: Trade-offs, implications, and what this enables or prevents diff --git a/docs/DATA_CONTRACTS.md b/docs/DATA_CONTRACTS.md new file mode 100644 index 00000000..3c7c0417 --- /dev/null +++ b/docs/DATA_CONTRACTS.md @@ -0,0 +1,221 @@ +# Data Contracts — Inter-Component Communication + +This document defines the contracts between Plano's components: custom HTTP headers, internal API formats, streaming protocols, and Envoy routing conventions. Breaking any of these contracts will cause silent routing failures. + +--- + +## 1. Custom Header Protocol + +All custom headers are defined in `common/src/consts.rs`. This is the **single source of truth** — if a header name appears in `envoy.template.yaml` or Brightstaff code, it must match the constant in `consts.rs`. + +### Routing Headers (Envoy-critical) + +These headers are used in Envoy's `route_config` for cluster selection. Changing them requires updating `envoy.template.yaml`. 
+ +| Header | Constant | Set By | Read By | Value Format | Purpose | +|---|---|---|---|---|---| +| `x-arch-llm-provider` | `ARCH_ROUTING_HEADER` | WASM filters | Envoy routes | Provider slug (e.g., `openai`, `anthropic`) | Selects the LLM provider cluster in Envoy | +| `x-arch-upstream` | `ARCH_UPSTREAM_HOST_HEADER` | WASM filters, Brightstaff | Envoy routes | Cluster name (e.g., agent endpoint name) | Routes to a specific upstream cluster | +| `x-arch-llm-provider-hint` | `ARCH_PROVIDER_HINT_HEADER` | Brightstaff | llm_gateway | `provider/model` (e.g., `openai/gpt-4`) | Hints which provider+model to use | +| `x-arch-agent-listener-name` | — | Envoy (set in route config) | Brightstaff | Listener name string | Identifies which agent listener a request arrived on | + +### Internal State Headers (WASM filter internal) + +These headers pass state between the prompt_gateway filter's request/response phases or between prompt_gateway and the function calling service. + +| Header | Constant | Set By | Read By | Value Format | Purpose | +|---|---|---|---|---|---| +| `x-arch-state` | `X_ARCH_STATE_HEADER` | prompt_gateway | prompt_gateway | Base64-encoded JSON (`ArchState`) | Multi-turn conversation state across filter invocations | +| `x-arch-tool-call-message` | `X_ARCH_TOOL_CALL` | prompt_gateway | prompt_gateway | JSON string | Tool call metadata for API orchestration | +| `x-arch-api-response-message` | `X_ARCH_API_RESPONSE` | prompt_gateway | prompt_gateway | JSON string | Developer API response data | +| `x-arch-fc-model-response` | `X_ARCH_FC_MODEL_RESPONSE` | prompt_gateway | prompt_gateway | JSON string | Raw Arch-Function model response | +| `x-arch-llm-route` | `LLM_ROUTE_HEADER` | Brightstaff | llm_gateway | Route name string | LLM route decision result | + +### Signaling Headers + +| Header | Constant | Set By | Read By | Purpose | +|---|---|---|---|---| +| `x-arch-streaming-request` | `ARCH_IS_STREAMING_HEADER` | Brightstaff | llm_gateway | Indicates the 
request is in streaming mode | +| `x-arch-ratelimit-selector` | `RATELIMIT_SELECTOR_HEADER_KEY` | Client / Envoy | llm_gateway | Key for per-tenant rate limit partitioning | + +### Standard Headers Used + +| Header | Constant | Purpose | +|---|---|---| +| `x-request-id` | `REQUEST_ID_HEADER` | Request tracing (set by Envoy or caller) | +| `x-envoy-original-path` | `ENVOY_ORIGINAL_PATH_HEADER` | Original path before Envoy rewrites | +| `x-envoy-max-retries` | `ENVOY_RETRY_HEADER` | Retry count for Envoy's retry policy | +| `traceparent` | `TRACE_PARENT_HEADER` | W3C Trace Context for OpenTelemetry | + +--- + +## 2. Internal Cluster Names + +Defined in `consts.rs` and referenced in `envoy.template.yaml`: + +| Constant | Value | Target | Purpose | +|---|---|---|---| +| `MODEL_SERVER_NAME` | `"bright_staff"` | localhost:9091 | Brightstaff service | +| `ARCH_INTERNAL_CLUSTER_NAME` | `"arch_internal"` | localhost:11000 | Outbound API router | +| `ARCH_FC_CLUSTER` | `"arch"` | archfc.katanemo.dev:443 | Katanemo Arch-Function model | + +Additional clusters generated from config: +- `arch_prompt_gateway_listener` → localhost:10001 +- `arch_listener_llm` → localhost:12001 +- Per-provider clusters (e.g., `openai`, `anthropic`, `gemini`) from `envoy.template.yaml` +- Per-agent/endpoint clusters from user config + +--- + +## 3. Internal API Formats + +### Brightstaff → Envoy (LLM requests via :12001) + +Brightstaff sends OpenAI-compatible `ChatCompletionsRequest` JSON to `localhost:12001` with: +- `x-arch-llm-provider-hint: /` to select the provider +- `x-arch-streaming-request: true/false` to indicate streaming +- Standard `Content-Type: application/json` +- `traceparent` for distributed tracing + +The `llm_gateway` WASM filter at :12001 transforms the request to the target provider's format.
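As a rough sketch of this contract, the snippet below assembles the URL, headers, and body for such a request without sending it. Brightstaff itself is written in Rust, so this only illustrates the wire shape. Header names are taken from the Routing and Signaling Headers tables in section 1; the `/v1/chat/completions` path, the function name, and the choice to put the bare model name in the body are assumptions.

```python
import json

# Assumption: the egress LLM listener is addressed with the OpenAI-style path.
LLM_LISTENER_URL = "http://localhost:12001/v1/chat/completions"


def build_llm_request(provider, model, messages, stream, traceparent):
    """Return (url, headers, body) for an LLM call routed through Envoy."""
    headers = {
        "content-type": "application/json",
        # Hint format from the routing table: provider/model, e.g. openai/gpt-4
        "x-arch-llm-provider-hint": f"{provider}/{model}",
        # Signals streaming mode to llm_gateway as the string true or false
        "x-arch-streaming-request": "true" if stream else "false",
        # W3C Trace Context value propagated for distributed tracing
        "traceparent": traceparent,
    }
    # Assumption: the body carries the bare model name in OpenAI format.
    body = json.dumps({"model": model, "messages": messages, "stream": stream})
    return LLM_LISTENER_URL, headers, body
```

For example, `build_llm_request("openai", "gpt-4", [{"role": "user", "content": "hi"}], True, "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")` yields a request whose hint header is `openai/gpt-4`. The target is always the local Envoy listener, never the provider's own host.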
+ +### Brightstaff → Envoy (Agent/API requests via :11000) + +Brightstaff sends requests to `localhost:11000` with: +- `x-arch-upstream: ` to route to the target agent/API +- `x-envoy-max-retries: 3` for resilience + +**MCP Agent Protocol:** +``` +POST / (with x-arch-upstream) +Content-Type: application/json + +# Step 1: Initialize +{"jsonrpc":"2.0","method":"initialize","id":"","params":{...}} + +# Step 2: Initialized notification +{"jsonrpc":"2.0","method":"notifications/initialized"} + +# Step 3: Tool call +{"jsonrpc":"2.0","method":"tools/call","id":"","params":{"name":"","arguments":{...}}} +``` + +**HTTP Agent Protocol:** +``` +POST / (with x-arch-upstream) +Content-Type: application/json + +[{"role":"user","content":"..."},{"role":"assistant","content":"..."}] +``` +Response: Array of messages. + +### prompt_gateway → Arch-Function (/function_calling) + +``` +POST /function_calling +Content-Type: application/json + +{ + "messages": [...], + "tools": [...], + "model": "Arch-Function", + "stream": false, + "metadata": {"raw_response": true, "logprobs": true} +} +``` + +Response contains `tool_calls`, `response`, or `clarification` in the assistant message content (JSON string). + +--- + +## 4. Streaming Protocol + +### SSE (Server-Sent Events) — Standard LLM Streaming + +All streaming LLM responses use SSE format: +``` +data: {"id":"...","choices":[...]}\n\n +data: {"id":"...","choices":[...]}\n\n +data: [DONE]\n\n +``` + +**Important:** SSE events can be split across HTTP chunks. The `llm_gateway` uses `SseStreamBuffer` and `SseChunkProcessor` (from `hermesllm`) to reassemble events across chunk boundaries before processing. + +### Bedrock Binary Streaming + +Amazon Bedrock uses AWS Event Stream binary protocol instead of SSE. The `BedrockBinaryFrameDecoder` in `hermesllm` handles decoding. + +### Brightstaff Streaming + +Brightstaff uses `tokio::sync::mpsc` channels to stream responses: +1.
Spawns a background task to read from upstream (via `reqwest`) +2. Parses SSE events, optionally transforms them +3. Sends chunks through the mpsc channel +4. Axum's `StreamBody` delivers to the client + +--- + +## 5. Configuration Injection + +### WASM Filter Configuration + +Envoy injects config into WASM filters via the `configuration` field in the filter definition: + +- **prompt_gateway** receives: `prompt_targets`, `prompt_guards`, `system_prompt`, `endpoints`, `overrides`, `tracing` +- **llm_gateway** receives: `model_providers`, `ratelimits`, `overrides` + +Both receive YAML strings parsed by `serde_yaml` in each filter's `RootContext::on_configure()`. + +### Brightstaff Configuration + +Brightstaff reads `arch_config_rendered.yaml` (path from `ARCH_CONFIG_PATH_RENDERED` env var), which contains the full rendered config including `model_providers`, `agents`, `filters`, `listeners`, `routing`, `model_aliases`, `state_storage`, `tracing`, and `overrides`. + +--- + +## 6. Timeouts + +All timeouts are defined in `consts.rs`: + +| Constant | Value | Used For | +|---|---|---| +| `ARCH_FC_REQUEST_TIMEOUT_MS` | 30,000 ms | Arch-Function model calls from prompt_gateway | +| `DEFAULT_TARGET_REQUEST_TIMEOUT_MS` | 30,000 ms | Default prompt target endpoint calls | +| `API_REQUEST_TIMEOUT_MS` | 30,000 ms | Developer API calls from prompt_gateway | +| `MODEL_SERVER_REQUEST_TIMEOUT_MS` | 30,000 ms | Model server calls | + +Envoy also enforces its own route-level timeouts configured in `envoy.template.yaml` (default 300s for LLM routes). + +--- + +## 7. Error Response Format + +All error responses from Brightstaff follow this format: + +```json +{ + "error": { + "message": "Human-readable error description", + "type": "error_type", + "code": 400 + } +} +``` + +The `llm_gateway` WASM filter returns errors as: +- HTTP 429 for rate limit exceeded +- HTTP 503 for provider unavailable +- The original upstream error status code for pass-through errors + +--- + +## 8. 
Contract Change Checklist + +When modifying any data contract: + +- [ ] Update the constant in `common/src/consts.rs` +- [ ] Grep the entire codebase for the old value (`grep -r "old_value" crates/`) +- [ ] Update `config/envoy.template.yaml` if the header is used in routing +- [ ] Update `cli/planoai/config_generator.py` if the config schema changed +- [ ] Update `config/arch_config_schema.yaml` if user-facing config changed +- [ ] Run `cargo test --workspace` to catch compile/test failures +- [ ] Run `cd cli && python -m pytest test/` for config generation tests
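The grep step in this checklist could be backed by a small drift check. The sketch below is one possible approach, not existing tooling in the repository: it scans text for `x-arch-*` header names and reports any that are missing from the tables in section 1. The function name and the idea of hard-coding the known set (rather than parsing `consts.rs`) are assumptions.

```python
import re

# Header names copied from the tables in section 1 of this document.
KNOWN_HEADERS = {
    "x-arch-llm-provider",
    "x-arch-upstream",
    "x-arch-llm-provider-hint",
    "x-arch-agent-listener-name",
    "x-arch-state",
    "x-arch-tool-call-message",
    "x-arch-api-response-message",
    "x-arch-fc-model-response",
    "x-arch-llm-route",
    "x-arch-streaming-request",
    "x-arch-ratelimit-selector",
}

# Matches concrete header names only; the wildcard form x-arch-* is skipped
# because at least one alphanumeric character must follow the prefix.
HEADER_PATTERN = re.compile(r"x-arch-[a-z0-9]+(?:-[a-z0-9]+)*")


def unknown_headers(text: str) -> list[str]:
    """Return x-arch-* names found in text that are not documented above."""
    found = set(HEADER_PATTERN.findall(text))
    return sorted(found - KNOWN_HEADERS)
```

Running `unknown_headers` over the contents of a template or source file returns an empty list when every header it mentions is documented. Any name in the result is either a typo or a contract that still needs a table entry.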