Update docs to Plano (#639)

This commit is contained in:
Salman Paracha 2025-12-23 17:14:50 -08:00 committed by GitHub
parent 15fbb6c3af
commit e224cba3e3
139 changed files with 4407 additions and 24735 deletions

View file

@ -1,105 +0,0 @@
.. _agent_routing:
Agent Routing and Hand Off
===========================
Agent Routing and Hand Off is a key feature in Arch that enables intelligent routing of user prompts to specialized AI agents or human agents based on the nature and complexity of the user's request.
This capability significantly enhances the efficiency and personalization of interactions, ensuring each prompt receives the most appropriate and effective handling. The following section describes
the workflow, configuration, and implementation of Agent routing and hand off in Arch.
#. **Agent Selection**
When a user submits a prompt, Arch analyzes the input to determine the intent and complexity. Based on the analysis, Arch selects the most suitable agent configured within your application to handle the specific category of the user's request—such as sales inquiries, technical issues, or complex scenarios requiring human attention.
#. **Prompt Routing**
After selecting the appropriate agent, Arch routes the user's prompt to the designated agent's endpoint and waits for the agent to respond with the processed output or further instructions.
#. **Hand Off**
Based on follow-up queries from the user, Arch repeats the process of analysis, agent selection, and routing to ensure a seamless hand off between AI agents as needed.
.. code-block:: yaml
:caption: Agent Routing and Hand Off Configuration Example
prompt_targets:
  - name: sales_agent
    description: Handles queries related to sales and purchases
  - name: issues_and_repairs
    description: Handles issues, repairs, or refunds
  - name: escalate_to_human
    description: Escalates to a human agent
.. code-block:: python
:caption: Agent Routing and Hand Off Implementation Example via FastAPI
class Agent:
    def __init__(self, role: str, instructions: str):
        self.system_prompt = f"You are a {role}.\n{instructions}"

    def handle(self, req: ChatCompletionsRequest):
        messages = [{"role": "system", "content": self.get_system_prompt()}] + [
            message.model_dump() for message in req.messages
        ]
        return call_openai(messages, req.stream)  # call_openai is a placeholder for the actual API call

    def get_system_prompt(self) -> str:
        return self.system_prompt
# Define your agents
AGENTS = {
    "sales_agent": Agent(
        role="sales agent",
        instructions=(
            "Always answer in a sentence or less.\n"
            "Follow the following routine with the user:\n"
            "1. Engage\n"
            "2. Quote ridiculous price\n"
            "3. Reveal caveat if user agrees."
        ),
    ),
    "issues_and_repairs": Agent(
        role="issues and repairs agent",
        instructions="Propose a solution, offer refund if necessary.",
    ),
    "escalate_to_human": Agent(
        role="human escalation agent", instructions="Escalate issues to a human."
    ),
    "unknown_agent": Agent(
        role="general assistant", instructions="Assist the user in general queries."
    ),
}
# Handle the request from Arch gateway
@app.post("/v1/chat/completions")
def completion_api(req: ChatCompletionsRequest, request: Request):
    agent_name = req.metadata.get("agent-name", "unknown_agent")
    # Fall back to the general assistant if the agent name is not recognized
    agent = AGENTS.get(agent_name, AGENTS["unknown_agent"])
    logger.info(f"Routing to agent: {agent_name}")
    return agent.handle(req)
.. note::
The above example demonstrates a simple implementation of Agent Routing and Hand Off using FastAPI. For the full implementation of this example,
please see our `GitHub demo <https://github.com/katanemo/archgw/tree/main/demos/use_cases/orchestrating_agents>`_.
Example Use Cases
-----------------
Agent Routing and Hand Off is particularly beneficial in scenarios such as:
- **Customer Support**: Routing common customer queries to automated support agents, while escalating complex or sensitive issues to human support staff.
- **Sales and Marketing**: Automatically directing potential leads and sales inquiries to specialized sales agents for timely and targeted follow-ups.
- **Technical Assistance**: Managing user-reported issues, repairs, or refunds by assigning them to the correct technical or support agent efficiently.
Best Practices and Tips
------------------------
When implementing Agent Routing and Hand Off in your applications, consider these best practices:
- Clearly define agent responsibilities: Ensure each agent or human endpoint has a clear, specific description of the prompts they handle, reducing mis-routing.
- Monitor and optimize routes: Regularly review how prompts are routed to adjust and optimize agent definitions and configurations.
.. note::
To observe traffic to and from agents, please read more about :ref:`observability <observability>` in Arch.
By carefully configuring and managing your agent routing and hand off, you can significantly improve your application's responsiveness, performance, and overall user satisfaction.

View file

@ -3,7 +3,7 @@
Function Calling
================
**Function Calling** is a powerful feature in Arch that allows your application to dynamically execute backend functions or services based on user prompts.
**Function Calling** is a powerful feature in Plano that allows your application to dynamically execute backend functions or services based on user prompts.
This enables seamless integration between natural language interactions and backend operations, turning user inputs into actionable results.
@ -18,15 +18,15 @@ Function Calling Workflow
#. **Prompt Parsing**
When a user submits a prompt, Arch analyzes it to determine the intent. Based on this intent, the system identifies whether a function needs to be invoked and which parameters should be extracted.
When a user submits a prompt, Plano analyzes it to determine the intent. Based on this intent, the system identifies whether a function needs to be invoked and which parameters should be extracted.
#. **Parameter Extraction**
Arch's advanced natural language processing capabilities automatically extract parameters from the prompt that are necessary for executing the function. These parameters can include text, numbers, dates, locations, or other relevant data points.
Plano's advanced natural language processing capabilities automatically extract parameters from the prompt that are necessary for executing the function. These parameters can include text, numbers, dates, locations, or other relevant data points.
#. **Function Invocation**
Once the necessary parameters have been extracted, Arch invokes the relevant backend function. This function could be an API, a database query, or any other form of backend logic. The function is executed with the extracted parameters to produce the desired output.
Once the necessary parameters have been extracted, Plano invokes the relevant backend function. This function could be an API, a database query, or any other form of backend logic. The function is executed with the extracted parameters to produce the desired output.
#. **Response Handling**
@ -34,7 +34,7 @@ Function Calling Workflow
Arch-Function
-------------------------
-------------
The `Arch-Function <https://huggingface.co/collections/katanemo/arch-function-66f209a693ea8df14317ad68>`_ collection is a set of state-of-the-art (SOTA) large language models (LLMs) designed specifically for **function calling** tasks.
The models are designed to understand complex function signatures, identify required parameters, and produce accurate function call outputs based on natural language prompts.
Achieving performance on par with GPT-4, these models set a new benchmark in the domain of function-oriented tasks, making them suitable for scenarios where automated API interaction and function execution is crucial.
@ -64,11 +64,11 @@ Key Features
Implementing Function Calling
-----------------------------
Here's a step-by-step guide to configuring function calling within your Arch setup:
Here's a step-by-step guide to configuring function calling within your Plano setup:
Step 1: Define the Function
~~~~~~~~~~~~~~~~~~~~~~~~~~~
First, create or identify the backend function you want Arch to call. This could be an API endpoint, a script, or any other executable backend logic.
First, create or identify the backend function you want Plano to call. This could be an API endpoint, a script, or any other executable backend logic.
.. code-block:: python
@ -96,8 +96,8 @@ First, create or identify the backend function you want Arch to call. This could
Step 2: Configure Prompt Targets
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Next, map the function to a prompt target, defining the intent and parameters that Arch will extract from the user's prompt.
Specify the parameters your function needs and how Arch should interpret these.
Next, map the function to a prompt target, defining the intent and parameters that Plano will extract from the user's prompt.
Specify the parameters your function needs and how Plano should interpret these.
.. code-block:: yaml
:caption: Prompt Target Example Configuration
@ -121,22 +121,22 @@ Specify the parameters your function needs and how Arch should interpret these.
.. Note::
For a complete reference of attributes that you can configure in a prompt target, see :ref:`here <defining_prompt_target_parameters>`.
Step 3: Arch Takes Over
~~~~~~~~~~~~~~~~~~~~~~~
Once you have defined the functions and configured the prompt targets, Arch Gateway takes care of the remaining work.
Step 3: Plano Takes Over
~~~~~~~~~~~~~~~~~~~~~~~~
Once you have defined the functions and configured the prompt targets, Plano takes care of the remaining work.
It will automatically validate parameters, ensure that required parameters (e.g., location) are present in the prompt, and apply validation rules where necessary.
.. figure:: /_static/img/arch_network_diagram_high_level.png
.. figure:: /_static/img/plano_network_diagram_high_level.png
:width: 100%
:align: center
High-level network flow of where Arch Gateway sits in your agentic stack. Managing incoming and outgoing prompt traffic
High-level network flow of where Plano sits in your agentic stack. Managing incoming and outgoing prompt traffic
Once a downstream function (API) is called, Arch Gateway takes the response and sends it to an upstream LLM to complete the request (for summarization, Q/A, text generation tasks).
For more details on how Arch Gateway enables you to centralize usage of LLMs, please read :ref:`LLM providers <llm_providers>`.
Once a downstream function (API) is called, Plano takes the response and sends it to an upstream LLM to complete the request (for summarization, Q/A, text generation tasks).
For more details on how Plano enables you to centralize usage of LLMs, please read :ref:`LLM providers <llm_providers>`.
By completing these steps, you enable Arch to manage the process from validation to response, ensuring users receive consistent, reliable results - and that you are focused
By completing these steps, you enable Plano to manage the process from validation to response, ensuring users receive consistent, reliable results - and that you are focused
on the stuff that matters most.
Example Use Cases
@ -152,7 +152,7 @@ Here are some common use cases where Function Calling can be highly beneficial:
Best Practices and Tips
-----------------------
When integrating function calling into your generative AI applications, keep these tips in mind to get the most out of our Arch-Function models:
When integrating function calling into your generative AI applications, keep these tips in mind to get the most out of our Plano-Function models:
- **Keep it clear and simple**: Your function names and parameters should be straightforward and easy to understand. Think of it like explaining a task to a smart colleague - the clearer you are, the better the results.

View file

@ -16,12 +16,6 @@ llm_providers:
# default system prompt used by all prompt targets
system_prompt: You are a network assistant that just offers facts; not advice on manufacturers or purchasing decisions.

prompt_guards:
  input_guards:
    jailbreak:
      on_exception:
        message: Looks like you're curious about my abilities, but I can only provide assistance within my programmed parameters.

prompt_targets:
  - name: information_extraction
    default: true

View file

@ -3,130 +3,199 @@
LLM Routing
==============================================================
With the rapid proliferation of large language models (LLM) — each optimized for different strengths, style, or latency/cost profile — routing has become an essential technique to operationalize the use of different models.
Arch provides three distinct routing approaches to meet different use cases:
1. **Model-based Routing**: Direct routing to specific models using provider/model names
2. **Alias-based Routing**: Semantic routing using custom aliases that map to underlying models
3. **Preference-aligned Routing**: Intelligent routing using the Arch-Router model based on context and user-defined preferences
This enables optimal performance, cost efficiency, and response quality by matching requests with the most suitable model from your available LLM fleet.
With the rapid proliferation of large language models (LLMs) — each optimized for different strengths, style, or latency/cost profile — routing has become an essential technique to operationalize the use of different models. Plano provides three distinct routing approaches to meet different use cases: :ref:`Model-based routing <model_based_routing>`, :ref:`Alias-based routing <alias_based_routing>`, and :ref:`Preference-aligned routing <preference_aligned_routing>`. This enables optimal performance, cost efficiency, and response quality by matching requests with the most suitable model from your available LLM fleet.
.. note::
For details on supported model providers, configuration options, and client libraries, see :ref:`LLM Providers <llm_providers>`.
Routing Methods
---------------
Model-based Routing
.. _model_based_routing:
Model-based routing
~~~~~~~~~~~~~~~~~~~
Direct routing allows you to specify exact provider and model combinations using the format ``provider/model-name``:
- Use provider-specific names like ``openai/gpt-4o`` or ``anthropic/claude-3-5-sonnet-20241022``
- Use provider-specific names like ``openai/gpt-5.2`` or ``anthropic/claude-sonnet-4-5``
- Provides full control and transparency over which model handles each request
- Ideal for production workloads where you want predictable routing behavior
Alias-based Routing
Configuration
^^^^^^^^^^^^^
Configure your LLM providers with specific provider/model names:
.. code-block:: yaml
:caption: Model-based Routing Configuration
listeners:
  egress_traffic:
    address: 0.0.0.0
    port: 12000
    message_format: openai
    timeout: 30s

llm_providers:
  - model: openai/gpt-5.2
    access_key: $OPENAI_API_KEY
    default: true
  - model: openai/gpt-5
    access_key: $OPENAI_API_KEY
  - model: anthropic/claude-sonnet-4-5
    access_key: $ANTHROPIC_API_KEY
Client usage
^^^^^^^^^^^^
Clients specify exact models:
.. code-block:: python
# Direct provider/model specification
response = client.chat.completions.create(
    model="openai/gpt-5.2",
    messages=[{"role": "user", "content": "Hello!"}]
)

response = client.chat.completions.create(
    model="anthropic/claude-sonnet-4-5",
    messages=[{"role": "user", "content": "Write a story"}]
)
.. _alias_based_routing:
Alias-based routing
~~~~~~~~~~~~~~~~~~~
Alias-based routing lets you create semantic model names that decouple your application from specific providers:
- Use meaningful names like ``fast-model``, ``reasoning-model``, or ``arch.summarize.v1`` (see :ref:`model_aliases`)
- Use meaningful names like ``fast-model``, ``reasoning-model``, or ``plano.summarize.v1`` (see :ref:`model_aliases`)
- Maps semantic names to underlying provider models for easier experimentation and provider switching
- Ideal for applications that want abstraction from specific model names while maintaining control
Configuration
^^^^^^^^^^^^^
Configure semantic aliases that map to underlying models:
.. code-block:: yaml
:caption: Alias-based Routing Configuration
listeners:
  egress_traffic:
    address: 0.0.0.0
    port: 12000
    message_format: openai
    timeout: 30s

llm_providers:
  - model: openai/gpt-5.2
    access_key: $OPENAI_API_KEY
  - model: openai/gpt-5
    access_key: $OPENAI_API_KEY
  - model: anthropic/claude-sonnet-4-5
    access_key: $ANTHROPIC_API_KEY

model_aliases:
  # Model aliases - friendly names that map to actual provider names
  fast-model:
    target: gpt-5.2
  reasoning-model:
    target: gpt-5
  creative-model:
    target: claude-sonnet-4-5
Client usage
^^^^^^^^^^^^
Clients use semantic names:
.. code-block:: python
# Using semantic aliases
response = client.chat.completions.create(
    model="fast-model",  # Routes to best available fast model
    messages=[{"role": "user", "content": "Quick summary please"}]
)

response = client.chat.completions.create(
    model="reasoning-model",  # Routes to best reasoning model
    messages=[{"role": "user", "content": "Solve this complex problem"}]
)
.. _preference_aligned_routing:
Preference-aligned Routing (Arch-Router)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Preference-aligned routing (Arch-Router)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Traditional LLM routing approaches face significant limitations: they evaluate performance using benchmarks that often fail to capture human preferences, select from fixed model pools, and operate as "black boxes" without practical mechanisms for encoding user preferences.
Preference-aligned routing uses the `Arch-Router <https://huggingface.co/katanemo/Arch-Router-1.5B>`_ model to pick the best LLM based on domain, action, and your configured preferences instead of hard-coding a model.
Arch's preference-aligned routing addresses these challenges by applying a fundamental engineering principle: decoupling. The framework separates route selection (matching queries to human-readable policies) from model assignment (mapping policies to specific LLMs). This separation allows you to define routing policies using descriptive labels like ``Domain: 'finance', Action: 'analyze_earnings_report'`` rather than cryptic identifiers, while independently configuring which models handle each policy.
- **Domain**: High-level topic of the request (e.g., legal, healthcare, programming).
- **Action**: What the user wants to do (e.g., summarize, generate code, translate).
- **Routing preferences**: Your mapping from (domain, action) to preferred models.
The `Arch-Router <https://huggingface.co/katanemo/Arch-Router-1.5B>`_ model automatically selects the most appropriate LLM based on:
Arch-Router analyzes each prompt to infer domain and action, then applies your preferences to select a model. This decouples **routing policy** (how to choose) from **model assignment** (what to run), making routing transparent, controllable, and easy to extend as you add or swap models.
- Domain Analysis: Identifies the subject matter (e.g., legal, healthcare, programming)
- Action Classification: Determines the type of operation (e.g., summarization, code generation, translation)
- User-Defined Preferences: Maps domains and actions to preferred models using transparent, configurable routing decisions
- Human Preference Alignment: Uses domain-action mappings that capture subjective evaluation criteria, ensuring routing aligns with real-world user needs rather than just benchmark scores
Configuration
^^^^^^^^^^^^^
This approach supports seamlessly adding new models without retraining and is ideal for dynamic, context-aware routing that adapts to request content and intent.
To configure preference-aligned dynamic routing, define routing preferences that map domains and actions to specific models:
.. code-block:: yaml
:caption: Preference-Aligned Dynamic Routing Configuration
Model-based Routing Workflow
----------------------------
listeners:
  egress_traffic:
    address: 0.0.0.0
    port: 12000
    message_format: openai
    timeout: 30s
For direct model routing, the process is straightforward:
llm_providers:
  - model: openai/gpt-5.2
    access_key: $OPENAI_API_KEY
    default: true
#. **Client Request**
  - model: openai/gpt-5
    access_key: $OPENAI_API_KEY
    routing_preferences:
      - name: code understanding
        description: understand and explain existing code snippets, functions, or libraries
      - name: complex reasoning
        description: deep analysis, mathematical problem solving, and logical reasoning
The client specifies the exact model using provider/model format (``openai/gpt-4o``).
  - model: anthropic/claude-sonnet-4-5
    access_key: $ANTHROPIC_API_KEY
    routing_preferences:
      - name: creative writing
        description: creative content generation, storytelling, and writing assistance
      - name: code generation
        description: generating new code snippets, functions, or boilerplate based on user prompts
#. **Provider Validation**
Client usage
^^^^^^^^^^^^
Arch validates that the specified provider and model are configured and available.
Clients can let the router decide or still specify aliases:
#. **Direct Routing**
.. code-block:: python
The request is sent directly to the specified model without analysis or decision-making.
# Let Arch-Router choose based on content
response = client.chat.completions.create(
    messages=[{"role": "user", "content": "Write a creative story about space exploration"}]
    # No model specified - router will analyze and choose claude-sonnet-4-5
)
#. **Response Handling**
The response is returned to the client with optional metadata about the routing decision.
Alias-based Routing Workflow
-----------------------------
For alias-based routing, the process includes name resolution:
#. **Client Request**
The client specifies a semantic alias name (``reasoning-model``).
#. **Alias Resolution**
Arch resolves the alias to the actual provider/model name based on configuration.
#. **Model Selection**
If the alias maps to multiple models, Arch selects one based on availability and load balancing.
#. **Request Forwarding**
The request is forwarded to the resolved model.
#. **Response Handling**
The response is returned with optional metadata about the alias resolution.
.. _preference_aligned_routing_workflow:
Preference-aligned Routing Workflow (Arch-Router)
-------------------------------------------------
For preference-aligned dynamic routing, the process involves intelligent analysis:
#. **Prompt Analysis**
When a user submits a prompt without specifying a model, the Arch-Router analyzes it to determine the domain (subject matter) and action (type of operation requested).
#. **Model Selection**
Based on the analyzed intent and your configured routing preferences, the Router selects the most appropriate model from your available LLM fleet.
#. **Request Forwarding**
Once the optimal model is identified, our gateway forwards the original prompt to the selected LLM endpoint. The routing decision is transparent and can be logged for monitoring and optimization purposes.
#. **Response Handling**
After the selected model processes the request, the response is returned through the gateway. The gateway can optionally add routing metadata or performance metrics to help you understand and optimize your routing decisions.
Arch-Router
-------------------------
-----------
The `Arch-Router <https://huggingface.co/katanemo/Arch-Router-1.5B>`_ is a state-of-the-art **preference-based routing model** specifically designed to address the limitations of traditional LLM routing. This compact 1.5B model delivers production-ready performance with low latency and high accuracy while solving key routing challenges.
**Addressing Traditional Routing Limitations:**
@ -159,145 +228,6 @@ In summary, Arch-Router demonstrates:
- **Production-Ready Performance**: Optimized for low-latency, high-throughput applications in multi-model environments.
Implementing Routing
--------------------
**Model-based Routing**
For direct model routing, configure your LLM providers with specific provider/model names:
.. code-block:: yaml
:caption: Model-based Routing Configuration
listeners:
  egress_traffic:
    address: 0.0.0.0
    port: 12000
    message_format: openai
    timeout: 30s

llm_providers:
  - model: openai/gpt-4o-mini
    access_key: $OPENAI_API_KEY
    default: true
  - model: openai/gpt-4o
    access_key: $OPENAI_API_KEY
  - model: anthropic/claude-3-5-sonnet-20241022
    access_key: $ANTHROPIC_API_KEY
Clients specify exact models:
.. code-block:: python
# Direct provider/model specification
response = client.chat.completions.create(
    model="openai/gpt-4o-mini",
    messages=[{"role": "user", "content": "Hello!"}]
)

response = client.chat.completions.create(
    model="anthropic/claude-3-5-sonnet-20241022",
    messages=[{"role": "user", "content": "Write a story"}]
)
**Alias-based Routing**
Configure semantic aliases that map to underlying models:
.. code-block:: yaml
:caption: Alias-based Routing Configuration
listeners:
  egress_traffic:
    address: 0.0.0.0
    port: 12000
    message_format: openai
    timeout: 30s

llm_providers:
  - model: openai/gpt-4o-mini
    access_key: $OPENAI_API_KEY
  - model: openai/gpt-4o
    access_key: $OPENAI_API_KEY
  - model: anthropic/claude-3-5-sonnet-20241022
    access_key: $ANTHROPIC_API_KEY

model_aliases:
  # Model aliases - friendly names that map to actual provider names
  fast-model:
    target: gpt-4o-mini
  reasoning-model:
    target: gpt-4o
  creative-model:
    target: claude-3-5-sonnet-20241022
Clients use semantic names:
.. code-block:: python
# Using semantic aliases
response = client.chat.completions.create(
    model="fast-model",  # Routes to best available fast model
    messages=[{"role": "user", "content": "Quick summary please"}]
)

response = client.chat.completions.create(
    model="reasoning-model",  # Routes to best reasoning model
    messages=[{"role": "user", "content": "Solve this complex problem"}]
)
**Preference-aligned Routing (Arch-Router)**
To configure preference-aligned dynamic routing, you need to define routing preferences that map domains and actions to specific models:
.. code-block:: yaml
:caption: Preference-Aligned Dynamic Routing Configuration
listeners:
  egress_traffic:
    address: 0.0.0.0
    port: 12000
    message_format: openai
    timeout: 30s

llm_providers:
  - model: openai/gpt-4o-mini
    access_key: $OPENAI_API_KEY
    default: true
  - model: openai/gpt-4o
    access_key: $OPENAI_API_KEY
    routing_preferences:
      - name: code understanding
        description: understand and explain existing code snippets, functions, or libraries
      - name: complex reasoning
        description: deep analysis, mathematical problem solving, and logical reasoning
  - model: anthropic/claude-3-5-sonnet-20241022
    access_key: $ANTHROPIC_API_KEY
    routing_preferences:
      - name: creative writing
        description: creative content generation, storytelling, and writing assistance
      - name: code generation
        description: generating new code snippets, functions, or boilerplate based on user prompts
Clients can let the router decide or use aliases:
.. code-block:: python
# Let Arch-Router choose based on content
response = client.chat.completions.create(
    messages=[{"role": "user", "content": "Write a creative story about space exploration"}]
    # No model specified - router will analyze and choose claude-3-5-sonnet-20241022
)
Combining Routing Methods
-------------------------
@ -307,17 +237,17 @@ You can combine static model selection with dynamic routing preferences for maxi
:caption: Hybrid Routing Configuration
llm_providers:
  - model: openai/gpt-4o-mini
  - model: openai/gpt-5.2
    access_key: $OPENAI_API_KEY
    default: true
  - model: openai/gpt-4o
  - model: openai/gpt-5
    access_key: $OPENAI_API_KEY
    routing_preferences:
      - name: complex_reasoning
        description: deep analysis and complex problem solving
  - model: anthropic/claude-3-5-sonnet-20241022
  - model: anthropic/claude-sonnet-4-5
    access_key: $ANTHROPIC_API_KEY
    routing_preferences:
      - name: creative_tasks
@ -326,14 +256,14 @@ You can combine static model selection with dynamic routing preferences for maxi
model_aliases:
  # Model aliases - friendly names that map to actual provider names
  fast-model:
    target: gpt-4o-mini
    target: gpt-5.2
  reasoning-model:
    target: gpt-4o
    target: gpt-5
  # Aliases that can also participate in dynamic routing
  creative-model:
    target: claude-3-5-sonnet-20241022
    target: claude-sonnet-4-5
This configuration allows clients to:
@ -341,7 +271,7 @@ This configuration allows clients to:
2. **Let the router decide**: No model specified, router analyzes content
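
As a sketch, here are both patterns against the hybrid configuration above (client setup as in the earlier examples):

.. code-block:: python

    # 1. Explicit alias - bypasses preference-based routing
    response = client.chat.completions.create(
        model="fast-model",
        messages=[{"role": "user", "content": "Give me a quick summary"}]
    )

    # 2. Let the router decide - no model specified, Arch-Router applies
    #    the configured routing_preferences (e.g., creative_tasks)
    response = client.chat.completions.create(
        messages=[{"role": "user", "content": "Write a short story about a lighthouse"}]
    )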
Example Use Cases
-------------------------
-----------------
Here are common scenarios where Arch-Router excels:
- **Coding Tasks**: Distinguish between code generation requests ("write a Python function"), debugging needs ("fix this error"), and code optimization ("make this faster"), routing each to appropriately specialized models.
@ -352,9 +282,8 @@ Here are common scenarios where Arch-Router excels:
- **Conversational Routing**: Track conversation context to identify when topics shift between domains or when the type of assistance needed changes mid-conversation.
Best practicesm
-------------------------
Best practices
--------------
- **💡Consistent Naming:** Route names should align with their descriptions.
- ❌ Bad:
@ -379,18 +308,15 @@ Best practicesm
- **💡Noun Descriptors:** Preference-based routers perform better with noun-centric descriptors, as they offer more stable and semantically rich signals for matching.
- **💡Domain Inclusion:** for best user experience, you should always include domain route. This help the router fall back to domain when action is not
- **💡Domain Inclusion:** for best user experience, you should always include a domain route. This helps the router fall back to domain when action is not confidently inferred.
.. Unsupported Features
.. -------------------------
Unsupported Features
--------------------
.. The following features are **not supported** by the Arch-Router model:
The following features are **not supported** by the Arch-Router model:
.. - **❌ Multi-Modality:**
.. The model is not trained to process raw image or audio inputs. While it can handle textual queries *about* these modalities (e.g., "generate an image of a cat"), it cannot interpret encoded multimedia data directly.
- **Multi-modality**: The model is not trained to process raw image or audio inputs. It can handle textual queries *about* these modalities (e.g., "generate an image of a cat"), but cannot interpret encoded multimedia data directly.
.. - **❌ Function Calling:**
.. This model is designed for **semantic preference matching**, not exact intent classification or tool execution. For structured function invocation, use models in the **Arch-Function-Calling** collection.
- **Function calling**: Arch-Router is designed for **semantic preference matching**, not exact intent classification or tool execution. For structured function invocation, use models in the Plano Function Calling collection instead.
.. - **❌ System Prompt Dependency:**
.. Arch-Router routes based solely on the users conversation history. It does not use or rely on system prompts for routing decisions.
- **System prompt dependency**: Arch-Router routes based solely on the user's conversation history. It does not use or rely on system prompts for routing decisions.

View file

@ -3,14 +3,14 @@
Access Logging
==============
Access logging in Arch refers to the logging of detailed information about each request and response that flows through Arch.
It provides visibility into the traffic passing through Arch, which is crucial for monitoring, debugging, and analyzing the
Access logging in Plano refers to the logging of detailed information about each request and response that flows through Plano.
It provides visibility into the traffic passing through Plano, which is crucial for monitoring, debugging, and analyzing the
behavior of AI applications and their interactions.
Key Features
^^^^^^^^^^^^
* **Per-Request Logging**:
Each request that passes through Arch is logged. This includes important metadata such as HTTP method,
Each request that passes through Plano is logged. This includes important metadata such as HTTP method,
path, response status code, request duration, upstream host, and more.
* **Integration with Monitoring Tools**:
Access logs can be exported to centralized logging systems (e.g., ELK stack or Fluentd) or used to feed monitoring and alerting systems.
@ -19,24 +19,24 @@ Key Features
How It Works
^^^^^^^^^^^^
Arch gateway exposes access logs for every call it manages on your behalf. By default, these access logs can be found under ``~/archgw_logs``. For example:
Plano exposes access logs for every call it manages on your behalf. By default, these access logs can be found under ``~/plano_logs``. For example:
.. code-block:: console
$ tail -F ~/archgw_logs/access_*.log
$ tail -F ~/plano_logs/access_*.log
==> /Users/adilhafeez/archgw_logs/access_llm.log <==
==> /Users/username/plano_logs/access_llm.log <==
[2024-10-10T03:55:49.537Z] "POST /v1/chat/completions HTTP/1.1" 0 DC 0 0 770 - "-" "OpenAI/Python 1.51.0" "469793af-b25f-9b57-b265-f376e8d8c586" "api.openai.com" "162.159.140.245:443"
==> /Users/adilhafeez/archgw_logs/access_internal.log <==
==> /Users/username/plano_logs/access_internal.log <==
[2024-10-10T03:56:03.906Z] "POST /embeddings HTTP/1.1" 200 - 52 21797 54 53 "-" "-" "604197fe-2a5b-95a2-9367-1d6b30cfc845" "model_server" "192.168.65.254:51000"
[2024-10-10T03:56:03.961Z] "POST /zeroshot HTTP/1.1" 200 - 106 218 87 87 "-" "-" "604197fe-2a5b-95a2-9367-1d6b30cfc845" "model_server" "192.168.65.254:51000"
[2024-10-10T03:56:04.050Z] "POST /v1/chat/completions HTTP/1.1" 200 - 1301 614 441 441 "-" "-" "604197fe-2a5b-95a2-9367-1d6b30cfc845" "model_server" "192.168.65.254:51000"
[2024-10-10T03:56:04.492Z] "POST /hallucination HTTP/1.1" 200 - 556 127 104 104 "-" "-" "604197fe-2a5b-95a2-9367-1d6b30cfc845" "model_server" "192.168.65.254:51000"
[2024-10-10T03:56:04.598Z] "POST /insurance_claim_details HTTP/1.1" 200 - 447 125 17 17 "-" "-" "604197fe-2a5b-95a2-9367-1d6b30cfc845" "api_server" "192.168.65.254:18083"
==> /Users/adilhafeez/archgw_logs/access_ingress.log <==
[2024-10-10T03:56:03.905Z] "POST /v1/chat/completions HTTP/1.1" 200 - 463 1022 1695 984 "-" "OpenAI/Python 1.51.0" "604197fe-2a5b-95a2-9367-1d6b30cfc845" "arch_llm_listener" "0.0.0.0:12000"
==> /Users/username/plano_logs/access_ingress.log <==
[2024-10-10T03:56:03.905Z] "POST /v1/chat/completions HTTP/1.1" 200 - 463 1022 1695 984 "-" "OpenAI/Python 1.51.0" "604197fe-2a5b-95a2-9367-1d6b30cfc845" "plano_llm_listener" "0.0.0.0:12000"
Log Format
@ -58,6 +58,6 @@ For example for following request:
.. code-block:: console
[2024-10-10T03:56:03.905Z] "POST /v1/chat/completions HTTP/1.1" 200 - 463 1022 1695 984 "-" "OpenAI/Python 1.51.0" "604197fe-2a5b-95a2-9367-1d6b30cfc845" "arch_llm_listener" "0.0.0.0:12000"
[2024-10-10T03:56:03.905Z] "POST /v1/chat/completions HTTP/1.1" 200 - 463 1022 1695 984 "-" "OpenAI/Python 1.51.0" "604197fe-2a5b-95a2-9367-1d6b30cfc845" "plano_llm_listener" "0.0.0.0:12000"
Total duration was 1695ms, and the upstream service took 984ms to process the request. Bytes received and sent were 463 and 1022 respectively.

View file

@ -8,11 +8,11 @@ and instrumentation for generating, collecting, processing, and exporting teleme
metrics, and logs. Its flexible design supports a wide range of backends and seamlessly integrates with
modern application tools.
Arch acts as a *source* for several monitoring metrics related to **prompts** and **LLMs** natively integrated
Plano acts as a *source* for several monitoring metrics related to **agents** and **LLMs** natively integrated
via `OpenTelemetry <https://opentelemetry.io/>`_ to help you understand three critical aspects of your application:
latency, token usage, and error rates by an upstream LLM provider. Latency measures the speed at which your application
is responding to users, which includes metrics like time to first token (TFT), time per output token (TOT) metrics, and
the total latency as perceived by users. Below are some screenshots of how Arch integrates natively with tools like
the total latency as perceived by users. Below are some screenshots of how Plano integrates natively with tools like
`Grafana <https://grafana.com/grafana/dashboards/>`_ via `Prometheus <https://prometheus.io/>`_.
@ -32,7 +32,7 @@ Metrics Dashboard (via Grafana)
Configure Monitoring
~~~~~~~~~~~~~~~~~~~~
Arch gateway publishes a stats endpoint at http://localhost:19901/stats. As noted above, Arch is a source for metrics. To view and manipulate dashboards, you will
Plano publishes a stats endpoint at http://localhost:19901/stats. As noted above, Plano is a source for metrics. To view and manipulate dashboards, you will
need to configure `Prometheus <https://prometheus.io/>`_ (as a metrics store) and `Grafana <https://grafana.com/grafana/dashboards/>`_ for dashboards. Below
are some sample configuration files for both, respectively.
@ -51,7 +51,7 @@ are some sample configuration files for both, respectively.
timeout: 10s
api_version: v2
scrape_configs:
  - job_name: archgw
  - job_name: plano
    honor_timestamps: true
    scrape_interval: 15s
    scrape_timeout: 10s

View file

@ -17,9 +17,9 @@ requests in an AI application. With tracing, you can capture a detailed view of
through various services and components, which is crucial for **debugging**, **performance optimization**,
and understanding complex AI agent architectures like Co-pilots.
**Arch** propagates trace context using the W3C Trace Context standard, specifically through the
**Plano** propagates trace context using the W3C Trace Context standard, specifically through the
``traceparent`` header. This allows each component in the system to record its part of the request
flow, enabling **end-to-end tracing** across the entire application. By using OpenTelemetry, Arch ensures
flow, enabling **end-to-end tracing** across the entire application. By using OpenTelemetry, Plano ensures
that developers can capture this trace data consistently and in a format compatible with various observability
tools.
@ -41,9 +41,9 @@ Benefits of Using ``Traceparent`` Headers
How to Initiate A Trace
-----------------------
1. **Enable Tracing Configuration**: Simply set ``random_sampling`` to ``100`` in the ``tracing`` section of the :ref:`listener <arch_overview_listeners>` config
1. **Enable Tracing Configuration**: Simply set ``random_sampling`` to ``100`` in the ``tracing`` section of the :ref:`listener <plano_overview_listeners>` config (see the sketch after this list)
2. **Trace Context Propagation**: Arch automatically propagates the ``traceparent`` header. When a request is received, Arch will:
2. **Trace Context Propagation**: Plano automatically propagates the ``traceparent`` header. When a request is received, Plano will:
- Generate a new ``traceparent`` header if one is not present.
- Extract the trace context from the ``traceparent`` header if it exists.
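
A minimal sketch of step 1, assuming the egress listener block used elsewhere in these docs (the exact placement of the ``tracing`` section may differ; check the listener reference):

.. code-block:: yaml

    listeners:
      egress_traffic:
        address: 0.0.0.0
        port: 12000
        message_format: openai
        timeout: 30s
        tracing:
          random_sampling: 100  # sample 100% of requests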
@ -57,7 +57,7 @@ How to Initiate A Trace
Trace Propagation
-----------------
Arch uses the W3C Trace Context standard for trace propagation, which relies on the ``traceparent`` header.
Plano uses the W3C Trace Context standard for trace propagation, which relies on the ``traceparent`` header.
This header carries tracing information in a standardized format, enabling interoperability between different
tracing systems.
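
A ``traceparent`` value encodes ``version-trace_id-parent_id-trace_flags``; the canonical W3C example looks like this:

.. code-block:: console

    traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01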
@ -77,7 +77,7 @@ Instrumentation
~~~~~~~~~~~~~~~
To integrate AI tracing, your application needs to follow a few simple steps. The steps
below are very common practice, and not unique to Arch, when reading tracing headers and exporting
below are very common practice, and not unique to Plano, when reading tracing headers and exporting
`spans <https://docs.lightstep.com/docs/understand-distributed-tracing>`_ for distributed tracing.
- Read the ``traceparent`` header from incoming requests.
@ -148,66 +148,6 @@ Handle incoming requests:
print(f"Payment service response: {response.content}")
AI Agent Tracing Visualization Example
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The following is an example of tracing for an AI-powered customer support system.
A customer interacts with AI agents, which forward their requests through different
specialized services and external systems.
::
+--------------------------+
| Customer Interaction |
+--------------------------+
|
v
+--------------------------+ +--------------------------+
| Agent 1 (Main - Arch) | ----> | External Payment Service |
+--------------------------+ +--------------------------+
| |
v v
+--------------------------+ +--------------------------+
| Agent 2 (Support - Arch)| ----> | Internal Tech Support |
+--------------------------+ +--------------------------+
| |
v v
+--------------------------+ +--------------------------+
| Agent 3 (Orders- Arch) | ----> | Inventory Management |
+--------------------------+ +--------------------------+
Trace Breakdown:
****************
- Customer Interaction:
- Span 1: Customer initiates a request via the AI-powered chatbot for billing support (e.g., asking for payment details).
- AI Agent 1 (Main - Arch):
- Span 2: AI Agent 1 (Main) processes the request and identifies it as related to billing, forwarding the request
to an external payment service.
- Span 3: AI Agent 1 determines that additional technical support is needed for processing and forwards the request
to AI Agent 2.
- External Payment Service:
- Span 4: The external payment service processes the payment-related request (e.g., verifying payment status) and sends
the response back to AI Agent 1.
- AI Agent 2 (Tech - Arch):
- Span 5: AI Agent 2, responsible for technical queries, processes a request forwarded from AI Agent 1 (e.g., checking for
any account issues).
- Span 6: AI Agent 2 forwards the query to Internal Tech Support for further investigation.
- Internal Tech Support:
- Span 7: Internal Tech Support processes the request (e.g., resolving account access issues) and responds to AI Agent 2.
- AI Agent 3 (Orders - Arch):
- Span 8: AI Agent 3 handles order-related queries. AI Agent 1 forwards the request to AI Agent 3 after payment verification
is completed.
- Span 9: AI Agent 3 forwards a request to the Inventory Management system to confirm product availability for a pending order.
- Inventory Management:
- Span 10: The Inventory Management system checks stock and availability and returns the information to AI Agent 3.
Integrating with Tracing Tools
------------------------------
@ -292,11 +232,11 @@ To send tracing data to `Datadog <https://docs.datadoghq.com/getting_started/tra
Langtrace
~~~~~~~~~
Langtrace is an observability tool designed specifically for large language models (LLMs). It helps you capture, analyze, and understand how LLMs are used in your applications, including those built using Arch.
Langtrace is an observability tool designed specifically for large language models (LLMs). It helps you capture, analyze, and understand how LLMs are used in your applications, including those built using Plano.
To send tracing data to `Langtrace <https://docs.langtrace.ai/supported-integrations/llm-tools/arch>`_:
1. **Configure Arch**: Make sure Arch is installed and set up correctly. For more information, refer to the `installation guide <https://github.com/katanemo/archgw?tab=readme-ov-file#prerequisites>`_.
1. **Configure Plano**: Make sure Plano is installed and set up correctly. For more information, refer to the `installation guide <https://github.com/katanemo/archgw?tab=readme-ov-file#prerequisites>`_.
2. **Install Langtrace**: Install the Langtrace SDK:
@ -348,7 +288,7 @@ Best Practices
Summary
----------
By leveraging the ``traceparent`` header for trace context propagation, Arch enables developers to implement
By leveraging the ``traceparent`` header for trace context propagation, Plano enables developers to implement
tracing efficiently. This approach simplifies the process of collecting and analyzing tracing data in common
tools like AWS X-Ray and Datadog, enhancing observability and facilitating faster debugging and optimization.

View file

@ -0,0 +1,350 @@
.. _agent_routing:
Orchestration
==============
Building multi-agent systems allows you to route requests across multiple specialized agents, each designed to handle specific types of tasks.
Plano makes it easy to build and scale these systems by managing the orchestration layer—deciding which agent(s) should handle each request—while you focus on implementing individual agent logic.
This guide shows you how to configure and implement multi-agent orchestration in Plano using a real-world example: a **Travel Booking Assistant** that routes queries to specialized agents for weather and flights.
How It Works
------------
Plano's orchestration layer analyzes incoming prompts and routes them to the most appropriate agent based on user intent and conversation context. The workflow is:
1. **User submits a prompt**: The request arrives at Plano's agent listener.
2. **Agent selection**: Plano uses an LLM to analyze the prompt and determine user intent and complexity. By default, this uses `Plano-Orchestrator-30B-A3B <https://huggingface.co/collections/katanemo/plano-orchestrator>`_, which offers the performance of foundation models at 1/10th the cost. The LLM routes the request to the most suitable agent configured in your system—such as a weather agent or flight agent.
3. **Agent handles request**: Once the selected agent receives the request object from Plano, it manages its own :ref:`inner loop <agents>` until the task is complete. This means the agent autonomously calls models, invokes tools, processes data, and reasons about next steps—all within its specialized domain—before returning the final response.
4. **Seamless handoffs**: For multi-turn conversations, Plano repeats the intent analysis for each follow-up query, enabling smooth handoffs between agents as the conversation evolves.
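
For example, a client can send a standard OpenAI-style request to the agent listener and let Plano pick the agent. The listener address below is an assumption borrowed from the other listener examples in these docs; use the address and port from your ``plano_config.yaml``:

.. code-block:: python

    from openai import OpenAI

    # Address/port of Plano's agent listener (adjust to your configuration)
    client = OpenAI(base_url="http://localhost:12000/v1", api_key="EMPTY")

    # No model or agent specified - Plano analyzes intent and routes the request
    response = client.chat.completions.create(
        messages=[{"role": "user", "content": "What's the weather in Paris, and are there flights from JFK?"}]
    )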
Example: Travel Booking Assistant
----------------------------------
Let's walk through a complete multi-agent system: a Travel Booking Assistant that helps users plan trips by providing weather forecasts and flight information. This system uses two specialized agents:
* **Weather Agent**: Provides real-time weather conditions and multi-day forecasts
* **Flight Agent**: Searches for flights between airports with real-time tracking
Configuration
-------------
Configure your agents in the ``listeners`` section of your ``plano_config.yaml``:
.. literalinclude:: ../resources/includes/agents/agents_config.yaml
:language: yaml
:linenos:
:caption: Travel Booking Multi-Agent Configuration
**Key Configuration Elements:**
* **agent listener**: A listener of ``type: agent`` tells Plano to perform intent analysis and routing for incoming requests.
* **agents list**: Define each agent with an ``id`` and a ``description`` (the description is used for routing decisions)
* **router**: The ``plano_orchestrator_v1`` router uses Plano-Orchestrator to analyze user intent and select the appropriate agent.
* **filter_chain**: Optionally attach :ref:`filter chains <filter_chain>` to agents for guardrails, query rewriting, or context enrichment.
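
Tying these elements together, a trimmed-down sketch of such a configuration might look like this (the exact schema is in the full configuration file above; the listener name and nesting here are illustrative):

.. code-block:: yaml

    listeners:
      ingress_traffic:                 # illustrative listener name
        type: agent                    # enables intent analysis and agent routing
        router: plano_orchestrator_v1  # uses Plano-Orchestrator for agent selection
        agents:
          - id: weather_agent
            description: Get real-time weather conditions and forecasts for any city
          - id: flight_agent
            description: Search for flights between airports with real-time tracking
            filter_chain:              # optional guardrails / context enrichment
              - query_rewriter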
**Writing Effective Agent Descriptions**
Agent descriptions are critical—they're used by Plano-Orchestrator to make routing decisions. Effective descriptions should include:
* **Clear introduction**: A concise statement explaining what the agent is and its primary purpose
* **Capabilities section**: A bulleted list of specific capabilities, including:
* What APIs or data sources it uses (e.g., "Open-Meteo API", "FlightAware AeroAPI")
* What information it provides (e.g., "current temperature", "multi-day forecasts", "gate information")
* How it handles context (e.g., "Understands conversation context to resolve location references")
* What question patterns it handles (e.g., "What's the weather in [city]?")
* How it handles multi-part queries (e.g., "When queries include both weather and flights, this agent answers ONLY the weather part")
Here's an example of a well-structured agent description:
.. code-block:: yaml
- id: weather_agent
  description: |
    WeatherAgent is a specialized AI assistant for real-time weather information
    and forecasts. It provides accurate weather data for any city worldwide using
    the Open-Meteo API, helping travelers plan their trips with up-to-date weather
    conditions.

    Capabilities:
    * Get real-time weather conditions and multi-day forecasts for any city worldwide
    * Provides current temperature, weather conditions, sunrise/sunset times
    * Provides detailed weather information including multi-day forecasts
    * Understands conversation context to resolve location references from previous messages
    * Handles weather-related questions including "What's the weather in [city]?"
    * When queries include both weather and other travel questions (e.g., flights),
      this agent answers ONLY the weather part
.. note::
We will soon support "Agents as Tools" via Model Context Protocol (MCP), enabling agents to dynamically discover and invoke other agents as tools. Track progress on `GitHub Issue #646 <https://github.com/katanemo/archgw/issues/646>`_.
Implementation
--------------
Agents are HTTP services that receive routed requests from Plano. Each agent implements the OpenAI Chat Completions API format, making them compatible with standard LLM clients.
Agent Structure
^^^^^^^^^^^^^^^
Let's examine the Weather Agent implementation:
.. literalinclude:: ../resources/includes/agents/weather.py
:language: python
:linenos:
:lines: 262-283
:caption: Weather Agent - Core Structure
**Key Points:**
* Agents expose a ``/v1/chat/completions`` endpoint that matches OpenAI's API format
* They use Plano's LLM gateway (via ``LLM_GATEWAY_ENDPOINT``) for all LLM calls
* They receive the full conversation history in ``request_body.messages``
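
For orientation, here is a minimal sketch of the general shape of such an agent endpoint (the handler body and model choice are illustrative, not the demo's actual code):

.. code-block:: python

    import os

    from fastapi import FastAPI, Request
    from openai import AsyncOpenAI

    app = FastAPI()

    # All LLM calls are routed through Plano's gateway
    LLM_GATEWAY_ENDPOINT = os.getenv("LLM_GATEWAY_ENDPOINT", "http://localhost:12000/v1")
    openai_client_via_plano = AsyncOpenAI(base_url=LLM_GATEWAY_ENDPOINT, api_key="EMPTY")

    @app.post("/v1/chat/completions")
    async def chat_completions(request: Request):
        request_body = await request.json()
        messages = request_body["messages"]  # full conversation history from Plano
        # ... agent-specific extraction, external API calls, response generation ...
        response = await openai_client_via_plano.chat.completions.create(
            model="openai/gpt-4o",
            messages=messages,
        )
        return response.model_dump()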
Information Extraction with LLMs
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Agents use LLMs to extract structured information from natural language queries. This enables them to understand user intent and extract parameters needed for API calls.
The Weather Agent extracts location information:
.. literalinclude:: ../resources/includes/agents/weather.py
:language: python
:linenos:
:lines: 73-119
:caption: Weather Agent - Location Extraction
The Flight Agent extracts more complex information—origin, destination, and dates:
.. literalinclude:: ../resources/includes/agents/flights.py
:language: python
:linenos:
:lines: 69-120
:caption: Flight Agent - Flight Information Extraction
**Key Points:**
* Use smaller, faster models (like ``gpt-4o-mini``) for extraction tasks
* Include conversation context to handle follow-up questions and pronouns
* Use structured prompts with clear output formats (JSON)
* Handle edge cases with fallback values
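
A minimal sketch of this extraction pattern, reusing the gateway client from the endpoint sketch above (the prompt wording and fallback value are illustrative):

.. code-block:: python

    async def extract_location(messages: list) -> str:
        extraction_prompt = (
            "Extract the city the user is asking about from the conversation. "
            "Reply with only the city name, or NOT_FOUND if there is none."
        )
        # A small, fast model is sufficient for extraction
        response = await openai_client_via_plano.chat.completions.create(
            model="openai/gpt-4o-mini",
            messages=[{"role": "system", "content": extraction_prompt}] + messages[-10:],
        )
        location = response.choices[0].message.content.strip()
        # Fall back to a default when extraction fails
        return "New York" if not location or location.upper() == "NOT_FOUND" else location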
Calling External APIs
^^^^^^^^^^^^^^^^^^^^^^
After extracting information, agents call external APIs to fetch real-time data:
.. literalinclude:: ../resources/includes/agents/weather.py
:language: python
:linenos:
:lines: 136-197
:caption: Weather Agent - External API Call
The Flight Agent calls FlightAware's AeroAPI:
.. literalinclude:: ../resources/includes/agents/flights.py
:language: python
:linenos:
:lines: 156-260
:caption: Flight Agent - External API Call
**Key Points:**
* Use async HTTP clients (like ``httpx.AsyncClient``) for non-blocking API calls
* Transform external API responses into consistent, structured formats
* Handle errors gracefully with fallback values
* Cache or validate data when appropriate (e.g., airport code validation)
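
As a condensed sketch of a non-blocking external call (the endpoint and fields follow Open-Meteo's public forecast API; verify them against Open-Meteo's documentation):

.. code-block:: python

    import httpx

    async def fetch_current_weather(latitude: float, longitude: float) -> dict:
        params = {"latitude": latitude, "longitude": longitude, "current_weather": True}
        try:
            async with httpx.AsyncClient(timeout=10.0) as http_client:
                resp = await http_client.get("https://api.open-meteo.com/v1/forecast", params=params)
                resp.raise_for_status()
                current = resp.json()["current_weather"]
            # Normalize the external response into a consistent structure
            return {"temperature_c": current["temperature"], "windspeed_kmh": current["windspeed"]}
        except (httpx.HTTPError, KeyError) as e:
            # Graceful fallback instead of crashing the agent
            return {"error": f"Could not retrieve weather data: {e}"}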
Preparing Context and Generating Responses
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Agents combine extracted information, API data, and conversation history to generate responses:
.. literalinclude:: ../resources/includes/agents/weather.py
:language: python
:linenos:
:lines: 290-370
:caption: Weather Agent - Context Preparation and Response Generation
**Key Points:**
* Use system messages to provide structured data to the LLM
* Include full conversation history for context-aware responses
* Stream responses for better user experience
* Route all LLM calls through Plano's gateway for consistent behavior and observability
Best Practices
--------------
**Write Clear Agent Descriptions**
Agent descriptions are used by Plano-Orchestrator to make routing decisions. Be specific about what each agent handles:
.. code-block:: yaml
# Good - specific and actionable
- id: flight_agent
  description: Get live flight information between airports using FlightAware AeroAPI. Shows real-time flight status, scheduled/estimated/actual departure and arrival times, gate and terminal information, delays, aircraft type, and flight status. Automatically resolves city names to airport codes (IATA/ICAO). Understands conversation context to infer origin/destination from follow-up questions.

# Less ideal - too vague
- id: flight_agent
  description: Handles flight queries
**Use Conversation Context Effectively**
Include conversation history in your extraction and response generation:
.. code-block:: python
# Include conversation context for extraction
conversation_context = []
for msg in messages:
    conversation_context.append({"role": msg.role, "content": msg.content})

# Use recent context (last 10 messages)
context_messages = conversation_context[-10:] if len(conversation_context) > 10 else conversation_context
**Route LLM Calls Through Plano's Model Proxy**
Always route LLM calls through Plano's :ref:`Model Proxy <llm_providers>` for consistent responses, smart routing, and rich observability:
.. code-block:: python
openai_client_via_plano = AsyncOpenAI(
    base_url=LLM_GATEWAY_ENDPOINT,  # Plano's LLM gateway
    api_key="EMPTY",
)

response = await openai_client_via_plano.chat.completions.create(
    model="openai/gpt-4o",
    messages=messages,
    stream=True,
)
**Handle Errors Gracefully**
Provide fallback values and clear error messages:
.. code-block:: python
async def get_weather_data(request: Request, messages: list, days: int = 1):
    try:
        # ... extraction and API logic ...
        location = response.choices[0].message.content.strip().strip("\"'`.,!?")
        if not location or location.upper() == "NOT_FOUND":
            location = "New York"  # Fallback to default
        return weather_data
    except Exception as e:
        logger.error(f"Error getting weather data: {e}")
        return {"location": "New York", "weather": {"error": "Could not retrieve weather data"}}
**Use Appropriate Models for Tasks**
Use smaller, faster models for extraction tasks and larger models for final responses:
.. code-block:: python
# Extraction: Use smaller, faster model
LOCATION_MODEL = "openai/gpt-4o-mini"
# Final response: Use larger, more capable model
WEATHER_MODEL = "openai/gpt-4o"
**Stream Responses**
Stream responses for better user experience:
.. code-block:: python
async def invoke_weather_agent(request: Request, request_body: dict, traceparent_header: str = None):
    # ... prepare messages with weather data ...
    stream = await openai_client_via_plano.chat.completions.create(
        model=WEATHER_MODEL,
        messages=response_messages,
        temperature=request_body.get("temperature", 0.7),
        max_tokens=request_body.get("max_tokens", 1000),
        stream=True,
        extra_headers=extra_headers,
    )

    async for chunk in stream:
        if chunk.choices:
            yield f"data: {chunk.model_dump_json()}\n\n"
    yield "data: [DONE]\n\n"
Common Use Cases
----------------
Multi-agent orchestration is particularly powerful for:
**Travel and Booking Systems**
Route queries to specialized agents for weather and flights:
.. code-block:: yaml
agents:
  - id: weather_agent
    description: Get real-time weather conditions and forecasts
  - id: flight_agent
    description: Search for flights and provide flight status
**Customer Support**
Route common queries to automated support agents while escalating complex issues:
.. code-block:: yaml
agents:
  - id: tier1_support
    description: Handles common FAQs, password resets, and basic troubleshooting
  - id: tier2_support
    description: Handles complex technical issues requiring deep product knowledge
  - id: human_escalation
    description: Escalates sensitive issues or unresolved problems to human agents
**Sales and Marketing**
Direct leads and inquiries to specialized sales agents:
.. code-block:: yaml
agents:
  - id: product_recommendation
    description: Recommends products based on user needs and preferences
  - id: pricing_agent
    description: Provides pricing information and quotes
  - id: sales_closer
    description: Handles final negotiations and closes deals
**Technical Documentation and Support**
Combine RAG agents for documentation lookup with specialized troubleshooting agents:
.. code-block:: yaml
agents:
  - id: docs_agent
    description: Retrieves relevant documentation and guides
    filter_chain:
      - query_rewriter
      - context_builder
  - id: troubleshoot_agent
    description: Diagnoses and resolves technical issues step by step
Next Steps
----------
* Learn more about :ref:`agents <agents>` and the inner vs. outer loop model
* Explore :ref:`filter chains <filter_chain>` for adding guardrails and context enrichment
* See :ref:`observability <observability>` for monitoring multi-agent workflows
* Review the :ref:`LLM Providers <llm_providers>` guide for model routing within agents
* Check out the complete `Travel Booking demo <https://github.com/katanemo/plano/tree/main/demos/use_cases/travel_booking>`_ on GitHub
.. note::
To observe traffic to and from agents, please read more about :ref:`observability <observability>` in Plano.
By carefully configuring routing and hand off between your agents, you can significantly improve your application's responsiveness, performance, and overall user satisfaction.
View file
@ -1,66 +1,118 @@
.. _prompt_guard:
Guardrails
==========
**Guardrails** are Plano's way of applying safety and validation checks to prompts before they reach your application logic. They are typically implemented as
filters in a :ref:`Filter Chain <filter_chain>` attached to an agent, so every request passes through a consistent processing layer.
Why Guardrails
--------------
Guardrails are essential for maintaining control over AI-driven applications. They help enforce organizational policies, ensure compliance with regulations
(like GDPR or HIPAA), and protect users from harmful or inappropriate content. In applications where prompts generate responses or trigger actions, guardrails
minimize risks like malicious inputs, off-topic queries, or misaligned outputs—adding a consistent layer of input scrutiny that makes interactions safer,
more reliable, and easier to reason about.
- **Jailbreak Prevention**: Detect and filter inputs that attempt to change LLM behavior, expose system prompts, or bypass safety policies.
- **Domain and Topicality Enforcement**: Ensure that agents only respond to prompts within an approved domain (for example, finance-only or healthcare-only use cases) and reject unrelated queries.
- **Dynamic Error Handling**: Provide clear error messages when requests violate policy, helping users correct their inputs.
How Guardrails Work
-------------------
Guardrails can be implemented as either in-process MCP filters or as HTTP-based filters. HTTP filters are external services that receive the request over HTTP, validate it, and return a response to allow or reject the request. This makes it easy to use filters written in any language or run them as independent services.
Each filter receives the chat messages, evaluates them against policy, and either lets the request continue or raises a ``ToolError`` (or returns an error response) to reject it with a helpful error message.
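For instance, a minimal HTTP filter might look like the following sketch (the endpoint path, request shape, and rejection format are illustrative assumptions, not a fixed Plano contract):
.. code-block:: python
    :caption: Hypothetical HTTP filter service

    from fastapi import FastAPI
    from fastapi.responses import JSONResponse
    from pydantic import BaseModel

    app = FastAPI()

    class FilterRequest(BaseModel):
        messages: list[dict]  # chat messages forwarded by the gateway

    @app.post("/")
    async def input_filter(req: FilterRequest):
        # Evaluate the latest user message against a toy policy
        last_user = next(
            (m.get("content", "") for m in reversed(req.messages) if m.get("role") == "user"),
            "",
        )
        if "forbidden" in last_user.lower():
            return JSONResponse(status_code=400, content={"error": "Request rejected by policy"})
        # Allow the request to continue, optionally with modified messages
        return {"messages": req.messages}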
The example below shows an input guard for TechCorp's customer support system that validates queries are within the company's domain:
.. code-block:: python
:caption: Example domain validation guard using FastMCP
    from typing import List
    from fastmcp.exceptions import ToolError
    from . import mcp
    # `ChatMessage` and `validate_with_llm` are assumed to be defined elsewhere in your app
@mcp.tool
async def input_guards(messages: List[ChatMessage]) -> List[ChatMessage]:
"""Validates queries are within TechCorp's domain."""
# Get the user's query
user_query = next(
(msg.content for msg in reversed(messages) if msg.role == "user"),
""
)
# Use an LLM to validate the query scope (simplified)
is_valid = await validate_with_llm(user_query)
if not is_valid:
raise ToolError(
"I can only assist with questions related to TechCorp and its services. "
"Please ask about TechCorp's products, pricing, SLAs, or technical support."
)
return messages
To wire this guardrail into Plano, define the filter and add it to your agent's filter chain:
.. code-block:: yaml
:caption: Plano configuration with input guard filter
filters:
- id: input_guards
url: http://localhost:10500
listeners:
- type: agent
name: agent_1
port: 8001
router: plano_orchestrator_v1
agents:
- id: rag_agent
description: virtual assistant for retrieval augmented generation tasks
filter_chain:
- input_guards
When a request arrives at ``agent_1``, Plano invokes the ``input_guards`` filter first. If validation passes, the request continues to
the agent. If validation fails (``ToolError`` raised), Plano returns an error response to the caller.
Testing the Guardrail
---------------------
Here's an example of the guardrail in action, rejecting a query about Apple Corporation (outside TechCorp's domain):
.. code-block:: bash
:caption: Request that violates the guardrail policy
curl -X POST http://localhost:8001/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-4",
"messages": [
{
"role": "user",
"content": "what is sla for apple corporation?"
}
],
"stream": false
}'
.. code-block:: json
:caption: Error response from the guardrail
{
"error": "ClientError",
"agent": "input_guards",
"status": 400,
"agent_response": "I apologize, but I can only assist with questions related to TechCorp and its services. Your query appears to be outside this scope. The query is about SLA for Apple Corporation, which is unrelated to TechCorp.\n\nPlease ask me about TechCorp's products, services, pricing, SLAs, or technical support."
}
This prevents out-of-scope queries from reaching your agent while providing clear feedback to users about why their request was rejected.
View file
@ -0,0 +1,255 @@
.. _managing_conversational_state:
Conversational State
=====================
The OpenAI Responses API (``v1/responses``) is designed for multi-turn conversations where context needs to persist across requests. Plano provides a unified ``v1/responses`` API that works with **any LLM provider**—OpenAI, Anthropic, Azure OpenAI, DeepSeek, or any OpenAI-compatible provider—while automatically managing conversational state for you.
Unlike the traditional Chat Completions API where you manually manage conversation history by including all previous messages in each request, Plano handles state management behind the scenes. This means you can use the Responses API with any model provider, and Plano will persist conversation context across requests—making it ideal for building conversational agents that remember context without bloating every request with full message history.
How It Works
------------
When a client calls the Responses API:
1. **First request**: Plano generates a unique response ID and stores the conversation state (messages, model, provider, timestamp).
2. **Subsequent requests**: The client includes the ``previous_response_id`` from the previous response. Plano retrieves the stored conversation state, merges it with the new input, and sends the combined context to the LLM.
3. **Response**: The LLM sees the full conversation history without the client needing to resend all previous messages.
This pattern dramatically reduces bandwidth and makes it easier to build multi-turn agents—Plano handles the state plumbing so you can focus on agent logic.
**Example Using OpenAI Python SDK:**
.. code-block:: python
from openai import OpenAI
# Point to Plano's Model Proxy endpoint
client = OpenAI(
api_key="test-key",
base_url="http://127.0.0.1:12000/v1"
)
# First turn - Plano creates a new conversation state
response = client.responses.create(
model="claude-sonnet-4-5", # Works with any configured provider
input="My name is Alice and I like Python"
)
# Save the response_id for conversation continuity
resp_id = response.id
print(f"Assistant: {response.output_text}")
# Second turn - Plano automatically retrieves previous context
resp2 = client.responses.create(
        model="claude-sonnet-4-5",  # Make sure it's configured in plano_config.yaml
input="Please list all the messages you have received in our conversation, numbering each one.",
previous_response_id=resp_id,
)
print(f"Assistant: {resp2.output_text}")
    # The reply reflects the full stored history, e.g. it lists the first turn about Alice and Python
Notice how the second request only includes the new user message—Plano automatically merges it with the stored conversation history before sending to the LLM.
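Conceptually, Plano reconstructs the full history before calling the provider, so the second turn is equivalent to sending (illustrative; the exact internal representation may differ):
.. code-block:: python

    merged_messages = [
        {"role": "user", "content": "My name is Alice and I like Python"},
        {"role": "assistant", "content": "<assistant reply from the first turn>"},
        {"role": "user", "content": "Please list all the messages you have received in our conversation, numbering each one."},
    ]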
Configuration Overview
----------------------
State storage is configured in the ``state_storage`` section of your ``plano_config.yaml``:
.. literalinclude:: ../resources/includes/arch_config_state_storage_example.yaml
:language: yaml
:lines: 21-30
:linenos:
:emphasize-lines: 3,6-10
Plano supports two storage backends:
* **Memory**: Fast, ephemeral storage for development and testing. State is lost when Plano restarts.
* **PostgreSQL**: Durable, production-ready storage with support for Supabase and self-hosted PostgreSQL instances.
.. note::
If you don't configure ``state_storage``, conversation state management is **disabled**. The Responses API will still work, but clients must manually include full conversation history in each request (similar to the Chat Completions API behavior).
Memory Storage (Development)
----------------------------
Memory storage keeps conversation state in-memory using a thread-safe ``HashMap``. It's perfect for local development, demos, and testing, but all state is lost when Plano restarts.
**Configuration**
Add this to your ``plano_config.yaml``:
.. code-block:: yaml
state_storage:
type: memory
That's it. No additional setup required.
**When to Use Memory Storage**
* Local development and debugging
* Demos and proof-of-concepts
* Automated testing environments
* Single-instance deployments where persistence isn't critical
**Limitations**
* State is lost on restart
* Not suitable for production workloads
* Cannot scale across multiple Plano instances
PostgreSQL Storage (Production)
--------------------------------
PostgreSQL storage provides durable, production-grade conversation state management. It works with both self-hosted PostgreSQL and Supabase (PostgreSQL-as-a-service), making it ideal for scaling multi-agent systems in production.
Prerequisites
^^^^^^^^^^^^^
Before configuring PostgreSQL storage, you need:
1. A PostgreSQL database (version 12 or later)
2. Database credentials (host, user, password)
3. The ``conversation_states`` table created in your database
**Setting Up the Database**
Run the SQL schema to create the required table:
.. literalinclude:: ../resources/db_setup/conversation_states.sql
:language: sql
:linenos:
**Using psql:**
.. code-block:: bash
psql $DATABASE_URL -f docs/db_setup/conversation_states.sql
**Using Supabase Dashboard:**
1. Log in to your Supabase project
2. Navigate to the SQL Editor
3. Copy and paste the SQL from ``docs/db_setup/conversation_states.sql``
4. Run the query
Configuration
^^^^^^^^^^^^^
Once the database table is created, configure Plano to use PostgreSQL storage:
.. code-block:: yaml
state_storage:
type: postgres
connection_string: "postgresql://user:password@host:5432/database"
**Using Environment Variables**
You should **never** hardcode credentials. Use environment variables instead:
.. code-block:: yaml
state_storage:
type: postgres
connection_string: "postgresql://myuser:$DB_PASSWORD@db.example.com:5432/postgres"
Then set the environment variable before running Plano:
.. code-block:: bash
export DB_PASSWORD="your-secure-password"
# Run Plano or config validation
./plano
.. warning::
**Special Characters in Passwords**: If your password contains special characters like ``#``, ``@``, or ``&``, you must URL-encode them in the connection string. For example, ``MyPass#123`` becomes ``MyPass%23123``.
Supabase Connection Strings
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Supabase requires different connection strings depending on your network setup. Most users should use the **Session Pooler** connection string.
**IPv4 Networks (Most Common)**
Use the Session Pooler connection string (port 5432):
.. code-block:: text
postgresql://postgres.[PROJECT-REF]:[PASSWORD]@aws-0-[REGION].pooler.supabase.com:5432/postgres
**IPv6 Networks**
Use the direct connection (port 5432):
.. code-block:: text
postgresql://postgres:[PASSWORD]@db.[PROJECT-REF].supabase.co:5432/postgres
**Finding Your Connection String**
1. Go to your Supabase project dashboard
2. Navigate to **Settings → Database → Connection Pooling**
3. Copy the **Session mode** connection string
4. Replace ``[YOUR-PASSWORD]`` with your actual database password
5. URL-encode special characters in the password
**Example Configuration**
.. code-block:: yaml
state_storage:
type: postgres
connection_string: "postgresql://postgres.myproject:$DB_PASSWORD@aws-0-us-west-2.pooler.supabase.com:5432/postgres"
Then set the environment variable:
.. code-block:: bash
# If your password is "MyPass#123", encode it as "MyPass%23123"
export DB_PASSWORD="MyPass%23123"
Troubleshooting
---------------
**"Table 'conversation_states' does not exist"**
Run the SQL schema from ``docs/db_setup/conversation_states.sql`` against your database.
**Connection errors with Supabase**
* Verify you're using the correct connection string format (Session Pooler for IPv4)
* Check that your password is URL-encoded if it contains special characters
* Ensure your Supabase project hasn't paused due to inactivity (free tier)
**Permission errors**
Ensure your database user has the following permissions:
.. code-block:: sql
GRANT SELECT, INSERT, UPDATE, DELETE ON conversation_states TO your_user;
**State not persisting across requests**
* Verify ``state_storage`` is configured in your ``plano_config.yaml``
* Check Plano logs for state storage initialization messages
* Ensure the client is sending the ``previous_response_id`` returned by the previous response
Best Practices
--------------
1. **Use environment variables for credentials**: Never hardcode database passwords in configuration files.
2. **Start with memory storage for development**: Switch to PostgreSQL when moving to production.
3. **Implement cleanup policies**: Prevent unbounded growth by regularly archiving or deleting old conversations (see the SQL sketch after this list).
4. **Monitor storage usage**: Track conversation state table size and query performance in production.
5. **Test failover scenarios**: Ensure your application handles storage backend failures gracefully.
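A periodic cleanup job might look like the following sketch (it assumes the ``conversation_states`` table has a creation timestamp column, here called ``created_at``; the 30-day retention window is illustrative):
.. code-block:: sql

    -- Delete conversation state older than the retention window
    DELETE FROM conversation_states
    WHERE created_at < NOW() - INTERVAL '30 days';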
Next Steps
----------
* Learn more about building :ref:`agents <agents>` that leverage conversational state
* Explore :ref:`filter chains <filter_chain>` for enriching conversation context
* See the :ref:`LLM Providers <llm_providers>` guide for configuring model routing