several improvements to docs. TODOS: Tracing and Filters

This commit is contained in:
Salman Paracha 2025-12-21 22:10:32 -08:00
parent 1d6a1613a2
commit e0404d305c
9 changed files with 297 additions and 301 deletions


@ -3,21 +3,17 @@
LLM Routing
==============================================================
With the rapid proliferation of large language models (LLM) — each optimized for different strengths, style, or latency/cost profile — routing has become an essential technique to operationalize the use of different models.
Plano provides three distinct routing approaches to meet different use cases:
1. **Model-based Routing**: Direct routing to specific models using provider/model names
2. **Alias-based Routing**: Semantic routing using custom aliases that map to underlying models
3. **Preference-aligned Routing**: Intelligent routing using the Arch-Router model based on context and user-defined preferences
This enables optimal performance, cost efficiency, and response quality by matching requests with the most suitable model from your available LLM fleet.
With the rapid proliferation of large language models (LLMs) — each optimized for different strengths, style, or latency/cost profile — routing has become an essential technique to operationalize the use of different models. Plano provides three distinct routing approaches to meet different use cases: :ref:`Model-based routing <model_based_routing>`, :ref:`Alias-based routing <alias_based_routing>`, and :ref:`Preference-aligned routing <preference_aligned_routing>`. This enables optimal performance, cost efficiency, and response quality by matching requests with the most suitable model from your available LLM fleet.
.. note::
For details on supported model providers, configuration options, and client libraries, see :ref:`LLM Providers <llm_providers>`.
Routing Methods
---------------
Model-based Routing
.. _model_based_routing:
Model-based routing
~~~~~~~~~~~~~~~~~~~
Direct routing allows you to specify exact provider and model combinations using the format ``provider/model-name``:
@ -26,147 +22,10 @@ Direct routing allows you to specify exact provider and model combinations using
- Provides full control and transparency over which model handles each request
- Ideal for production workloads where you want predictable routing behavior
Alias-based Routing
~~~~~~~~~~~~~~~~~~~
Configuration
^^^^^^^^^^^^^
Alias-based routing lets you create semantic model names that decouple your application from specific providers:
- Use meaningful names like ``fast-model``, ``reasoning-model``, or ``arch.summarize.v1`` (see :ref:`model_aliases`)
- Maps semantic names to underlying provider models for easier experimentation and provider switching
- Ideal for applications that want abstraction from specific model names while maintaining control
.. _preference_aligned_routing:
Preference-aligned Routing (Arch-Router)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Traditional LLM routing approaches face significant limitations: they evaluate performance using benchmarks that often fail to capture human preferences, select from fixed model pools, and operate as "black boxes" without practical mechanisms for encoding user preferences.
Plano's preference-aligned routing addresses these challenges by applying a fundamental engineering principle: decoupling. The framework separates route selection (matching queries to human-readable policies) from model assignment (mapping policies to specific LLMs). This separation allows you to define routing policies using descriptive labels like ``Domain: 'finance', Action: 'analyze_earnings_report'`` rather than cryptic identifiers, while independently configuring which models handle each policy.
The `Arch-Router <https://huggingface.co/katanemo/Arch-Router-1.5B>`_ model automatically selects the most appropriate LLM based on:
- Domain Analysis: Identifies the subject matter (e.g., legal, healthcare, programming)
- Action Classification: Determines the type of operation (e.g., summarization, code generation, translation)
- User-Defined Preferences: Maps domains and actions to preferred models using transparent, configurable routing decisions
- Human Preference Alignment: Uses domain-action mappings that capture subjective evaluation criteria, ensuring routing aligns with real-world user needs rather than just benchmark scores
This approach supports seamlessly adding new models without retraining and is ideal for dynamic, context-aware routing that adapts to request content and intent.
Model-based Routing Workflow
----------------------------
For direct model routing, the process is straightforward:
#. **Client Request**
The client specifies the exact model using provider/model format (``openai/gpt-4o``).
#. **Provider Validation**
Plano validates that the specified provider and model are configured and available.
#. **Direct Routing**
The request is sent directly to the specified model without analysis or decision-making.
#. **Response Handling**
The response is returned to the client with optional metadata about the routing decision.
Alias-based Routing Workflow
-----------------------------
For alias-based routing, the process includes name resolution:
#. **Client Request**
The client specifies a semantic alias name (``reasoning-model``).
#. **Alias Resolution**
Plano resolves the alias to the actual provider/model name based on configuration.
#. **Model Selection**
If the alias maps to multiple models, Plano selects one based on availability and load balancing.
#. **Request Forwarding**
The request is forwarded to the resolved model.
#. **Response Handling**
The response is returned with optional metadata about the alias resolution.
.. _preference_aligned_routing_workflow:
Preference-aligned Routing Workflow (Arch-Router)
-------------------------------------------------
For preference-aligned dynamic routing, the process involves intelligent analysis:
#. **Prompt Analysis**
When a user submits a prompt without specifying a model, the Arch-Router analyzes it to determine the domain (subject matter) and action (type of operation requested).
#. **Model Selection**
Based on the analyzed intent and your configured routing preferences, the Router selects the most appropriate model from your available LLM fleet.
#. **Request Forwarding**
Once the optimal model is identified, our gateway forwards the original prompt to the selected LLM endpoint. The routing decision is transparent and can be logged for monitoring and optimization purposes.
#. **Response Handling**
After the selected model processes the request, the response is returned through the gateway. The gateway can optionally add routing metadata or performance metrics to help you understand and optimize your routing decisions.
Arch-Router
-------------------------
The `Arch-Router <https://huggingface.co/katanemo/Arch-Router-1.5B>`_ is a state-of-the-art **preference-based routing model** specifically designed to address the limitations of traditional LLM routing. This compact 1.5B model delivers production-ready performance with low latency and high accuracy while solving key routing challenges.
**Addressing Traditional Routing Limitations:**
**Human Preference Alignment**
Unlike benchmark-driven approaches, Arch-Router learns to match queries with human preferences by using domain-action mappings that capture subjective evaluation criteria, ensuring routing decisions align with real-world user needs.
**Flexible Model Integration**
The system supports seamlessly adding new models for routing without requiring retraining or architectural modifications, enabling dynamic adaptation to evolving model landscapes.
**Preference-Encoded Routing**
Provides a practical mechanism to encode user preferences through domain-action mappings, offering transparent and controllable routing decisions that can be customized for specific use cases.
To support effective routing, Arch-Router introduces two key concepts:
- **Domain** the high-level thematic category or subject matter of a request (e.g., legal, healthcare, programming).
- **Action** the specific type of operation the user wants performed (e.g., summarization, code generation, booking appointment, translation).
Both domain and action configs are associated with preferred models or model variants. At inference time, Arch-Router analyzes the incoming prompt to infer its domain and action using semantic similarity, task indicators, and contextual cues. It then applies the user-defined routing preferences to select the model best suited to handle the request.
In summary, Arch-Router demonstrates:
- **Structured Preference Routing**: Aligns prompt request with model strengths using explicit domainaction mappings.
- **Transparent and Controllable**: Makes routing decisions transparent and configurable, empowering users to customize system behavior.
- **Flexible and Adaptive**: Supports evolving user needs, model updates, and new domains/actions without retraining the router.
- **Production-Ready Performance**: Optimized for low-latency, high-throughput applications in multi-model environments.
.. _implementing_routing:
Implementing Routing
--------------------
**Model-based Routing**
For direct model routing, configure your LLM providers with specific provider/model names:
Configure your LLM providers with specific provider/model names:
.. code-block:: yaml
:caption: Model-based Routing Configuration
@ -189,6 +48,9 @@ For direct model routing, configure your LLM providers with specific provider/mo
- model: anthropic/claude-3-5-sonnet-20241022
access_key: $ANTHROPIC_API_KEY
Client usage
^^^^^^^^^^^^
Clients specify exact models:
.. code-block:: python
@ -204,7 +66,19 @@ Clients specify exact models:
messages=[{"role": "user", "content": "Write a story"}]
)
**Alias-based Routing**
.. _alias_based_routing:
Alias-based routing
~~~~~~~~~~~~~~~~~~~
Alias-based routing lets you create semantic model names that decouple your application from specific providers:
- Use meaningful names like ``fast-model``, ``reasoning-model``, or ``arch.summarize.v1`` (see :ref:`model_aliases`)
- Maps semantic names to underlying provider models for easier experimentation and provider switching
- Ideal for applications that want abstraction from specific model names while maintaining control
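As a rough illustration of the decoupling described above, alias resolution can be thought of as a lookup that runs before the request is forwarded. The sketch below is not Plano's implementation: the ``MODEL_ALIASES`` table and the ``resolve_model`` helper are hypothetical names, and the ``fast-model`` target is an assumed value chosen for the example.

```python
# Hypothetical alias table. In Plano the mapping lives in the gateway
# configuration; this dict only illustrates the idea.
MODEL_ALIASES = {
    "fast-model": "gpt-4o",  # assumed target, for illustration only
    "creative-model": "claude-3-5-sonnet-20241022",
}


def resolve_model(name: str, aliases: dict = MODEL_ALIASES) -> str:
    """Return the underlying model for a known alias.

    Names that are not aliases (for example ``openai/gpt-4o``) pass
    through unchanged, so direct model-based routing keeps working.
    """
    return aliases.get(name, name)


if __name__ == "__main__":
    print(resolve_model("creative-model"))
    print(resolve_model("openai/gpt-4o"))
```

Because the application only ever sends the alias, swapping the target model is a configuration change rather than a code change.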
Configuration
^^^^^^^^^^^^^
Configure semantic aliases that map to underlying models:
@ -239,6 +113,9 @@ Configure semantic aliases that map to underlying models:
creative-model:
target: claude-3-5-sonnet-20241022
Client usage
^^^^^^^^^^^^
Clients use semantic names:
.. code-block:: python
@ -254,9 +131,23 @@ Clients use semantic names:
messages=[{"role": "user", "content": "Solve this complex problem"}]
)
**Preference-aligned Routing (Arch-Router)**
.. _preference_aligned_routing:
To configure preference-aligned dynamic routing, you need to define routing preferences that map domains and actions to specific models:
Preference-aligned routing (Arch-Router)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Preference-aligned routing uses the `Arch-Router <https://huggingface.co/katanemo/Arch-Router-1.5B>`_ model to pick the best LLM based on domain, action, and your configured preferences instead of hard-coding a model.
- **Domain**: High-level topic of the request (e.g., legal, healthcare, programming).
- **Action**: What the user wants to do (e.g., summarize, generate code, translate).
- **Routing preferences**: Your mapping from (domain, action) to preferred models.
Arch-Router analyzes each prompt to infer domain and action, then applies your preferences to select a model. This decouples **routing policy** (how to choose) from **model assignment** (what to run), making routing transparent, controllable, and easy to extend as you add or swap models.
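The separation between routing policy and model assignment can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not how Arch-Router works internally: the keyword-overlap ``select_route`` function is a stand-in for the router model, and the ``contract summary`` route, the model assignments, and ``DEFAULT_MODEL`` are hypothetical values.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Route:
    name: str
    description: str


# Routing policy: human-readable routes that describe kinds of requests.
ROUTES = [
    Route("code generation",
          "generating new code snippets, functions, or boilerplate based on user prompts"),
    Route("contract summary",  # hypothetical route for illustration
          "summarizing legal contracts and agreements"),
]

# Model assignment: configured independently of the policy above.
# These pairings are assumptions made for the example.
MODEL_FOR_ROUTE = {
    "code generation": "anthropic/claude-3-5-sonnet-20241022",
    "contract summary": "openai/gpt-4o",
}
DEFAULT_MODEL = "openai/gpt-4o"


def select_route(prompt: str, routes: list) -> Optional[Route]:
    """Stand-in for the router: pick the route whose description shares
    the most words with the prompt. The real router is a language model."""
    words = set(prompt.lower().split())
    best, best_score = None, 0
    for route in routes:
        desc_words = set(route.description.lower().replace(",", " ").split())
        score = len(words & desc_words)
        if score > best_score:
            best, best_score = route, score
    return best


def pick_model(prompt: str) -> str:
    """Step 1: select a route. Step 2: map the route to a model."""
    route = select_route(prompt, ROUTES)
    if route is None:
        return DEFAULT_MODEL
    return MODEL_FOR_ROUTE.get(route.name, DEFAULT_MODEL)
```

Changing which model serves ``code generation`` only touches ``MODEL_FOR_ROUTE``; the route definitions and the selection step stay the same.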
Configuration
^^^^^^^^^^^^^
To configure preference-aligned dynamic routing, define routing preferences that map domains and actions to specific models:
.. code-block:: yaml
:caption: Preference-Aligned Dynamic Routing Configuration
@ -289,7 +180,10 @@ To configure preference-aligned dynamic routing, you need to define routing pref
- name: code generation
description: generating new code snippets, functions, or boilerplate based on user prompts
Clients can let the router decide or use aliases:
Client usage
^^^^^^^^^^^^
Clients can let the router decide or still specify aliases:
.. code-block:: python
@ -300,6 +194,40 @@ Clients can let the router decide or use aliases:
)
Arch-Router
-----------
The `Arch-Router <https://huggingface.co/katanemo/Arch-Router-1.5B>`_ is a state-of-the-art **preference-based routing model** specifically designed to address the limitations of traditional LLM routing. This compact 1.5B model delivers production-ready performance with low latency and high accuracy while solving key routing challenges.
**Addressing Traditional Routing Limitations:**
**Human Preference Alignment**
Unlike benchmark-driven approaches, Arch-Router learns to match queries with human preferences by using domain-action mappings that capture subjective evaluation criteria, ensuring routing decisions align with real-world user needs.
**Flexible Model Integration**
The system supports seamlessly adding new models for routing without requiring retraining or architectural modifications, enabling dynamic adaptation to evolving model landscapes.
**Preference-Encoded Routing**
Provides a practical mechanism to encode user preferences through domain-action mappings, offering transparent and controllable routing decisions that can be customized for specific use cases.
To support effective routing, Arch-Router introduces two key concepts:
- **Domain**: the high-level thematic category or subject matter of a request (e.g., legal, healthcare, programming).
- **Action**: the specific type of operation the user wants performed (e.g., summarization, code generation, booking an appointment, translation).
Both domain and action configs are associated with preferred models or model variants. At inference time, Arch-Router analyzes the incoming prompt to infer its domain and action using semantic similarity, task indicators, and contextual cues. It then applies the user-defined routing preferences to select the model best suited to handle the request.
In summary, Arch-Router demonstrates:
- **Structured Preference Routing**: Aligns each prompt with model strengths using explicit domain-action mappings.
- **Transparent and Controllable**: Makes routing decisions transparent and configurable, empowering users to customize system behavior.
- **Flexible and Adaptive**: Supports evolving user needs, model updates, and new domains/actions without retraining the router.
- **Production-Ready Performance**: Optimized for low-latency, high-throughput applications in multi-model environments.
Combining Routing Methods
-------------------------
@ -343,7 +271,7 @@ This configuration allows clients to:
2. **Let the router decide**: No model specified, router analyzes content
Example Use Cases
-------------------------
-----------------
Here are common scenarios where Arch-Router excels:
- **Coding Tasks**: Distinguish between code generation requests ("write a Python function"), debugging needs ("fix this error"), and code optimization ("make this faster"), routing each to appropriately specialized models.
@ -354,9 +282,8 @@ Here are common scenarios where Arch-Router excels:
- **Conversational Routing**: Track conversation context to identify when topics shift between domains or when the type of assistance needed changes mid-conversation.
Best practicesm
-------------------------
Best practices
--------------
- **💡Consistent Naming:** Route names should align with their descriptions.
- ❌ Bad:
@ -381,18 +308,15 @@ Best practicesm
- **💡Nouns Descriptor:** Preference-based routers perform better with noun-centric descriptors, as they offer more stable and semantically rich signals for matching.
- **💡Domain Inclusion:** for best user experience, you should always include domain route. This help the router fall back to domain when action is not
- **💡Domain Inclusion:** For the best user experience, always include a domain route. This lets the router fall back to the domain when the action is not confidently inferred.
.. Unsupported Features
.. -------------------------
Unsupported Features
--------------------
.. The following features are **not supported** by the Arch-Router model:
The following features are **not supported** by the Arch-Router model:
.. - **❌ Multi-Modality:**
.. The model is not trained to process raw image or audio inputs. While it can handle textual queries *about* these modalities (e.g., "generate an image of a cat"), it cannot interpret encoded multimedia data directly.
- **Multi-modality**: The model is not trained to process raw image or audio inputs. It can handle textual queries *about* these modalities (e.g., "generate an image of a cat"), but cannot interpret encoded multimedia data directly.
.. - **❌ Function Calling:**
.. This model is designed for **semantic preference matching**, not exact intent classification or tool execution. For structured function invocation, use models in the **Plano-Function-Calling** collection.
- **Function calling**: Arch-Router is designed for **semantic preference matching**, not exact intent classification or tool execution. For structured function invocation, use models in the Plano Function Calling collection instead.
.. - **❌ System Prompt Dependency:**
.. Arch-Router routes based solely on the users conversation history. It does not use or rely on system prompts for routing decisions.
- **System prompt dependency**: Arch-Router routes based solely on the user's conversation history. It does not use or rely on system prompts for routing decisions.


@ -17,9 +17,9 @@ requests in an AI application. With tracing, you can capture a detailed view of
through various services and components, which is crucial for **debugging**, **performance optimization**,
and understanding complex AI agent architectures like Co-pilots.
**Arch** propagates trace context using the W3C Trace Context standard, specifically through the
**Plano** propagates trace context using the W3C Trace Context standard, specifically through the
``traceparent`` header. This allows each component in the system to record its part of the request
flow, enabling **end-to-end tracing** across the entire application. By using OpenTelemetry, Arch ensures
flow, enabling **end-to-end tracing** across the entire application. By using OpenTelemetry, Plano ensures
that developers can capture this trace data consistently and in a format compatible with various observability
tools.
@ -43,7 +43,7 @@ How to Initiate A Trace
1. **Enable Tracing Configuration**: Add ``random_sampling`` set to ``100`` under the ``tracing`` section in the :ref:`listener <plano_overview_listeners>` config.
2. **Trace Context Propagation**: Arch automatically propagates the ``traceparent`` header. When a request is received, Arch will:
2. **Trace Context Propagation**: Plano automatically propagates the ``traceparent`` header. When a request is received, Plano will:
- Generate a new ``traceparent`` header if one is not present.
- Extract the trace context from the ``traceparent`` header if it exists.
@ -57,7 +57,7 @@ How to Initiate A Trace
Trace Propagation
-----------------
Arch uses the W3C Trace Context standard for trace propagation, which relies on the ``traceparent`` header.
Plano uses the W3C Trace Context standard for trace propagation, which relies on the ``traceparent`` header.
This header carries tracing information in a standardized format, enabling interoperability between different
tracing systems.
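For reference, a version ``00`` ``traceparent`` value has four dash-separated lowercase hex fields: ``version-trace_id-parent_id-trace_flags``. The sketch below shows one way an application could validate an incoming header or mint a new one when none is present. The helper names are hypothetical and this is not Plano's internal code. It only accepts the version ``00`` layout.

```python
import re
import secrets
from typing import Optional

# version (2 hex) - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex)
_TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def new_traceparent(sampled: bool = True) -> str:
    """Create a fresh version-00 traceparent with random IDs."""
    trace_id = secrets.token_hex(16)   # 16 bytes -> 32 hex characters
    parent_id = secrets.token_hex(8)   # 8 bytes -> 16 hex characters
    flags = "01" if sampled else "00"  # bit 0 is the "sampled" flag
    return f"00-{trace_id}-{parent_id}-{flags}"


def parse_traceparent(header: str) -> Optional[dict]:
    """Return the header fields, or None if the value is malformed."""
    match = _TRACEPARENT_RE.match(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version "ff" and all-zero IDs are invalid per the W3C specification.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {
        "version": version,
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


def ensure_traceparent(headers: dict) -> str:
    """Reuse a valid incoming traceparent, otherwise generate a new one."""
    existing = headers.get("traceparent")
    if existing and parse_traceparent(existing) is not None:
        return existing
    return new_traceparent()
```

A service that calls ``ensure_traceparent`` on every inbound request keeps the same ``trace_id`` across hops, which is what lets a tracing backend stitch spans into a single trace.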
@ -77,7 +77,7 @@ Instrumentation
~~~~~~~~~~~~~~~
To integrate AI tracing, your application needs to follow a few simple steps. The steps
below are very common practice, and not unique to Arch, when you reading tracing headers and export
below are common practice, and not unique to Plano, whenever you read tracing headers and export
`spans <https://docs.lightstep.com/docs/understand-distributed-tracing>`_ for distributed tracing.
- Read the ``traceparent`` header from incoming requests.
@ -148,66 +148,6 @@ Handle incoming requests:
print(f"Payment service response: {response.content}")
AI Agent Tracing Visualization Example
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The following is an example of tracing for an AI-powered customer support system.
A customer interacts with AI agents, which forward their requests through different
specialized services and external systems.
::
+--------------------------+
| Customer Interaction |
+--------------------------+
|
v
+--------------------------+ +--------------------------+
| Agent 1 (Main - Arch) | ----> | External Payment Service |
+--------------------------+ +--------------------------+
| |
v v
+--------------------------+ +--------------------------+
| Agent 2 (Support - Arch)| ----> | Internal Tech Support |
+--------------------------+ +--------------------------+
| |
v v
+--------------------------+ +--------------------------+
| Agent 3 (Orders- Arch) | ----> | Inventory Management |
+--------------------------+ +--------------------------+
Trace Breakdown:
****************
- Customer Interaction:
- Span 1: Customer initiates a request via the AI-powered chatbot for billing support (e.g., asking for payment details).
- AI Agent 1 (Main - Arch):
- Span 2: AI Agent 1 (Main) processes the request and identifies it as related to billing, forwarding the request
to an external payment service.
- Span 3: AI Agent 1 determines that additional technical support is needed for processing and forwards the request
to AI Agent 2.
- External Payment Service:
- Span 4: The external payment service processes the payment-related request (e.g., verifying payment status) and sends
the response back to AI Agent 1.
- AI Agent 2 (Tech - Arch):
- Span 5: AI Agent 2, responsible for technical queries, processes a request forwarded from AI Agent 1 (e.g., checking for
any account issues).
- Span 6: AI Agent 2 forwards the query to Internal Tech Support for further investigation.
- Internal Tech Support:
- Span 7: Internal Tech Support processes the request (e.g., resolving account access issues) and responds to AI Agent 2.
- AI Agent 3 (Orders - Arch):
- Span 8: AI Agent 3 handles order-related queries. AI Agent 1 forwards the request to AI Agent 3 after payment verification
is completed.
- Span 9: AI Agent 3 forwards a request to the Inventory Management system to confirm product availability for a pending order.
- Inventory Management:
- Span 10: The Inventory Management system checks stock and availability and returns the information to AI Agent 3.
Integrating with Tracing Tools
------------------------------
@ -292,11 +232,11 @@ To send tracing data to `Datadog <https://docs.datadoghq.com/getting_started/tra
Langtrace
~~~~~~~~~
Langtrace is an observability tool designed specifically for large language models (LLMs). It helps you capture, analyze, and understand how LLMs are used in your applications including those built using Arch.
Langtrace is an observability tool designed specifically for large language models (LLMs). It helps you capture, analyze, and understand how LLMs are used in your applications, including those built using Plano.
To send tracing data to `Langtrace <https://docs.langtrace.ai/supported-integrations/llm-tools/arch>`_:
1. **Configure Arch**: Make sure Arch is installed and setup correctly. For more information, refer to the `installation guide <https://github.com/katanemo/archgw?tab=readme-ov-file#prerequisites>`_.
1. **Configure Plano**: Make sure Plano is installed and set up correctly. For more information, refer to the `installation guide <https://github.com/katanemo/archgw?tab=readme-ov-file#prerequisites>`_.
2. **Install Langtrace**: Install the Langtrace SDK:
@ -348,7 +288,7 @@ Best Practices
Summary
----------
By leveraging the ``traceparent`` header for trace context propagation, Arch enables developers to implement
By leveraging the ``traceparent`` header for trace context propagation, Plano enables developers to implement
tracing efficiently. This approach simplifies the process of collecting and analyzing tracing data in common
tools like AWS X-Ray and Datadog, enhancing observability and facilitating faster debugging and optimization.