Use intent model from archfc to pick prompt gateway (#328)

2026-05-15 11:02:39 +02:00 · 2024-12-20 13:25:01 -08:00 · 2024-12-20 13:25:01 -08:00 · ba7279becb
commit ba7279becb
parent 67b8fd635e
151 changed files with 8642 additions and 10932 deletions
--- a/docs/source/_static/img/input-token-metrics.png
+++ b/docs/source/_static/img/input-token-metrics.png
--- a/docs/source/_static/img/llm-request-metrics.png
+++ b/docs/source/_static/img/llm-request-metrics.png
--- a/docs/source/_static/img/output-token-metrics.png
+++ b/docs/source/_static/img/output-token-metrics.png
--- a/docs/source/build_with_arch/agent.rst
+++ b/docs/source/build_with_arch/agent.rst
@@ -1,18 +1,18 @@
 .. _arch_agent_guide:

-Agentic Workflow
-==============================
+Agentic Apps
+=============

-Arch helps you easily personalize your applications by calling application-specific (API) functions
-via user prompts. This involves any predefined functions or APIs you want to expose to users to perform tasks,
-gather information, or manipulate data. This capability is generally referred to as :ref:`function calling <function_calling>`, where
-you have the flexibility to support “agentic” apps tailored to specific use cases - from updating insurance
-claims to creating ad campaigns - via prompts.
+Arch helps you build personalized agentic applications by calling application-specific (API) functions via user prompts.
+This involves any predefined functions or APIs you want to expose to users to perform tasks, gather information,
+or manipulate data. This capability is generally referred to as :ref:`function calling <function_calling>`, where
+you can support “agentic” apps tailored to specific use cases - from updating insurance claims to creating ad campaigns - via prompts.

-Arch analyzes prompts, extracts critical information from prompts, engages in lightweight conversation with
-the user to gather any missing parameters and makes API calls so that you can focus on writing business logic.
-Arch does this via its purpose-built `Arch-Function <https://huggingface.co/collections/katanemo/arch-function-66f209a693ea8df14317ad68>`_ - the fastest (200ms p90 - 10x faser than GPT-4o)
-and cheapest (100x than GPT-4o) function calling LLM that matches performance with frontier models.
+Arch analyzes prompts, extracts critical information from prompts, engages in lightweight conversation with the user to
+gather any missing parameters and makes API calls so that you can focus on writing business logic. Arch does this via its
+purpose-built `Arch-Function <https://huggingface.co/collections/katanemo/arch-function-66f209a693ea8df14317ad68>`_ -
+the fastest (200ms p50 - 12x faster than GPT-4o) and cheapest (44x cheaper than GPT-4o) function calling LLM that matches or outperforms
+frontier LLMs.

 .. image:: includes/agent/function-calling-flow.jpg
   :width: 100%
--- a/docs/source/build_with_arch/includes/multi_turn/multi_turn_rag.py
+++ b/docs/source/build_with_arch/includes/multi_turn/multi_turn_rag.py
@@ -0,0 +1,39 @@
+import os
+import gradio as gr
+
+from fastapi import FastAPI, HTTPException
+from pydantic import BaseModel
+from typing import Optional
+from openai import OpenAI
+from common import create_gradio_app
+
+app = FastAPI()
+
+
+# Define the request model
+class EnergySourceRequest(BaseModel):
+    energy_source: str
+    consideration: Optional[str] = None
+
+
+class EnergySourceResponse(BaseModel):
+    energy_source: str
+    consideration: Optional[str] = None
+
+
+# POST endpoint that returns details about an energy source
+@app.post("/agent/energy_source_info")
+def get_energy_source_info(request: EnergySourceRequest):
+    """
+    Endpoint to get details about energy source
+    """
+    consideration = "You don't have any specific consideration. Feel free to talk in a more open ended fashion"
+
+    if request.consideration is not None:
+        consideration = f"Add specific focus on the following consideration when you summarize the content for the energy source: {request.consideration}"
+
+    response = {
+        "energy_source": request.energy_source,
+        "consideration": consideration,
+    }
+    return response
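A minimal sketch, not part of this commit, of the handler's branching logic as a plain function so it can be exercised without starting a server. The function name `build_energy_source_response` and the constant `DEFAULT_CONSIDERATION` are hypothetical; the two message strings are copied from the handler above.

```python
from typing import Optional

# Default text used when the prompt carried no specific consideration.
DEFAULT_CONSIDERATION = (
    "You don't have any specific consideration. "
    "Feel free to talk in a more open ended fashion"
)


def build_energy_source_response(
    energy_source: str, consideration: Optional[str] = None
) -> dict:
    """Echo the energy source and expand the optional consideration into an instruction."""
    if consideration is not None:
        note = (
            "Add specific focus on the following consideration when you "
            f"summarize the content for the energy source: {consideration}"
        )
    else:
        note = DEFAULT_CONSIDERATION
    return {"energy_source": energy_source, "consideration": note}


# First turn: only the energy source is known.
first_turn = build_energy_source_response("renewable")
# Follow-up turn: the same source plus the newly extracted consideration.
follow_up = build_energy_source_response("renewable", "cost")
```

The follow-up call takes the same parameters as the first one, which is the point of the example: the endpoint does not need to know whether it is serving a first turn or a later one.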
--- a/docs/source/build_with_arch/includes/multi_turn/mutli-turn-example.png
+++ b/docs/source/build_with_arch/includes/multi_turn/mutli-turn-example.png
--- a/docs/source/build_with_arch/includes/multi_turn/prompt_targets_multi_turn.yaml
+++ b/docs/source/build_with_arch/includes/multi_turn/prompt_targets_multi_turn.yaml
@@ -0,0 +1,35 @@
+version: v0.1
+listener:
+  address: 127.0.0.1
+  port: 8080 #If you configure port 443, you'll need to update the listener with tls_certificates
+  message_format: huggingface
+
+# Centralized way to manage LLMs: keys, retry logic, failover, and limits
+llm_providers:
+  - name: OpenAI
+    provider: openai
+    access_key: $OPENAI_API_KEY
+    model: gpt-3.5-turbo
+    default: true
+
+# default system prompt used by all prompt targets
+system_prompt: |
+   You are a helpful assistant and can offer information about energy sources. You will get a JSON object with energy_source and consideration fields. Focus on answering using those fields
+
+prompt_targets:
+  - name: get_info_for_energy_source
+    description: get information about an energy source
+    parameters:
+      - name: energy_source
+        type: str
+        description: a source of energy
+        required: true
+        enum: [renewable, fossil]
+      - name: consideration
+        type: str
+        description: a specific type of consideration for an energy source
+        enum: [cost, economic, technology]
+    endpoint:
+      name: rag_energy_source_agent
+      path: /agent/energy_source_info
+      http_method: POST
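A minimal sketch, not part of this commit, of how the `required` and `enum` constraints declared in the prompt target above could be checked in application code. This is an illustrative stand-alone check under the assumption that the config has already been loaded into a dict. It does not describe how Arch itself validates parameters, and `validate_parameters` is a hypothetical helper.

```python
# Parameter declarations copied from the prompt target above, as a plain dict.
PROMPT_TARGET = {
    "name": "get_info_for_energy_source",
    "parameters": [
        {"name": "energy_source", "required": True, "enum": ["renewable", "fossil"]},
        {"name": "consideration", "required": False, "enum": ["cost", "economic", "technology"]},
    ],
}


def validate_parameters(target: dict, values: dict) -> list:
    """Return a list of human-readable problems; an empty list means the values are acceptable."""
    errors = []
    for param in target["parameters"]:
        name = param["name"]
        value = values.get(name)
        if value is None:
            # Missing values are only a problem for required parameters.
            if param.get("required", False):
                errors.append(f"missing required parameter: {name}")
            continue
        allowed = param.get("enum")
        if allowed is not None and value not in allowed:
            errors.append(f"invalid value for {name}: {value!r}")
    return errors


ok = validate_parameters(PROMPT_TARGET, {"energy_source": "renewable", "consideration": "cost"})
missing = validate_parameters(PROMPT_TARGET, {"consideration": "cost"})
invalid = validate_parameters(PROMPT_TARGET, {"energy_source": "nuclear"})
```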
--- a/docs/source/build_with_arch/includes/rag/intent_detection.py
+++ b/docs/source/build_with_arch/includes/rag/intent_detection.py
@@ -1,162 +0,0 @@
-from flask import Flask, request, jsonify
-from datetime import datetime
-import uuid
-from langchain.memory import ConversationBufferMemory
-from langchain.schema import AIMessage, HumanMessage
-from langchain import OpenAI
-
-app = Flask(__name__)
-
-# Global dictionary to keep track of user memories
-user_memories = {}
-
-
-def get_user_conversation(user_id):
-    """
-    Retrieve the user's conversation memory using LangChain.
-    If the user does not exist, initialize their conversation memory.
-    """
-    if user_id not in user_memories:
-        user_memories[user_id] = ConversationBufferMemory(return_messages=True)
-    return user_memories[user_id]
-
-
-def update_user_conversation(user_id, client_messages, intent_changed):
-    """
-    Update the user's conversation memory with new messages using LangChain.
-    Each message is augmented with a UUID, timestamp, and intent change marker.
-    Only new messages are added to avoid duplication.
-    """
-    memory = get_user_conversation(user_id)
-    stored_messages = memory.chat_memory.messages
-
-    # Determine the number of stored messages
-    num_stored_messages = len(stored_messages)
-    new_messages = client_messages[num_stored_messages:]
-
-    # Process each new message
-    for index, message in enumerate(new_messages):
-        role = message.get("role")
-        content = message.get("content")
-        metadata = {
-            "uuid": str(uuid.uuid4()),
-            "timestamp": datetime.utcnow().isoformat(),
-            "intent_changed": False,  # Default value
-        }
-
-        # Mark the intent change on the last message if detected
-        if intent_changed and index == len(new_messages) - 1:
-            metadata["intent_changed"] = True
-
-        # Create a new message with metadata
-        if role == "user":
-            memory.chat_memory.add_message(
-                HumanMessage(content=content, additional_kwargs={"metadata": metadata})
-            )
-        elif role == "assistant":
-            memory.chat_memory.add_message(
-                AIMessage(content=content, additional_kwargs={"metadata": metadata})
-            )
-        else:
-            # Handle other roles if necessary
-            pass
-
-    return memory
-
-
-def get_messages_since_last_intent(messages):
-    """
-    Retrieve messages from the last intent change onwards using LangChain.
-    """
-    messages_since_intent = []
-    for message in reversed(messages):
-        # Insert message at the beginning to maintain correct order
-        messages_since_intent.insert(0, message)
-        metadata = message.additional_kwargs.get("metadata", {})
-        # Break if intent_changed is True
-        if metadata.get("intent_changed", False) == True:
-            break
-
-    return messages_since_intent
-
-
-def forward_to_llm(messages):
-    """
-    Forward messages to an upstream LLM using LangChain.
-    """
-    # Convert messages to a conversation string
-    conversation = ""
-    for message in messages:
-        role = "User" if isinstance(message, HumanMessage) else "Assistant"
-        content = message.content
-        conversation += f"{role}: {content}\n"
-    # Use LangChain's LLM to get a response. This call is proxied through Arch for end-to-end observability and traffic management
-    llm = OpenAI()
-    # Create a prompt that includes the conversation
-    prompt = f"{conversation}Assistant:"
-    response = llm(prompt)
-    return response
-
-
-@app.route("/process_rag", methods=["POST"])
-def process_rag():
-    # Extract JSON data from the request
-    data = request.get_json()
-
-    user_id = data.get("user_id")
-    if not user_id:
-        return jsonify({"error": "User ID is required"}), 400
-
-    client_messages = data.get("messages")
-    if not client_messages or not isinstance(client_messages, list):
-        return jsonify({"error": "Messages array is required"}), 400
-
-    # Extract the intent change marker from Arch's headers if present for the current prompt
-    intent_changed_header = request.headers.get("x-arch-intent-marker", "").lower()
-    if intent_changed_header in ["", "false"]:
-        intent_changed = False
-    elif intent_changed_header == "true":
-        intent_changed = True
-    else:
-        # Invalid value provided
-        return (
-            jsonify({"error": "Invalid value for x-arch-prompt-intent-change header"}),
-            400,
-        )
-
-    # Update user conversation based on intent change
-    memory = update_user_conversation(user_id, client_messages, intent_changed)
-
-    # Retrieve messages since last intent change for LLM
-    messages_for_llm = get_messages_since_last_intent(memory.chat_memory.messages)
-
-    # Forward messages to upstream LLM
-    llm_response = forward_to_llm(messages_for_llm)
-
-    # Prepare the messages to return
-    messages_to_return = []
-    for message in memory.chat_memory.messages:
-        role = "user" if isinstance(message, HumanMessage) else "assistant"
-        content = message.content
-        metadata = message.additional_kwargs.get("metadata", {})
-        message_entry = {
-            "uuid": metadata.get("uuid"),
-            "timestamp": metadata.get("timestamp"),
-            "role": role,
-            "content": content,
-            "intent_changed": metadata.get("intent_changed", False),
-        }
-        messages_to_return.append(message_entry)
-
-    # Prepare the response
-    response = {
-        "user_id": user_id,
-        "messages": messages_to_return,
-        "llm_response": llm_response,
-    }
-
-    return jsonify(response), 200
-
-
-if __name__ == "__main__":
-    app.run(debug=True)
--- a/docs/source/build_with_arch/multi_turn.rst
+++ b/docs/source/build_with_arch/multi_turn.rst
@@ -0,0 +1,90 @@
+.. _arch_multi_turn_guide:
+
+Multi-Turn
+==========
+Developers often `struggle <https://www.reddit.com/r/LocalLLaMA/comments/18mqwg6/best_practice_for_rag_with_followup_chat/>`_ to efficiently handle
+``follow-up`` or ``clarification`` questions. Specifically, when users ask for changes or additions to previous responses, it requires developers to
+rewrite prompts using LLMs with precise prompt engineering techniques. This process is slow, manual, error-prone and adds latency and token cost for
+common scenarios that can be managed more efficiently.
+
+Arch accurately detects and processes prompts in multi-turn scenarios so that you can build fast and accurate agents in minutes.
+Below are some conversational examples that you can build via Arch. Each example is enriched with annotations (via ``**[Arch]**``) that illustrate how Arch
+processes conversational messages on your behalf.
+
+.. Note::
+    The following section assumes that you have some knowledge about the core concepts of Arch, such as :ref:`prompt_targets <arch_overview_prompt_handling>`.
+    If you haven't familiarized yourself with Arch's concepts, we recommend you first read the :ref:`tech overview <tech_overview>` section.
+    Additionally, the conversation examples below assume the usage of the following :ref:`arch_config.yaml <multi_turn_subsection_prompt_target>` file.
+
+Example 1: Adjusting Retrieval
+------------------------------
+.. code-block:: text
+
+    User: What are the benefits of renewable energy?
+    **[Arch]**: Check if there is an available <prompt_target> that can handle this user query.
+    **[Arch]**: Found "get_info_for_energy_source" prompt_target in arch_config.yaml. Forward prompt to the endpoint configured in "get_info_for_energy_source"
+    ...
+    Assistant: Renewable energy reduces greenhouse gas emissions, lowers air pollution, and provides sustainable power sources like solar and wind.
+
+    User: Include cost considerations in the response.
+    **[Arch]**: Follow-up detected. Forward prompt history to the "get_info_for_energy_source" prompt_target and post the following parameters consideration="cost"
+    ...
+    Assistant: Renewable energy reduces greenhouse gas emissions, lowers air pollution, and provides sustainable power sources like solar and wind. While the initial setup costs can be high, long-term savings from reduced fuel expenses and government incentives make it cost-effective.
+
+
+Example 2: Switching Intent
+---------------------------
+.. code-block:: text
+
+    User: What are the symptoms of diabetes?
+    **[Arch]**: Check if there is an available <prompt_target> that can handle this user query.
+    **[Arch]**: Found "diseases_symptoms" prompt_target in arch_config.yaml. Forward disease=diabetes to "diseases_symptoms" prompt target
+    ...
+    Assistant: Common symptoms include frequent urination, excessive thirst, fatigue, and blurry vision.
+
+    User: How is it diagnosed?
+    **[Arch]**: New intent detected.
+    **[Arch]**: Found "disease_diagnoses" prompt_target in arch_config.yaml. Forward disease=diabetes to "disease_diagnoses" prompt target
+    ...
+    Assistant: Diabetes is diagnosed through blood tests like fasting blood sugar, A1C, or an oral glucose tolerance test.
+
+
+Build Multi-Turn RAG Apps
+--------------------------
+The following section describes how you can easily add support for multi-turn scenarios via Arch. You process and manage multi-turn prompts
+just like you manage single-turn ones. Arch handles the complexity of detecting the correct intent based on the last user prompt and
+the conversational history, extracts relevant parameters needed by downstream APIs, and dispatches calls to any upstream LLMs to summarize the
+response from your APIs.
+
+
+.. _multi_turn_subsection_prompt_target:
+
+Step 1: Define Arch Config
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. literalinclude:: includes/multi_turn/prompt_targets_multi_turn.yaml
+    :language: yaml
+    :caption: Arch Config
+    :linenos:
+
+Step 2: Process Request in FastAPI
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Once the prompt targets are configured as above, handle parameters across multiple turns as if it's a single-turn request.
+
+.. literalinclude:: includes/multi_turn/multi_turn_rag.py
+    :language: python
+    :caption: Parameter handling with FastAPI
+    :linenos:
+
+Demo App
+~~~~~~~~
+
+For your convenience, we've built a `demo app <https://github.com/katanemo/archgw/main/demos/multi_turn_rag_agent>`_
+that you can test and modify locally for multi-turn RAG scenarios.
+
+.. figure:: includes/multi_turn/mutli-turn-example.png
+   :width: 100%
+   :align: center
+
+   Example multi-turn user conversation showing adjusting retrieval
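A minimal sketch, not part of this commit, of the client side of Example 1: the application keeps the running conversation and sends the whole history with each new user prompt, so that a follow-up like "Include cost considerations" can be resolved against earlier turns. The OpenAI-style `messages` list and the helper `append_turn` are assumptions for illustration. The exact request shape expected by the listener is not shown in this commit.

```python
import json
from typing import Dict, List

Message = Dict[str, str]


def append_turn(history: List[Message], role: str, content: str) -> List[Message]:
    """Return a new history with one more message; the input list is not mutated."""
    if role not in ("user", "assistant"):
        raise ValueError(f"unsupported role: {role}")
    return history + [{"role": role, "content": content}]


history: List[Message] = []
history = append_turn(history, "user", "What are the benefits of renewable energy?")
history = append_turn(history, "assistant", "Renewable energy reduces greenhouse gas emissions.")
history = append_turn(history, "user", "Include cost considerations in the response.")

# Hypothetical request body: the full history travels with the follow-up prompt.
request_body = json.dumps({"messages": history})
```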
--- a/docs/source/build_with_arch/rag.rst
+++ b/docs/source/build_with_arch/rag.rst
@@ -1,10 +1,18 @@
 .. _arch_rag_guide:

-RAG Application
-===============
+RAG Apps
+========

 The following section describes how Arch can help you build faster, smarter and more accurate
-Retrieval-Augmented Generation (RAG) applications.
+Retrieval-Augmented Generation (RAG) applications, including fast and accurate RAG in multi-turn
+conversational scenarios.
+
+What is Retrieval-Augmented Generation (RAG)?
+---------------------------------------------
+RAG applications combine retrieval-based methods with generative AI models to provide more accurate,
+contextually relevant, and reliable outputs. These applications leverage external data sources to augment
+the capabilities of Large Language Models (LLMs), enabling them to retrieve and integrate specific information
+rather than relying solely on the LLM's internal knowledge.

 Parameter Extraction for RAG
 ----------------------------
@@ -33,60 +41,12 @@ Once the prompt targets are configured as above, handling those parameters is
    :caption: Parameter handling with Flask
    :linenos:

-[Coming Soon] `Drift Detection via Arch Intent-Markers <https://github.com/orgs/katanemo/projects/1/views/1?pane=issue&itemId=82697909>`_
-----------------------------------------------------------------------------------------------------------------------------------------
-Developers struggle to efficiently handle ``follow-up`` or ``clarification`` questions. Specifically, when users ask for
-changes or additions to previous responses their AI applications often generate entirely new responses instead of adjusting
-previous ones. Arch offers ``intent tracking`` as a feature so that developers can know when the user has shifted away from a
-previous intent so that they can dramatically improve retrieval accuracy, lower overall token cost and  improve the speed of
-their responses back to users.
+Multi-Turn RAG (Follow-up Questions)
+-------------------------------------
+Developers often `struggle <https://www.reddit.com/r/LocalLLaMA/comments/18mqwg6/best_practice_for_rag_with_followup_chat/>`_ to efficiently handle
+``follow-up`` or ``clarification`` questions. Specifically, when users ask for changes or additions to previous responses, it requires developers to
+rewrite prompts using LLMs with precise prompt engineering techniques. This process is slow, manual, error-prone and adds significant latency to the
+user experience.

-Arch uses its built-in lightweight NLI and embedding models to know if the user has steered away from an active intent.
-Arch's intent-drift detection mechanism is based on its :ref:`prompt target <prompt_target>` primtive. Arch tries to match an incoming
-prompt to one of the prompt_targets configured in the gateway. Once it detects that the user has moved away from an active
-active intent, Arch adds the ``x-arch-intent-marker`` headers to the request before sending it your application servers.
-
-.. literalinclude:: includes/rag/intent_detection.py
-    :language: python
-    :linenos:
-    :lines: 101-157
-    :emphasize-lines: 14-25
-    :caption: Intent Detection Example
-
-
-.. Note::
-
-   Arch is (mostly) stateless so that it can scale in an embarrassingly parrallel fashion. So, while Arch offers
-   intent-drift detetction, you still have to maintain converational state with intent drift as metadata. The
-   following code snippets show how easily you can build and enrich conversational history with Langchain (in Python),
-   so that you can use the most relevant prompts for your retrieval and for prompting upstream LLMs.
-
-
-Step 1: Define ConversationBufferMemory
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-.. literalinclude:: includes/rag/intent_detection.py
-    :language: python
-    :linenos:
-    :lines: 1-21
-
-Step 2: Update ConversationBufferMemory with Intents
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-.. literalinclude:: includes/rag/intent_detection.py
-    :language: python
-    :linenos:
-    :lines: 24-64
-
-Step 3: Get Messages based on latest drift
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-.. literalinclude:: includes/rag/intent_detection.py
-    :language: python
-    :linenos:
-    :lines: 67-80
-
-
-You can used the last set of messages that match to an intent to prompt an LLM, use it with an vector-DB for
-improved retrieval, etc. With Arch and a few lines of code, you can improve the retrieval accuracy, lower overall
-token cost and dramatically improve the speed of their responses back to users.
+Arch accurately detects and processes prompts in multi-turn scenarios so that you can build fast and accurate RAG apps in
+minutes. For additional details on how to build multi-turn RAG applications, please refer to our :ref:`multi-turn <arch_multi_turn_guide>` docs.
--- a/docs/source/concepts/tech_overview/prompt.rst
+++ b/docs/source/concepts/tech_overview/prompt.rst
@@ -87,7 +87,7 @@ For more details on how you can build agentic applications using Arch, see our f
 .. Note::
   `Arch-Function <https://huggingface.co/collections/katanemo/arch-function-66f209a693ea8df14317ad68>`_ is a collection of dedicated agentic models engineered in Arch to extract information from a (set of) prompts and executes necessary backend API calls.
   This allows for efficient handling of agentic tasks, such as scheduling data retrieval, by dynamically interacting with backend services.
-   Arch-Function achieves state-of-the-art performance, comparable with frontier models like Claude Sonnet 3.5 ang GPT-4, while being 100x cheaper ($0.05M/token hosted) and 10x faster (p50 latencies of 200ms).
+   Arch-Function achieves state-of-the-art performance, comparable with frontier models like Claude Sonnet 3.5 and GPT-4, while being 44x cheaper ($0.10M/token hosted) and 10x faster (p50 latencies of 200ms).

 Prompting LLMs
 --------------
--- a/docs/source/concepts/tech_overview/terminology.rst
+++ b/docs/source/concepts/tech_overview/terminology.rst
@@ -3,8 +3,16 @@
 Terminology
 ============

-A few definitions before we dive into the main architecture documentation. Arch borrows from Envoy's terminology
-to keep things consistent in logs, traces and in code.
+A few definitions before we dive into the main architecture documentation. Also note that Arch borrows from Envoy's terminology
+to keep things consistent in logs and traces, and introduces and clarifies concepts as they relate to LLM applications.
+
+**Agent**: An application that uses LLMs to handle wide-ranging tasks from users via prompts. This could be as simple
+as retrieving or summarizing data from an API, or as involved as triggering complex actions like adjusting ad campaigns or
+changing travel plans via prompts.
+
+**Arch Config**: Arch operates based on a configuration that controls the behavior of a single instance of the Arch gateway.
+This is where you enable capabilities like LLM routing, fast function calling (via prompt_targets), applying guardrails, and enabling critical
+features like metrics and tracing. For the full configuration reference of ``arch_config.yaml`` see :ref:`here <configuration_refernce>`.

 **Downstream(Ingress)**: An downstream client (web application, etc.) connects to Arch, sends prompts, and receives responses.

@@ -14,9 +22,10 @@ to keep things consistent in logs, traces and in code.
   :width: 100%
   :align: center

-**Listener**: A :ref:`listener <arch_overview_listeners>` is a named network location (e.g., port, address, path etc.) that Arch listens on to process prompts
-before forwarding them to your application server endpoints. rch enables you to configure one listener for downstream connections
-(like port 80, 443) and creates a separate internal listener for calls that initiate from your application code to LLMs.
+**Listener**: A :ref:`listener <arch_overview_listeners>` is a named network location (e.g., port, address, path etc.) that Arch
+listens on to process prompts before forwarding them to your application server endpoints. Arch enables you to configure one listener
+for downstream connections (like port 80, 443) and creates a separate internal listener for calls that initiate from your application
+code to LLMs.

 .. Note::

@@ -24,23 +33,18 @@ before forwarding them to your application server endpoints. rch enables you to
   that you can use (``127.0.0.1:12000``) to proxy egress calls originating from your application to LLMs (API-based or hosted).
   For more details, check out :ref:`LLM provider <llm_provider>`.

-**Instance**: An instance of the Arch gateway. When you start Arch it creates at most two processes. One to handle Layer 7
-networking operations (auth, tls, observability, etc) and the second process to serve models that enable it to make smart
-decisions on how to accept, handle and forward prompts. The second process is optional, as the model serving sevice could be
-hosted on a different network (an API call). But these two processes are considered a single instance of Arch.
+**Prompt Target**: Arch offers a primitive called :ref:`prompt target <prompt_target>` to help separate business logic from
+undifferentiated work in building generative AI apps. Prompt targets are endpoints that receive prompts that are processed by Arch.
+For example, Arch enriches incoming prompts with metadata like knowing when a request is a follow-up or clarifying prompt so that you
+can build faster, more accurate retrieval (RAG) apps. To support agentic apps, like scheduling travel plans or sharing comments on a
+document - via prompts, Arch uses its function calling abilities to extract critical information from the incoming prompt (or a set of
+prompts) needed by a downstream backend API or function call before calling it directly.

-**Prompt Target**: Arch offers a primitive called :ref:`prompt target <prompt_target>` to help separate business logic from undifferentiated
-work in building generative AI apps. Prompt targets are endpoints that receive prompts that are processed by Arch.
-For example, Arch enriches incoming prompts with metadata like knowing when a request is a follow-up or clarifying prompt
-so that you can build faster, more accurate retrieval (RAG) apps. To support agentic apps, like scheduling travel plans or
-sharing comments on a document - via prompts, Arch uses its function calling abilities to extract critical information from
-the incoming prompt (or a set of prompts) needed by a downstream backend API or function call before calling it directly.
+**Model Serving**: Arch is a set of `two` self-contained processes that are designed to run alongside your application servers
+(or on a separate host connected via a network). The :ref:`model serving <model_serving>` process helps Arch make intelligent decisions
+about the incoming prompts. The model server is designed to call the (fast) purpose-built LLMs in Arch.

 **Error Target**: :ref:`Error targets <error_target>` are those endpoints that receive forwarded errors from Arch when issues arise,
 such as failing to properly call a function/API, detecting violations of guardrails, or encountering other processing errors.
 These errors are communicated to the application via headers ``X-Arch-[ERROR-TYPE]``, allowing it to handle the errors gracefully
 and take appropriate actions.
-
-**Model Serving**: Arch is a set of `two` self-contained processes that are designed to run alongside your application servers
-(or on a separate hostconnected via a network).The :ref:`model serving <model_serving>` process helps Arch make intelligent decisions about the
-incoming prompts. The model server is designed to call the (fast) purpose-built LLMs in Arch.
--- a/docs/source/conf.py
+++ b/docs/source/conf.py
@@ -15,7 +15,7 @@ from sphinxawesome_theme.postprocess import Icons
 project = "Arch Docs"
 copyright = "2024, Katanemo Labs, Inc"
 author = "Katanemo Labs, Inc"
-release = " v0.1.5"
+release = " v0.1.7"

 # -- General configuration ---------------------------------------------------
 # https://www.sphinx-doc.org/en/master/usage/configuration.html#general-configuration
--- a/docs/source/get_started/quickstart.rst
+++ b/docs/source/get_started/quickstart.rst
@@ -25,7 +25,7 @@ Arch's CLI allows you to manage and interact with the Arch gateway efficiently.

   $ python -m venv venv
   $ source venv/bin/activate   # On Windows, use: venv\Scripts\activate
-   $ pip install archgw==0.1.6
+   $ pip install archgw==0.1.7


 Build AI Agent with Arch Gateway
--- a/docs/source/guides/observability/monitoring.rst
+++ b/docs/source/guides/observability/monitoring.rst
@@ -3,7 +3,75 @@
 Monitoring
 ==========

-Arch offers several monitoring metrics that help you understand three critical aspects of your application:
-latency, token usage, and error rates by an upstream LLM provider. Latency measures the speed at which your
-application is responding to users, which includes metrics like time to first token (TFT), time per output
-token (TOT) metrics, and the total latency as perceived by users.
+`OpenTelemetry <https://opentelemetry.io/>`_ is an open-source observability framework providing APIs
+and instrumentation for generating, collecting, processing, and exporting telemetry data, such as traces,
+metrics, and logs. Its flexible design supports a wide range of backends and seamlessly integrates with
+modern application tools.
+
+Arch acts as a *source* for several monitoring metrics related to **prompts** and **LLMs**, natively integrated
+via `OpenTelemetry <https://opentelemetry.io/>`_ to help you understand three critical aspects of your application:
+latency, token usage, and error rates by an upstream LLM provider. Latency measures the speed at which your application
+is responding to users, which includes metrics like time to first token (TFT), time per output token (TOT), and
+the total latency as perceived by users. Below are some screenshots of how Arch integrates natively with tools like
+`Grafana <https://grafana.com/grafana/dashboards/>`_ via `Prometheus <https://prometheus.io/>`_.
+
+
+Metrics Dashboard (via Grafana)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+.. image:: /_static/img/llm-request-metrics.png
+   :width: 100%
+   :align: center
+
+.. image:: /_static/img/input-token-metrics.png
+   :width: 100%
+   :align: center
+
+.. image:: /_static/img/output-token-metrics.png
+   :width: 100%
+   :align: center
+
+Configure Monitoring
+~~~~~~~~~~~~~~~~~~~~
+The Arch gateway publishes a stats endpoint at http://localhost:19901/stats. As noted above, Arch is a source for metrics. To view and manipulate dashboards, you will
+need to configure `Prometheus <https://prometheus.io/>`_ (as a metrics store) and `Grafana <https://grafana.com/grafana/dashboards/>`_ for dashboards. Below
+are some sample configuration files for both, respectively.
+
+.. code-block:: yaml
+    :caption: Sample prometheus.yaml config file
+
+    global:
+      scrape_interval: 15s
+      scrape_timeout: 10s
+      evaluation_interval: 15s
+    alerting:
+      alertmanagers:
+        - static_configs:
+            - targets: []
+          scheme: http
+          timeout: 10s
+          api_version: v2
+    scrape_configs:
+      - job_name: archgw
+        honor_timestamps: true
+        scrape_interval: 15s
+        scrape_timeout: 10s
+        metrics_path: /stats
+        scheme: http
+        static_configs:
+          - targets:
+              - host.docker.internal:19901
+        params:
+          format: ["prometheus"]
+
+
+.. code-block:: yaml
+    :caption: Sample grafana datasource.yaml config file
+
+    apiVersion: 1
+    datasources:
+      - name: Prometheus
+        type: prometheus
+        url: http://prometheus:9090
+        isDefault: true
+        access: proxy
+        editable: true
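A minimal sketch, not part of this commit, of reading the Prometheus text format that the scrape config above requests from `/stats` with `format=prometheus`. It parses a sample string instead of calling the endpoint. The metric names in `SAMPLE` are hypothetical placeholders, and the parser assumes sample lines carry no trailing timestamp.

```python
def parse_prometheus_text(text: str) -> dict:
    """Map each 'name{labels}' series to its float value, skipping comments and blank lines."""
    metrics = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # HELP/TYPE comments and empty lines carry no samples
        # The value is the last whitespace-separated field (assumes no timestamp).
        series, _, value = line.rpartition(" ")
        try:
            metrics[series] = float(value)
        except ValueError:
            continue  # ignore lines that do not end in a number
    return metrics


# Hypothetical sample; real metric names come from the gateway's stats endpoint.
SAMPLE = """\
# TYPE example_requests_total counter
example_requests_total{provider="openai"} 42
example_latency_seconds 0.2
"""

parsed = parse_prometheus_text(SAMPLE)
```

In practice Prometheus does this scraping itself. A parser like this is only useful for a quick sanity check that the endpoint emits the series a dashboard expects.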
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -14,7 +14,7 @@ Welcome to Arch!

   <a href="https://www.producthunt.com/posts/arch-3?embed=true&utm_source=badge-top-post-badge&utm_medium=badge&utm_souce=badge-arch&#0045;3" target="_blank"><img src="https://api.producthunt.com/widgets/embed-image/v1/top-post-badge.svg?post_id=565761&theme=light&period=daily" alt="Arch - Build&#0032;fast&#0044;&#0032;hyper&#0045;personalized&#0032;agents&#0032;with&#0032;intelligent&#0032;infra | Product Hunt" style="width: 250px; height: 54px;" width="250" height="54" /></a>

-`Arch <https://github.com/katanemo/arch>`_ is an intelligent infrastructure primitive for GenAI (built by the contributors of `Envoy <https://www.envoyproxy.io/>`_ ) that born out of the belief that:
+`Arch <https://github.com/katanemo/arch>`_ is an intelligent gateway for agents - an infrastructure primitive for GenAI (built by the contributors of `Envoy <https://www.envoyproxy.io/>`_ ). The project was born out of the belief that:

  *Prompts are nuanced and opaque user requests, which require the same capabilities as traditional HTTP requests including secure handling, intelligent routing, robust observability, and integration with backend (API) systems for personalization - all outside business logic.*

@@ -62,6 +62,7 @@ Welcome to Arch!

      build_with_arch/agent
      build_with_arch/rag
+      build_with_arch/multi_turn

  .. tab-item:: Resources

--- a/docs/source/resources/configuration_reference.rst
+++ b/docs/source/resources/configuration_reference.rst
@@ -1,11 +1,11 @@
-Configuration Reference
-============================
+.. _configuration_refernce:

-The following is a complete reference of the ``prompt-conifg.yml`` that controls the behavior of a single instance of
-the Arch gateway. We've kept things simple (less than 80 lines) and held off on exposing additional functionality (for
-e.g. suppporting push observability stats, managing prompt-endpoints as virtual cluster, exposing more load balancing
-options, etc). Our belief that the simple things, should be simple. So we offert good defaults for developers, so
-that they can spend more of their time in building features unique to their AI experience.
+Configuration Reference
+=======================
+
+The following is a complete reference of the ``arch_config.yaml`` that controls the behavior of a single instance of
+the Arch gateway. This is where you enable capabilities like routing to upstream LLM providers, defining prompt_targets
+where prompts get routed to, applying guardrails, and enabling critical agent observability features.

 .. literalinclude:: includes/arch_config_full_reference.yaml
    :language: yaml