Doc Update (#129)

* init update * Update terminology.rst * fix the branch to create an index.html, and fix pre-commit issues * Doc update * made several changes to the docs after Shuguang's revision * fixing pre-commit issues * fixed the reference file to the final prompt config file * added google analytics --------- Co-authored-by: Salman Paracha <salmanparacha@MacBook-Pro-261.local>
2026-07-17 16:31:04 +02:00 · 2024-10-06 16:54:34 -07:00 · 2024-10-06 16:54:34 -07:00 · 5c7567584d
commit 5c7567584d
parent 2a7b95582c
49 changed files with 1185 additions and 609 deletions
--- a/docs/source/concepts/tech_overview/listener.rst
+++ b/docs/source/concepts/tech_overview/listener.rst
@ -0,0 +1,37 @@
+.. _arch_overview_listeners:
+
+Listener
+---------
+Listener is a top level primitive in Arch, which simplifies the configuration required to bind incoming
+connections from downstream clients, and for egress connections to LLMs (hosted or API)
+
+Arch builds on Envoy's Listener subsystem to streamline connection managemet for developers. Arch minimizes
+the complexity of Envoy's listener setup by using best-practices and exposing only essential settings,
+making it easier for developers to bind connections without deep knowledge of Envoy’s configuration model. This
+simplification ensures that connections are secure, reliable, and optimized for performance.
+
+Downstream (Ingress)
+^^^^^^^^^^^^^^^^^^^^^^
+Developers can configure Arch to accept connections from downstream clients. A downstream listener acts as the
+primary entry point for incoming traffic, handling initial connection setup, including network filtering, gurdrails,
+and additional network security checks. For more details on prompt security and safety,
+see :ref:`here <arch_overview_prompt_handling>`
+
+Upstream (Egress)
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+Arch automatically configures a listener to route requests from your application to upstream LLM API providers (or hosts).
+When you start Arch, it creates a listener for egress traffic based on the presence of the ``llm_providers`` configuration
+section in the ``prompt_config.yml`` file. Arch binds itself to a local address such as ``127.0.0.1:9000/v1`` or a DNS-based
+address like ``arch.local:9000/v1`` for outgoing traffic. For more details on LLM providers, read :ref:`here <llm_provider>`
+
+Configure Listener
+^^^^^^^^^^^^^^^^^^
+
+To configure a Downstream (Ingress) Listner, simply add the ``listener`` directive to your ``prompt_config.yml`` file:
+
+.. literalinclude:: ../includes/arch_config.yaml
+    :language: yaml
+    :linenos:
+    :lines: 1-18
+    :emphasize-lines: 2-5
+    :caption: Example Configuration
--- a/docs/source/concepts/tech_overview/model_serving.rst
+++ b/docs/source/concepts/tech_overview/model_serving.rst
@ -0,0 +1,56 @@
+.. _arch_model_serving:
+
+Model Serving
+-------------
+
+Arch is a set of **two** self-contained processes that are designed to run alongside your application
+servers (or on a separate host connected via a network). The first process is designated to manage low-level
+networking and HTTP related comcerns, and the other process is for **model serving**, which helps Arch make
+intelligent decisions about the incoming prompts. The model server is designed to call the purpose-built
+LLMs in Arch.
+
+.. image:: /_static/img/arch-system-architecture.jpg
+   :align: center
+   :width: 50%
+
+_____________________________________________________________________________________________________________
+
+Arch' is designed to be deployed in your cloud VPC, on a on-premises host, and can work on devices that don't
+have a GPU. Note, GPU devices are need for fast and cost-efficient use, so that Arch (model server, specifically)
+can process prompts quickly and forward control back to the applicaton host. There are three modes in which Arch
+can be configured to run its **model server** subsystem:
+
+Local Serving (CPU - Moderate)
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+The following bash commands enable you to configure the model server subsystem in Arch to run local on device
+and only use CPU devices. This will be the slowest option but can be useful in dev/test scenarios where GPUs
+might not be available.
+
+.. code-block:: console
+
+    $ archgw up --local-cpu
+
+Local Serving (GPU- Fast)
+^^^^^^^^^^^^^^^^^^^^^^^^^
+The following bash commands enable you to configure the model server subsystem in Arch to run locally on the
+machine and utilize the GPU available for fast inference across all model use cases, including function calling
+guardails, etc.
+
+.. code-block:: console
+
+    $ archgw up --local
+
+Cloud Serving (GPU - Blazing Fast)
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+The command below instructs Arch to intelligently use GPUs locally for fast intent detection, but default to
+cloud serving for function calling and guardails scenarios to dramatically improve the speed and overall performance
+of your applications.
+
+.. code-block:: console
+
+    $ archgw up
+
+.. Note::
+    Arch's model serving in the cloud is priced at $0.05M/token (156x cheaper than GPT-4o) with averlage latency
+    of 200ms (10x faster than GPT-4o). Please refer to our :ref:`Get Started <quickstart>` to know
+    how to generate API keys for model serving
--- a/docs/source/concepts/tech_overview/prompt.rst
+++ b/docs/source/concepts/tech_overview/prompt.rst
@ -0,0 +1,136 @@
+.. _arch_overview_prompt_handling:
+
+Prompt
+=================
+
+Arch's primary design point is to securely accept, process and handle prompts. To do that effectively,
+Arch relies on Envoy's HTTP `connection management <https://www.envoyproxy.io/docs/envoy/v1.31.2/intro/arch_overview/http/http_connection_management>`_,
+subsystem and its **prompt handler** subsystem engineered with purpose-built LLMs to
+implement critical functionality on behalf of developers so that you can stay focused on business logic.
+
+.. Note::
+   Arch's **prompt handler** subsystem interacts with the **model** subsytem through Envoy's cluster manager
+   system to ensure robust, resilient and fault-tolerant experience in managing incoming prompts. Read more
+   about the :ref:`model subsystem <arch_model_serving>` and how the LLMs are hosted in Arch.
+
+Messages
+--------
+
+Arch accepts messages directly from the body of the HTTP request in a format that follows the `Hugging Face Messages API <https://huggingface.co/docs/text-generation-inference/en/messages_api>`_.
+This design allows developers to pass a list of messages, where each message is represented as a dictionary
+containing two key-value pairs:
+
+    - **Role**: Defines the role of the message sender, such as "user" or "assistant".
+    - **Content**: Contains the actual text of the message.
+
+
+Prompt Guardrails
+-----------------
+
+Arch is engineered with :ref:`Arch-Guard <prompt_guard>`, an industry leading safety layer, powered by a
+compact and high-performimg LLM that monitors incoming prompts to detect and reject jailbreak attempts -
+ensuring that unauthorized or harmful behaviors are intercepted early in the process.
+
+To add jailbreak guardrails, see example below:
+
+.. literalinclude:: ../includes/arch_config.yaml
+    :language: yaml
+    :linenos:
+    :lines: 1-45
+    :emphasize-lines: 22-26
+    :caption: Example Configuration
+
+.. Note::
+   As a roadmap item, Arch will expose the ability for developers to define custom guardrails via Arch-Guard-v2,
+   and add support for additional safety checks defined by developers and hazardous categories like, violent crimes, privacy, hate,
+   etc. To offer feedback on our roadmap, please visit our `github page <https://github.com/orgs/katanemo/projects/1>`_
+
+
+Prompt Targets
+--------------
+
+Once a prompt passes any configured guardrail checks, Arch processes the contents of the incoming conversation
+and identifies where to forwad the conversation to via its ``prompt_targets`` primitve. Prompt targets are endpoints
+that receive prompts that are processed by Arch. For example, Arch enriches incoming prompts with metadata like knowing
+when a user's intent has changed so that you can build faster, more accurate RAG apps.
+
+Configuring ``prompt_targets`` is simple. See example below:
+
+.. literalinclude:: ../includes/arch_config.yaml
+    :language: yaml
+    :linenos:
+    :emphasize-lines: 29-38
+    :caption: Example Configuration
+
+
+Intent Detection and Prompt Matching:
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Arch uses fast Natural Language Inference (NLI) and embedding approaches to first detect the intent of each
+incoming prompt. This intent detection phase analyzes the prompt's content and matches it against predefined
+prompt targets, ensuring that each prompt is forwarded to the most appropriate endpoint. Arch’s intent
+detection framework considers both the name and description of each prompt target, and uses a composite matching
+score between an NLI and cosine similarity to enchance accuracy in forwarding decisions.
+
+- **Embeddings**: By embedding the prompt and comparing it to known target vectors, Arch effectively identifies
+  the closest match, ensuring that the prompt is handled by the correct downstream service.
+
+- **NLI**: NLI techniques further refine the matching process by evaluating the semantic alignment between the
+  prompt and potential targets.
+
+Agentic Apps via Prompt Targets
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+To support agentic apps, like scheduling travel plans or sharing comments on a document - via prompts, Arch uses
+its function calling abilities to extract critical information from the incoming prompt (or a set of prompts)
+needed by a downstream backend API or function call before calling it directly. For more details on how you can
+build agentic applications using Arch, see our full guide :ref:`here <arch_agent_guide>`:
+
+.. Note::
+   Arch :ref:`Arch-Function <function_calling>` is the dedicated agentic model engineered in Arch to extract information from
+   a (set of) prompts and executes necessary backend API calls. This allows for efficient handling of agentic tasks,
+   such as scheduling data retrieval, by dynamically interacting with backend services. Arch-Function is a flagship 1.3
+   billion parameter model that matches performance  with frontier models like Claude Sonnet 3.5 ang GPT-4, while
+   being 100x cheaper ($0.05M/token hosted) and 10x faster (p50 latencies of 200ms).
+
+Prompting LLMs
+--------------
+Arch is a single piece of software that is designed to manage both ingress and egress prompt traffic, drawing its
+distributed proxy nature from the robust `Envoy <https://envoyproxy.io>`_. This makes it extremely efficient and capable
+of handling upstream connections to LLMs. If your application is originating code to an API-based LLM, simply use
+the OpenAI client and configure it with Arch. By sending traffic through Arch, you can propagate traces, manage and monitor
+traffic, apply rate limits, and utilize a large set of traffic management capabilities in a centralized way.
+
+.. Attention::
+   When you start Arch, it automatically creates a listener port for egress calls to upstream LLMs. This is based on the
+   ``llm_providers`` configuration section in the ``prompt_config.yml`` file. Arch binds itself to a local address such as
+   127.0.0.1:51001/v1.
+
+
+Example: Using OpenAI Client with Arch as an Egress Gateway
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+.. code-block:: python
+
+   import openai
+
+   # Set the OpenAI API base URL to the Arch gateway endpoint
+   openai.api_base = "http://127.0.0.1:51001/v1"
+
+   # No need to set openai.api_key since it's configured in Arch's gateway
+
+   # Use the OpenAI client as usual
+   response = openai.Completion.create(
+      model="text-davinci-003",
+      prompt="What is the capital of France?"
+   )
+
+   print("OpenAI Response:", response.choices[0].text.strip())
+
+In these examples:
+
+    The OpenAI client is used to send traffic directly through the Arch egress proxy to the LLM of your choice, such as OpenAI.
+    The OpenAI client is configured to route traffic via Arch by setting the proxy to 127.0.0.1:51001, assuming Arch is
+    running locally and bound to that address and port.
+
+This setup allows you to take advantage of Arch's advanced traffic management features while interacting with LLM APIs like OpenAI.
--- a/docs/source/concepts/tech_overview/request_lifecycle.rst
+++ b/docs/source/concepts/tech_overview/request_lifecycle.rst
@ -0,0 +1,176 @@
+.. _lifecycle_of_a_request:
+
+Request Lifecycle
+=================
+
+Below we describe the events in the lifecycle of a request passing through an Arch gateway instance. We first
+describe how Arch fits into the request path and then the internal events that take place following
+the arrival of a request at Arch from downtream clients. We follow the request until the corresponding
+dispatch upstream and the response path.
+
+.. image:: /_static/img/network-topology-ingress-egress.jpg
+   :width: 100%
+   :align: center
+
+Terminology
+-----------
+
+We recommend that you get familiar with some of the :ref:`terminology <arch_terminology>` used in Arch
+before reading this section.
+
+Network topology
+----------------
+
+How a request flows through the components in a network (including Arch) depends on the network’s topology.
+Arch can be used in a wide variety of networking topologies. We focus on the inner operation of Arch below,
+but briefly we address how Arch relates to the rest of the network in this section.
+
+- **Downstream(Ingress)** listeners take requests from upstream clients like a web UI or clients that forward
+  prompts to you local application responses from the application flow back through Arch to the downstream.
+
+- **Upstream(Egress)** listeners take requests from the application and forward them to LLMs.
+
+.. image:: /_static/img/network-topology-ingress-egress.jpg
+   :width: 100%
+   :align: center
+
+In practice, Arch can be deployed on the edge and as an internal load balancer between AI agents. A request path may
+traverse multiple Arch gateways:
+
+.. image:: /_static/img/network-topology-agent.jpg
+   :width: 100%
+   :align: center
+
+
+High level architecture
+-----------------------
+Arch is a set of **two** self-contained processes that are designed to run alongside your application servers
+(or on a separate server connected to your application servers via a network). The first process is designated
+to manage HTTP-level networking and connection management concerns (protocol management, request id generation,
+header sanitization, etc.), and the other process is for **model serving**, which helps Arch make intelligent
+decisions about the incoming prompts. The model server hosts the purpose-built LLMs to
+manage several critical, but undifferentiated, prompt related tasks on behalf of developers.
+
+
+The request processing path in Arch has three main parts:
+
+* :ref:`Listener subsystem <arch_overview_listeners>` which handles **downstream** and **upstream** request
+  processing. It is responsible for managing the downstream (ingress) and the upstream (egress) request
+  lifecycle. The downstream and upstream HTTP/2 codec lives here.
+* :ref:`Prompt handler subsystem <arch_overview_prompt_handling>` which is responsible for selecting and
+  forwarding prompts ``prompt_targets`` and establishes the lifecycle of any **upstream** connection to a
+  hosted endpoint that implements domain-specific business logic for incoming promots. This is where knowledge
+  of targets and endpoint health, load balancing and connection pooling exists.
+* :ref:`Model serving subsystem <arch_model_serving>` which helps Arch make intelligent decisions about the
+  incoming prompts. The model server is designed to call the purpose-built LLMs in Arch.
+
+The three subsystems are bridged with either the HTTP router filter, and the cluster manager subsystems of Envoy.
+
+Also, Arch utilizes `Envoy event-based thread model <https://blog.envoyproxy.io/envoy-threading-model-a8d44b922310>`_.
+A main thread is responsible forthe server lifecycle, configuration processing, stats, etc. and some number of
+:ref:`worker threads <arch_overview_threading>` process requests. All threads operate around an event loop (`libevent <https://libevent.org/>`_)
+and any given downstream TCP connection will be handled by exactly one worker thread for its lifetime. Each worker
+thread maintains its own pool of TCP connections to upstream endpoints.
+
+Worker threads rarely share state and operate in a trivially parallel fashion. This threading model
+enables scaling to very high core count CPUs.
+
+Configuration
+-------------
+
+Today, only support a static bootstrap configuration file for simplicity today:
+
+.. literalinclude:: ../includes/arch_config.yaml
+    :language: yaml
+
+
+Request Flow (Ingress)
+----------------------
+
+Overview
+^^^^^^^^
+A brief outline of the lifecycle of a request and response using the example configuration above:
+
+1. **TCP Connection Establishment**:
+   A TCP connection from downstream is accepted by an Arch listener running on a worker thread.
+   The listener filter chain provides SNI and other pre-TLS information. The transport socket, typically TLS,
+   decrypts incoming data for processing.
+
+2. **Prompt Guardrails Check**:
+   Arch first checks the incoming prompts for guardrails such as jailbreak attempts. This ensures
+   that harmful or unwanted behaviors are detected early in the request processing pipeline.
+
+3. **Intent Matching**:
+   The decrypted data stream is deframed by the HTTP/2 codec in Arch's HTTP connection manager. Arch performs
+   intent matching via is **prompt-handler** subsystem using the name and description of the defined prompt targets,
+   determining which endpoint should handle the prompt.
+
+4. **Parameter Gathering with Arch-FC**:
+   If a prompt target requires specific parameters, Arch engages Arch-FC to extract the necessary details
+   from the incoming prompt(s). This process gathers the critical information needed for downstream API calls.
+
+5. **API Call Execution**:
+   Arch routes the prompt to the appropriate backend API or function call. If an endpoint cluster is identified,
+   load balancing is performed, circuit breakers are checked, and the request is proxied to the upstream endpoint.
+
+6. **Default Summarization by Upstream LLM**:
+   By default, if no specific endpoint processing is needed, the prompt is sent to an upstream LLM for summarization.
+   This ensures that responses are concise and relevant, enhancing user experience in RAG (Retrieval-Augmented Generation)
+   and agentic applications.
+
+7. **Error Handling and Forwarding**:
+   Errors encountered during processing, such as failed function calls or guardrail detections, are forwarded to
+   designated error targets. Error details are communicated through specific headers to the application:
+
+   - ``X-Function-Error-Code``: Code indicating the type of function call error.
+   - ``X-Prompt-Guard-Error-Code``: Code specifying violations detected by prompt guardrails.
+   - Additional headers carry messages and timestamps to aid in debugging and logging.
+
+8. **Response Handling**:
+   The upstream endpoint’s TLS transport socket encrypts the response, which is then proxied back downstream.
+   Responses pass through HTTP filters in reverse order, ensuring any necessary processing or modification before final delivery.
+
+
+Request Flow (Egress)
+---------------------
+
+Overview
+--------
+
+A brief outline of the lifecycle of a request and response in the context of egress traffic from an application
+to Large Language Models (LLMs) via Arch:
+
+1. **HTTP Connection Establishment to LLM**:
+   Arch initiates an HTTP connection to the upstream LLM service. This connection is handled by Arch’s egress listener
+   running on a worker thread. The connection typically uses a secure transport protocol such as HTTPS, ensuring the
+   prompt data is encrypted before being sent to the LLM service.
+
+2. **Rate Limiting**:
+   Before sending the request to the LLM, Arch applies rate-limiting policies to ensure that the upstream LLM service
+   is not overwhelmed by excessive traffic. Rate limits are enforced per client or service, ensuring fair usage and
+   preventing accidental or malicious overload. If the rate limit is exceeded, Arch may return an appropriate HTTP
+   error (e.g., 429 Too Many Requests) without sending the prompt to the LLM.
+
+3. **Load Balancing to (hosted) LLM Endpoints**:
+   After passing the rate-limiting checks, Arch routes the prompt to the appropriate LLM endpoint.
+   If multiple LLM providers instances are available, load balancing is performed to distribute traffic evenly
+   across the instances. Arch checks the health of the LLM endpoints using circuit breakers and health checks,
+   ensuring that the prompt is only routed to a healthy, responsive instance.
+
+4. **Response Reception and Forwarding**:
+   Once the LLM processes the prompt, Arch receives the response from the LLM service. The response is typically a
+   generated text, completion, or summarization. Upon reception, Arch decrypts (if necessary) and handles the response,
+   passing it through any egress processing pipeline defined by the application, such as logging or additional response filtering.
+
+
+Post-request processing
+^^^^^^^^^^^^^^^^^^^^^^^^
+Once a request completes, the stream is destroyed. The following also takes places:
+
+* The post-request :ref:`monitoring <monitoring>` are updated (e.g. timing, active requests, upgrades, health checks).
+  Some statistics are updated earlier however, during request processing. Stats are batchedand written by the main
+  thread periodically.
+* :ref:`Access logs <arch_access_logging>` are written to the access log
+* :ref:`Trace <arch_overview_tracing>` spans are finalized. If our example request was traced, a
+  trace span, describing the duration and details of the request would be created by the HCM when
+  processing request headers and then finalized by the HCM during post-request processing.
--- a/docs/source/concepts/tech_overview/tech_overview.rst
+++ b/docs/source/concepts/tech_overview/tech_overview.rst
@ -0,0 +1,14 @@
+.. _tech_overview:
+
+Tech Overview
+=============
+
+.. toctree::
+    :maxdepth: 2
+
+    terminology
+    threading_model
+    listener
+    model_serving
+    prompt
+    request_lifecycle
--- a/docs/source/concepts/tech_overview/terminology.rst
+++ b/docs/source/concepts/tech_overview/terminology.rst
@ -0,0 +1,46 @@
+.. _arch_terminology:
+
+Terminology
+============
+
+A few definitions before we dive into the main architecture documentation. Arch borrows from Envoy's terminology
+to keep things consistent in logs, traces and in code.
+
+**Downstream(Ingress)**: An downstream client (web application, etc.) connects to Arch, sends prompts, and receives responses.
+
+**Upstream(Egress)**: An upstream host that receives connections and prompts from Arch, and returns context or responses for a prompt
+
+.. image:: /_static/img/network-topology-ingress-egress.jpg
+   :width: 100%
+   :align: center
+
+**Listener**: A listener is a named network location (e.g., port, address, path etc.) that Arch listens on to process prompts
+before forwarding them to your application server endpoints. rch enables you to configure one listener for downstream connections
+(like port 80, 443) and creates a separate internal listener for calls that initiate from your application code to LLMs.
+
+.. Note::
+
+   When you start Arch, you specify a listener address/port that you want to bind downstream. But, Arch uses are predefined port
+   that you can use (``127.0.0.1:10000``) to proxy egress calls originating from your application to LLMs (API-based or hosted).
+   For more details, check out :ref:`LLM providers <llm_provider>`
+
+**Instance**: An instance of the Arch gateway. When you start Arch it creates at most two processes. One to handle Layer 7
+networking operations (auth, tls, observability, etc) and the second process to serve models that enable it to make smart
+decisions on how to accept, handle and forward prompts. The second process is optional, as the model serving sevice could be
+hosted on a different network (an API call). But these two processes are considered a single instance of Arch.
+
+**Prompt Targets**: Arch offers a primitive called ``prompt_targets`` to help separate business logic from undifferentiated
+work in building generative AI apps. Prompt targets are endpoints that receive prompts that are processed by Arch.
+For example, Arch enriches incoming prompts with metadata like knowing when a request is a follow-up or clarifying prompt
+so that you can build faster, more accurate retrieval (RAG) apps. To support agentic apps, like scheduling travel plans or
+sharing comments on a document - via prompts, Bolt uses its function calling abilities to extract critical information from
+the incoming prompt (or a set of prompts) needed by a downstream backend API or function call before calling it directly.
+
+**Error Targets**: Error targets are those endpoints that receive forwarded errors from Arch when issues arise,
+such as failing to properly call a function/API, detecting violations of guardrails, or encountering other processing errors.
+These errors are communicated to the application via headers (X-Arch-[ERROR-TYPE]), allowing it to handle the errors gracefully
+and take appropriate actions.
+
+**Model Serving**: Arch is a set of **two** self-contained processes that are designed to run alongside your application servers
+(or on a separate hostconnected via a network).The  **model serving** process helps Arch make intelligent decisions about the
+incoming prompts. The model server is designed to call the (fast) purpose-built LLMs in Arch.
--- a/docs/source/concepts/tech_overview/threading_model.rst
+++ b/docs/source/concepts/tech_overview/threading_model.rst
@ -0,0 +1,21 @@
+.. _arch_overview_threading:
+
+Threading Model
+===============
+
+Arch builds on top of Envoy's single process with multiple threads architecture.
+
+A single *primary* thread controls various sporadic coordination tasks while some number of *worker*
+threads perform filtering, and forwarding.
+
+Once a connection is accepted, the connection spends the rest of its lifetime bound to a single worker
+thread. All the functionality around prompt handling from a downstream client is handled in a separate worker thread.
+This allows the majority of Arch to be largely single threaded (embarrassingly parallel) with a small amount
+of more complex code handling coordination between the worker threads.
+
+Generally Arch is written to be 100% non-blocking.
+
+.. tip::
+
+   For most workloads we recommend configuring the number of worker threads to be equal to the number of
+   hardware threads on the machine.