Mirror of https://github.com/katanemo/plano.git, synced 2026-05-12 09:12:43 +02:00

Adil/fix salman docs (#75)

* added the first set of docs for our technical docs
* more documentation changes
* added support for prompt processing and updated life of a request
* updated docs to include getting-help sections and updated life of a request
* committing local changes for getting-started guide, sample applications, and full reference spec for prompt-config
* updated configuration reference, added sample app skeleton, updated favicon
* fixed the configuration reference file, and made minor changes to the intent detection; commit v1 for now
* updated docs with use cases and example code, updated what-is-arch, and made minor changes throughout
* fixed images and minor doc fixes
* add sphinx_book_theme
* updated README, and made some minor fixes to documentation
* fixed README.md
* fixed image width

Co-authored-by: Salman Paracha <salmanparacha@MacBook-Pro-261.local>
Co-authored-by: Adil Hafeez <adil@katanemo.com>

Parent 2d31aeaa36, commit 13dff3089d. 33 changed files with 931 additions and 287 deletions.

Prompt Processing
=================

.. contents::
   :local:
   :depth: 2

Arch's model serving process is designed to securely handle incoming prompts by detecting jailbreak attempts,
processing the prompts, and routing them to appropriate functions or prompt targets based on intent detection.
The serving workflow integrates several key components, each playing a crucial role in managing generative
AI interactions:

Jailbreak and Toxicity Guardrails
---------------------------------

Arch employs Arch-Guard, a security layer powered by a compact, high-performing LLM that monitors incoming
prompts to detect and reject jailbreak attempts, ensuring that unauthorized or harmful behaviors are intercepted
early in the process. Arch-Guard is the leading model in the industry for jailbreak and toxicity detection.
Configuring guardrails is simple; see the example below.

.. literalinclude:: /_config/getting-started.yml
   :language: yaml
   :linenos:
   :emphasize-lines: 24-27
   :caption: :download:`arch-getting-started.yml </_config/getting-started.yml>`

Prompt Targets
--------------

Once a prompt passes the security checks, Arch processes the content and identifies whether any specific
functions need to be called. Arch-FC1B, a dedicated function-calling module, extracts critical information from
the prompt and executes the necessary backend API calls or internal functions. This capability allows for
efficient handling of agentic tasks, such as scheduling or data retrieval, by dynamically interacting with
backend services.

.. image:: /_static/img/function-calling-network-flow.jpg
   :width: 100%

Intent Detection and Prompt Matching
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Arch uses Natural Language Inference (NLI) and embedding-based approaches to detect the intent of each incoming
prompt. This intent detection phase analyzes the prompt's content and matches it against predefined prompt
targets, ensuring that each prompt is forwarded to the most appropriate endpoint. Arch's intent detection
framework considers both the name and description of each prompt target, enhancing accuracy in forwarding
decisions.

- **Embedding Approaches**: By embedding the prompt and comparing it to known target vectors, Arch effectively
  identifies the closest match, ensuring that the prompt is handled by the correct downstream service.

- **NLI Integration**: Natural Language Inference techniques further refine the matching process by evaluating
  the semantic alignment between the prompt and potential targets.
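
The embedding-based matching above can be sketched as follows. This is an illustrative toy, not Arch's
implementation: a real deployment would use a sentence-encoder model rather than the bag-of-words "embedding"
below, and the target names and descriptions are hypothetical.

```python
from collections import Counter
from math import sqrt

def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a real system would use a neural encoder."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical prompt targets: both the name and the description contribute
# to the vector each incoming prompt is compared against.
TARGETS = {
    "schedule_meeting": "schedule or reschedule a meeting on the calendar",
    "get_weather": "get the current weather forecast for a city",
}

def match_target(prompt: str) -> str:
    """Return the prompt target whose name + description best matches the prompt."""
    scores = {
        name: cosine(embed(prompt), embed(f"{name.replace('_', ' ')} {desc}"))
        for name, desc in TARGETS.items()
    }
    return max(scores, key=scores.get)
```

In a production gateway the target vectors would be precomputed once from the configured prompt targets, and an
NLI model would re-rank the top candidates.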

Forwarding Prompts to Downstream Targets
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

After determining the correct target, Arch forwards the prompt to the designated endpoint, such as an LLM host
or API service. This seamless routing mechanism integrates with Arch's broader ecosystem, enabling efficient
communication and response generation tailored to the user's intent.

Arch's model serving process combines robust security measures with advanced intent detection and
function-calling capabilities, creating a reliable and adaptable environment for managing generative AI
workflows. This approach not only enhances the accuracy and relevance of responses but also safeguards against
malicious usage patterns, aligning with best practices in AI governance.

Introduction
============

.. toctree::
   :maxdepth: 2

   what_is_arch
   architecture/architecture
   life_of_a_request
   getting_help

Life of a Request
=================

Below we describe the events in the life of a request passing through an Arch gateway instance. We first
describe how Arch fits into the request path, and then the internal events that take place following the
arrival of a request at Arch from downstream clients. We follow the request until the corresponding dispatch
upstream and the response path.

.. image:: /_static/img/network-topology-ingress-egress.jpg
   :width: 100%
   :align: center

Terminology
-----------

Arch uses the following terms throughout its codebase and documentation:

* *Listeners*: The Arch primitive responsible for binding to an IP/port, accepting new HTTP connections, and
  orchestrating the downstream-facing aspects of prompt processing. Arch relies almost exclusively on
  `Envoy's Listener subsystem <arch_overview_listeners>`_.
* *Downstream*: An entity connecting to Arch. This may be another AI agent (sidecar or networked) or a remote
  client.
* *LLM Providers*: A set of upstream LLMs (API-based or network nodes) that Arch routes/forwards user and
  application-specific prompts to. Arch offers a simple abstraction to call different LLMs via model id, and
  adds LLM-specific retry, failover, and routing capabilities. Arch builds on top of Envoy's
  `Cluster subsystem <https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/upstream/cluster_manager#arch-overview-cluster-manager>`_.
* *Upstream*: A set of hosts that can receive traffic from an instance of the Arch gateway.
* *Prompt Targets*: A core primitive offered in Arch. Prompt targets are endpoints that receive prompts
  processed by Arch. For example, Arch enriches incoming prompts with metadata, like knowing when a request is a
  follow-up or clarifying prompt, so that you can build faster, more accurate RAG apps. To support agentic apps
  - like scheduling travel plans or sharing comments on a document via prompts - prompt targets map incoming
  prompts to the functions or APIs that fulfill them.

Network topology
----------------

How a request flows through the components in a network (including Arch) depends on the network's topology.
Arch can be used in a wide variety of networking topologies. We focus on the inner operation of Arch below, but
briefly address how Arch relates to the rest of the network in this section.

* Ingress listeners take requests from downstream clients, like a web UI or clients that forward prompts to your
  local application. Responses from the local application flow back through Arch to the downstream.

* Egress listeners take requests from the local application and forward them to LLMs. These receiving nodes will
  also typically be running Arch and accepting the request via their ingress listeners.

.. image:: /_static/img/network-topology-ingress-egress.jpg
   :width: 100%
   :align: center

In practice, Arch can be deployed at the edge and as an internal load balancer between AI agents. A request path
may traverse multiple Arch gateways:

.. image:: /_static/img/network-topology-agent.jpg

The request processing path in Arch has two main parts:

The two subsystems are bridged with the HTTP router filter, which forwards the HTTP request from downstream to
upstream.

Arch utilizes Envoy's `event-based thread model <https://blog.envoyproxy.io/envoy-threading-model-a8d44b922310>`_.
A main thread is responsible for the server lifecycle, configuration processing, stats, etc., and some number of
:ref:`worker threads <arch_overview_threading>` process requests. All threads operate around an event loop
(`libevent <https://libevent.org/>`_), and any given downstream TCP connection will be handled by exactly one
worker thread for its lifetime. Each worker thread maintains its own pool of TCP connections to upstream
endpoints. Today, Arch implements its core functionality around prompt handling in worker threads.

Worker threads rarely share state and operate in a trivially parallel fashion. This threading model enables
scaling to very high core count CPUs.

Overview
^^^^^^^^

A brief outline of the life cycle of a request and response, using the example configuration above:

1. **TCP Connection Establishment**:
   A TCP connection from downstream is accepted by an Arch listener running on a worker thread. The listener
   filter chain provides SNI and other pre-TLS information. The transport socket, typically TLS, decrypts
   incoming data for processing.

2. **Prompt Guardrails Check**:
   Arch first checks the incoming prompts for guardrails such as jailbreak attempts and toxicity. This ensures
   that harmful or unwanted behaviors are detected early in the request processing pipeline.

3. **Intent Matching**:
   The decrypted data stream is deframed by the HTTP/2 codec in Arch's HTTP connection manager. Arch performs
   intent matching using the name and description of the defined prompt targets, determining which endpoint
   should handle the prompt.

4. **Parameter Gathering with Arch-FC1B**:
   If a prompt target requires specific parameters, Arch engages Arch-FC1B to extract the necessary details from
   the incoming prompt(s). This process gathers the critical information needed for downstream API calls.

5. **API Call Execution**:
   Arch routes the prompt to the appropriate backend API or function call. If an endpoint cluster is identified,
   load balancing is performed, circuit breakers are checked, and the request is proxied to the upstream
   endpoint. For more details on routing and load balancing, refer to the
   `Envoy routing documentation <https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/intro/arch_overview>`_.

6. **Default Summarization by Upstream LLM**:
   By default, if no specific endpoint processing is needed, the prompt is sent to an upstream LLM for
   summarization. This ensures that responses are concise and relevant, enhancing user experience in RAG
   (Retrieval-Augmented Generation) and agentic applications.

7. **Error Handling and Forwarding**:
   Errors encountered during processing, such as failed function calls or guardrail detections, are forwarded to
   designated error targets. Error details are communicated through specific headers to the application:

   - ``X-Function-Error-Code``: Code indicating the type of function call error.
   - ``X-Prompt-Guard-Error-Code``: Code specifying violations detected by prompt guardrails.
   - Additional headers carry messages and timestamps to aid in debugging and logging.

8. **Response Handling**:
   The upstream endpoint's TLS transport socket encrypts the response, which is then proxied back downstream.
   Responses pass through HTTP filters in reverse order, ensuring any necessary processing or modification
   before final delivery.
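
As an illustration of step 7, an application can inspect these headers on a response. The header names come from
the list above; how you obtain the ``headers`` mapping depends on your HTTP client (e.g. ``response.headers`` in
most libraries), so that plumbing is left out of this sketch.

```python
# Collect any error signals Arch attached to a response, using the header
# names documented above.
def arch_errors(headers: dict) -> list[str]:
    errors = []
    if code := headers.get("X-Function-Error-Code"):
        errors.append(f"function call error: {code}")
    if code := headers.get("X-Prompt-Guard-Error-Code"):
        errors.append(f"prompt guardrail violation: {code}")
    return errors
```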

What is Arch
============

Arch is an intelligent `(Layer 7) <https://www.cloudflare.com/learning/ddos/what-is-layer-7/>`_ gateway designed
for generative AI apps, AI agents, and Co-pilots that work with prompts. Engineered with purpose-built
:ref:`LLMs <llms_in_arch>`, Arch handles the critical but undifferentiated tasks related to the handling and
processing of prompts, including detecting and rejecting `jailbreak <https://github.com/verazuo/jailbreak_llms>`_
attempts, intelligently calling "backend" APIs to fulfill the user's request represented in a prompt, routing to
and offering disaster recovery between upstream LLMs, and managing the observability of prompts and LLM
interactions in a centralized way.

**The project was born out of the belief that:**

*Prompts are nuanced and opaque user requests, which require the same capabilities as traditional HTTP requests,
including secure handling, intelligent routing, robust observability, and integration with backend (API)
systems for personalization - all outside business logic.*

In practice, achieving the above goal is incredibly difficult. Arch attempts to do so by providing the following
high-level features:

_____________________________________________________________________________________________________________

**Out-of-process architecture, built on** `Envoy <http://envoyproxy.io/>`_: Arch takes a dependency on Envoy and
is a self-contained process that is designed to run alongside your application servers. Arch uses Envoy's HTTP
connection management subsystem, HTTP L7 filtering, and telemetry capabilities to extend that functionality
exclusively for prompts and LLMs. This gives Arch several advantages:

* Arch works with any application language. A single Arch deployment can act as a gateway for AI applications
  written in Python, Java, C++, Go, PHP, etc.
* Arch builds on Envoy's proven success. Envoy is used at massive scale by the leading technology companies of
  our time, including `AirBnB <https://www.airbnb.com>`_, `Dropbox <https://www.dropbox.com>`_,
  `Google <https://www.google.com>`_, `Reddit <https://www.reddit.com>`_, `Stripe <https://www.stripe.com>`_,
  etc. It's battle tested, scales linearly with usage, and enables developers to focus on what really matters:
  application features and business logic.
* Arch can be deployed and upgraded quickly across your infrastructure transparently, without the pain of
  deploying library upgrades in your applications.

**Engineered with Fast LLMs:** Arch is engineered with specialized (sub-billion parameter) LLMs that are
designed for fast, cost-effective, and accurate handling of prompts. These :ref:`LLMs <llms_in_arch>` are
designed to be best-in-class for critical prompt-related tasks like:

* **Function/API Calling:** Arch helps you easily personalize your applications by enabling calls to
  application-specific (API) operations via user prompts. This involves any predefined functions or APIs you
  want to expose to users to perform tasks, gather information, or manipulate data. With function calling, you
  have the flexibility to support "agentic" experiences tailored to specific use cases - from updating insurance
  claims to creating ad campaigns - via prompts. Arch analyzes prompts, extracts critical information from them,
  engages in lightweight conversation with the user to gather any missing parameters, and makes API calls so
  that you can focus on writing business logic. For more details, read
  :ref:`prompt processing <arch_overview_prompt_handling>`.

* **Prompt Guardrails:** Arch helps you improve the safety of your application by applying prompt guardrails in
  a centralized way for better governance hygiene. With prompt guardrails you can prevent
  `jailbreak <https://github.com/verazuo/jailbreak_llms>`_ attempts or toxicity present in users' prompts
  without having to write a single line of code. To learn more about how to configure the guardrails available
  in Arch, read :ref:`prompt processing <arch_overview_prompt_handling>`.

* **Intent-Drift Detection:** Developers struggle to handle `follow-up <https://www.reddit.com/r/ChatGPTPromptGenius/comments/17dzmpy/how_to_use_rag_with_conversation_history_for/?>`_
  or `clarifying <https://www.reddit.com/r/LocalLLaMA/comments/18mqwg6/best_practice_for_rag_with_followup_chat/>`_
  questions. Specifically, when users ask for modifications or additions to previous responses, their AI
  applications often generate entirely new responses instead of adjusting the previous ones. Arch offers
  intent-drift detection as a feature so that developers know when the user has shifted away from the previous
  intent, so that they can improve their retrieval, lower overall token cost, and dramatically improve the speed
  and accuracy of their responses back to users.
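
To make the function-calling flow concrete, here is a hypothetical backend operation that a prompt target might
map to. The function name, parameters, and statuses are illustrative only; Arch's role would be to extract
``claim_id`` and ``status`` from the user's prompt (asking a follow-up question if one is missing) before
invoking it.

```python
# Hypothetical backend operation exposed via a prompt target. Arch extracts
# the parameters from the user's prompt; the business logic stays in your
# application code.
def update_insurance_claim(claim_id: str, status: str) -> dict:
    allowed = {"open", "under_review", "approved", "denied"}
    if status not in allowed:
        raise ValueError(f"unknown status: {status!r}")
    # A real implementation would update a database or call an internal API.
    return {"claim_id": claim_id, "status": status, "updated": True}
```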

**Traffic Management:** Arch offers several capabilities for LLM calls originating from your applications,
including a vendor-agnostic SDK to make LLM calls, smart retries on errors from upstream LLMs, and automatic
cutover to other LLMs configured in Arch for continuous availability and disaster recovery scenarios. Arch
extends Envoy's `cluster subsystem <https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/upstream/cluster_manager>`_
to manage upstream connections to LLMs so that you can build resilient AI applications.
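
The retry-and-cutover behavior can be sketched as a loop over configured providers. The provider callables below
are stand-ins for illustration, not Arch's SDK; in Arch itself this policy lives in the gateway configuration.

```python
import time

def call_with_failover(prompt, providers, retries_per_provider=2, backoff_s=0.0):
    """Try each provider in order, retrying on errors, then cut over to the next."""
    last_error = None
    for call in providers:
        for attempt in range(retries_per_provider):
            try:
                return call(prompt)
            except RuntimeError as err:       # stand-in for an upstream LLM error
                last_error = err
                time.sleep(backoff_s * (attempt + 1))  # simple linear backoff
    raise RuntimeError("all providers failed") from last_error

# Stand-in providers: the first always fails, the second succeeds.
def flaky_provider(prompt):
    raise RuntimeError("upstream 503")

def healthy_provider(prompt):
    return f"echo: {prompt}"
```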

**Front/edge Gateway:** There is substantial benefit in using the same software at the edge (observability,
traffic-shaping algorithms, applying guardrails, etc.) as for outbound LLM inference use cases. Arch has the
feature set that makes it exceptionally well suited as an edge gateway for AI applications. This includes TLS
termination, rate limiting, and prompt-based routing.

**Best-In Class Monitoring:** Arch offers several monitoring metrics that help you understand three critical
aspects of your application: latency, token usage, and error rates by upstream LLM provider. Latency measures
the speed at which your application is responding to users, and includes metrics like time to first token (TFT),
time per output token (TOT), and the total latency as perceived by users.
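
Illustratively, these latency metrics can be derived from a request start time and per-token arrival timestamps.
This is a sketch of the definitions only, not Arch's metrics pipeline.

```python
def latency_metrics(request_start: float, token_times: list[float]) -> dict:
    """Compute TFT, average TOT, and total latency from token arrival times."""
    if not token_times:
        raise ValueError("no tokens received")
    tft = token_times[0] - request_start      # time to first token
    total = token_times[-1] - request_start   # total latency perceived by the user
    # Average inter-token gap; only defined once more than one token arrived.
    gaps = len(token_times) - 1
    tot = (token_times[-1] - token_times[0]) / gaps if gaps else 0.0
    return {"tft_s": tft, "tot_s": tot, "total_s": total}
```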

**End-to-End Tracing:** Arch propagates trace context using the W3C Trace Context standard, specifically through
the ``traceparent`` header. This allows each component in the system to record its part of the request flow,
enabling end-to-end tracing across the entire application. By using OpenTelemetry, Arch ensures that developers
can capture this trace data consistently and in a format compatible with various observability tools. For more
details, read :ref:`tracing <arch_overview_tracing>`.
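
For reference, a version-00 ``traceparent`` header has the shape
``00-<32-hex trace-id>-<16-hex parent-id>-<2-hex flags>``. The helpers below build and parse one per the W3C
Trace Context specification; they are illustrative, not Arch's code.

```python
import secrets

def make_traceparent(sampled: bool = True) -> str:
    """Build a W3C Trace Context `traceparent` header (version 00)."""
    trace_id = secrets.token_hex(16)   # 16 random bytes -> 32 hex chars
    parent_id = secrets.token_hex(8)   # 8 random bytes  -> 16 hex chars
    flags = "01" if sampled else "00"  # bit 0 = sampled
    return f"00-{trace_id}-{parent_id}-{flags}"

def parse_traceparent(header: str) -> dict:
    version, trace_id, parent_id, flags = header.split("-")
    return {"version": version, "trace_id": trace_id,
            "parent_id": parent_id, "sampled": flags == "01"}
```

Each hop generates a fresh parent-id while preserving the trace-id, which is what lets OpenTelemetry stitch the
spans into one end-to-end trace.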