mirror of
https://github.com/katanemo/plano.git
synced 2026-06-20 15:28:07 +02:00
Doc Update (#129)
* init update * Update terminology.rst * fix the branch to create an index.html, and fix pre-commit issues * Doc update * made several changes to the docs after Shuguang's revision * fixing pre-commit issues * fixed the reference file to the final prompt config file * added google analytics --------- Co-authored-by: Salman Paracha <salmanparacha@MacBook-Pro-261.local>
This commit is contained in:
parent
2a7b95582c
commit
5c7567584d
49 changed files with 1185 additions and 609 deletions
71
docs/source/concepts/includes/arch_config.yaml
Normal file
71
docs/source/concepts/includes/arch_config.yaml
Normal file
|
|
@ -0,0 +1,71 @@
|
|||
version: "0.1-beta"
|
||||
|
||||
listener:
|
||||
address: 0.0.0.0 # or 127.0.0.1
|
||||
port: 10000
|
||||
# Defines how Arch should parse the content from application/json or text/pain Content-type in the http request
|
||||
message_format: huggingface
|
||||
|
||||
# Centralized way to manage LLMs, manage keys, retry logic, failover and limits in a central way
|
||||
llm_providers:
|
||||
- name: "OpenAI"
|
||||
provider: "openai"
|
||||
access_key: $OPENAI_API_KEY
|
||||
model: gpt-4o
|
||||
default: true
|
||||
stream: true
|
||||
|
||||
# default system prompt used by all prompt targets
|
||||
system_prompt: |
|
||||
You are a network assistant that just offers facts; not advice on manufacturers or purchasing decisions.
|
||||
|
||||
prompt_guards:
|
||||
input_guards:
|
||||
jailbreak:
|
||||
on_exception:
|
||||
message: "Looks like you're curious about my abilities, but I can only provide assistance within my programmed parameters."
|
||||
|
||||
prompt_targets:
|
||||
- name: "reboot_network_device"
|
||||
description: "Helps network operators perform device operations like rebooting a device."
|
||||
endpoint:
|
||||
name: app_server
|
||||
path: "/agent/action"
|
||||
parameters:
|
||||
- name: "device_id"
|
||||
# additional type options include: int | float | bool | string | list | dict
|
||||
type: "string"
|
||||
description: "Identifier of the network device to reboot."
|
||||
required: true
|
||||
- name: "confirmation"
|
||||
type: "string"
|
||||
description: "Confirmation flag to proceed with reboot."
|
||||
default: "no"
|
||||
enum: [yes, no]
|
||||
|
||||
- name: "information_extraction"
|
||||
default: true
|
||||
description: "This prompt handles all scenarios that are question and answer in nature. Like summarization, information extraction, etc."
|
||||
endpoint:
|
||||
name: app_server
|
||||
path: "/agent/summary"
|
||||
# Arch uses the default LLM and treats the response from the endpoint as the prompt to send to the LLM
|
||||
auto_llm_dispatch_on_response: true
|
||||
# override system prompt for this prompt target
|
||||
system_prompt: |
|
||||
You are a helpful information extraction assistant. Use the information that is provided to you.
|
||||
|
||||
error_target:
|
||||
endpoint:
|
||||
name: error_target_1
|
||||
path: /error
|
||||
|
||||
# Arch creates a round-robin load balancing between different endpoints, managed via the cluster subsystem.
|
||||
endpoints:
|
||||
app_server:
|
||||
# value could be ip address or a hostname with port
|
||||
# this could also be a list of endpoints for load balancing
|
||||
# for example endpoint: [ ip1:port, ip2:port ]
|
||||
endpoint: "127.0.0.1:80"
|
||||
# max time to wait for a connection to be established
|
||||
connect_timeout: 0.005s
|
||||
53
docs/source/concepts/llm_provider.rst
Normal file
53
docs/source/concepts/llm_provider.rst
Normal file
|
|
@ -0,0 +1,53 @@
|
|||
.. _llm_provider:
|
||||
|
||||
LLM Provider
|
||||
============
|
||||
|
||||
``llm_provider`` is a top-level primitive in Arch, helping developers centrally define, secure, observe,
|
||||
and manage the usage of of their LLMs. Arch builds on Envoy's reliable `cluster subsystem <https://www.envoyproxy.io/docs/envoy/v1.31.2/intro/arch_overview/upstream/cluster_manager>`_
|
||||
to manage egress traffic to LLMs, which includes intelligent routing, retry and fail-over mechanisms,
|
||||
ensuring high availability and fault tolerance. This abstraction also enables developers to seamlessly
|
||||
switching between LLM providers or upgrade LLM versions, simplifying the integration and scaling of LLMs
|
||||
across applications.
|
||||
|
||||
|
||||
Below is an example of how you can configure ``llm_providers`` with an instance of an Arch gateway.
|
||||
|
||||
.. literalinclude:: includes/arch_config.yaml
|
||||
:language: yaml
|
||||
:linenos:
|
||||
:lines: 1-20
|
||||
:emphasize-lines: 10-16
|
||||
:caption: Example Configuration
|
||||
|
||||
.. Note::
|
||||
When you start Arch, it creates a listener port for egress traffic based on the presence of ``llm_providers``
|
||||
configuration section in the ``prompt_config.yml`` file. Arch binds itself to a local address such as
|
||||
``127.0.0.1:51001/v1``.
|
||||
|
||||
Arch also offers vendor-agnostic SDKs and libraries to make LLM calls to API-based LLM providers (like OpenAI,
|
||||
Anthropic, Mistral, Cohere, etc.) and supports calls to OSS LLMs that are hosted on your infrastructure. Arch
|
||||
abstracts the complexities of integrating with different LLM providers, providing a unified interface for making
|
||||
calls, handling retries, managing rate limits, and ensuring seamless integration with cloud-based and on-premise
|
||||
LLMs. Simply configure the details of the LLMs your application will use, and Arch offers a unified interface to
|
||||
make outbound LLM calls.
|
||||
|
||||
Example: Using the OpenAI Python SDK
|
||||
------------------------------------
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
from openai import OpenAI
|
||||
|
||||
# Initialize the Arch client
|
||||
client = OpenAI(base_url="http://127.0.0.1:51001/v1")
|
||||
|
||||
# Define your LLM provider and prompt
|
||||
llm_provider = "openai"
|
||||
prompt = "What is the capital of France?"
|
||||
|
||||
# Send the prompt to the LLM through Arch
|
||||
response = client.completions.create(llm_provider=llm_provider, prompt=prompt)
|
||||
|
||||
# Print the response
|
||||
print("LLM Response:", response)
|
||||
126
docs/source/concepts/prompt_target.rst
Normal file
126
docs/source/concepts/prompt_target.rst
Normal file
|
|
@ -0,0 +1,126 @@
|
|||
Prompt Target
|
||||
==============
|
||||
|
||||
**Prompt Targets** are a fundamental component of Arch, enabling developers to define how different types of user prompts are processed and routed within their generative AI applications.
|
||||
This section provides an in-depth look at prompt targets, including their purpose, configuration, usage, and best practices to help you effectively leverage this feature in your projects.
|
||||
|
||||
What Are Prompt Targets?
|
||||
------------------------
|
||||
Prompt targets are predefined endpoints within Arch that handle specific types of user prompts.
|
||||
They act as the bridge between user inputs and your backend services or APIs, enabling Arch to route, process, and manage prompts efficiently.
|
||||
By defining prompt targets, you can separate your application's business logic from the complexities of prompt processing, ensuring a cleaner and more maintainable codebase.
|
||||
|
||||
|
||||
.. table::
|
||||
:width: 100%
|
||||
|
||||
==================== ============================================
|
||||
**Capability** **Description**
|
||||
==================== ============================================
|
||||
Intent Recognition Identify the purpose of a user prompt.
|
||||
Parameter Extraction Extract necessary data from the prompt.
|
||||
API Invocation Call relevant backend services or functions.
|
||||
Response Handling Process and return responses to the user.
|
||||
==================== ============================================
|
||||
|
||||
Key Features
|
||||
~~~~~~~~~~~~
|
||||
|
||||
Below are the key features of prompt targets that empower developers to build efficient, scalable, and personalized GenAI solutions:
|
||||
|
||||
- **Modular Design**: Define multiple prompt targets to handle diverse functionalities.
|
||||
- **Parameter Management**: Specify required and optional parameters for each target.
|
||||
- **Function Integration**: Seamlessly connect prompts to backend APIs or functions.
|
||||
- **Error Handling**: Direct errors to designated handlers for streamlined troubleshooting.
|
||||
- **Metadata Enrichment**: Attach additional context to prompts for enhanced processing.
|
||||
|
||||
Configuring Prompt Targets
|
||||
--------------------------
|
||||
Configuring prompt targets involves defining them in Arch's configuration file.
|
||||
Each Prompt target specifies how a particular type of prompt should be handled, including the endpoint to invoke and any parameters required.
|
||||
|
||||
Basic Configuration
|
||||
~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
A prompt target configuration includes the following elements:
|
||||
|
||||
.. vale Vale.Spelling = NO
|
||||
|
||||
- ``name``: A unique identifier for the prompt target.
|
||||
- ``description``: A brief explanation of what the prompt target does.
|
||||
- ``endpoint``: The API endpoint or function that handles the prompt.
|
||||
- ``parameters`` (Optional): A list of parameters to extract from the prompt.
|
||||
|
||||
Defining Parameters
|
||||
~~~~~~~~~~~~~~~~~~~
|
||||
Parameters are the pieces of information that Arch needs to extract from the user's prompt to perform the desired action.
|
||||
Each parameter can be marked as required or optional.
|
||||
Here is a full list of parameter attributes that Arch can support:
|
||||
|
||||
.. table::
|
||||
:width: 100%
|
||||
|
||||
==================== ============================================================================
|
||||
**Attribute** **Description**
|
||||
==================== ============================================================================
|
||||
``name`` Specifies identifier of parameters
|
||||
``type`` Specifies the data type of the parameter.
|
||||
``description`` Provides a human-readable explanation of the parameter's purpose.
|
||||
``required`` Indicates whether the parameter is mandatory or optional
|
||||
``default`` Specifies a default value for the parameter if not provided by the user.
|
||||
``items`` Used in the context of arrays to define the schema of items within an array.
|
||||
``format`` Specifies a format for the parameter value, e.g., date and email
|
||||
``enum`` Lists the allowable values for the parameter.
|
||||
``minimum`` Defines the minimum acceptable value for numeric parameters.
|
||||
``maximum`` Specifies the maximum acceptable value for numeric parameters.
|
||||
==================== ============================================================================
|
||||
|
||||
Example Configuration
|
||||
~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
prompt_targets:
|
||||
- name: get_weather
|
||||
description: Get the current weather for a location
|
||||
parameters:
|
||||
- name: location
|
||||
description: The city and state, e.g. San Francisco, New York
|
||||
type: str
|
||||
required: true
|
||||
- name: unit
|
||||
description: The unit of temperature to return
|
||||
type: str
|
||||
enum: ["celsius", "fahrenheit"]
|
||||
endpoint:
|
||||
name: api_server
|
||||
path: /weather
|
||||
|
||||
|
||||
Routing Logic
|
||||
-------------
|
||||
Prompt targets determine where and how user prompts are processed.
|
||||
Arch uses intelligent routing logic to ensure that prompts are directed to the appropriate targets based on their intent and context.
|
||||
|
||||
Default Targets
|
||||
~~~~~~~~~~~~~~~
|
||||
For general-purpose prompts that do not match any specific prompt target, Arch routes them to a designated default target.
|
||||
This is useful for handling open-ended queries like document summarization or information extraction.
|
||||
|
||||
Intent Matching
|
||||
~~~~~~~~~~~~~~~
|
||||
Arch analyzes the user's prompt to determine its intent and matches it with the most suitable prompt target based on the name and description defined in the configuration.
|
||||
|
||||
For example:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
Prompt: "Can you reboot the router?"
|
||||
Matching Target: reboot_device (based on description matching "reboot devices")
|
||||
|
||||
|
||||
Summary
|
||||
--------
|
||||
Prompt targets are essential for defining how user prompts are handled within your generative AI applications using Arch.
|
||||
By carefully configuring prompt targets, you can ensure that prompts are accurately routed, necessary parameters are extracted, and backend services are invoked seamlessly.
|
||||
This modular approach not only simplifies your application's architecture but also enhances scalability, maintainability, and overall user experience.
|
||||
37
docs/source/concepts/tech_overview/listener.rst
Normal file
37
docs/source/concepts/tech_overview/listener.rst
Normal file
|
|
@ -0,0 +1,37 @@
|
|||
.. _arch_overview_listeners:
|
||||
|
||||
Listener
|
||||
---------
|
||||
Listener is a top level primitive in Arch, which simplifies the configuration required to bind incoming
|
||||
connections from downstream clients, and for egress connections to LLMs (hosted or API)
|
||||
|
||||
Arch builds on Envoy's Listener subsystem to streamline connection managemet for developers. Arch minimizes
|
||||
the complexity of Envoy's listener setup by using best-practices and exposing only essential settings,
|
||||
making it easier for developers to bind connections without deep knowledge of Envoy’s configuration model. This
|
||||
simplification ensures that connections are secure, reliable, and optimized for performance.
|
||||
|
||||
Downstream (Ingress)
|
||||
^^^^^^^^^^^^^^^^^^^^^^
|
||||
Developers can configure Arch to accept connections from downstream clients. A downstream listener acts as the
|
||||
primary entry point for incoming traffic, handling initial connection setup, including network filtering, gurdrails,
|
||||
and additional network security checks. For more details on prompt security and safety,
|
||||
see :ref:`here <arch_overview_prompt_handling>`
|
||||
|
||||
Upstream (Egress)
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
Arch automatically configures a listener to route requests from your application to upstream LLM API providers (or hosts).
|
||||
When you start Arch, it creates a listener for egress traffic based on the presence of the ``llm_providers`` configuration
|
||||
section in the ``prompt_config.yml`` file. Arch binds itself to a local address such as ``127.0.0.1:9000/v1`` or a DNS-based
|
||||
address like ``arch.local:9000/v1`` for outgoing traffic. For more details on LLM providers, read :ref:`here <llm_provider>`
|
||||
|
||||
Configure Listener
|
||||
^^^^^^^^^^^^^^^^^^
|
||||
|
||||
To configure a Downstream (Ingress) Listner, simply add the ``listener`` directive to your ``prompt_config.yml`` file:
|
||||
|
||||
.. literalinclude:: ../includes/arch_config.yaml
|
||||
:language: yaml
|
||||
:linenos:
|
||||
:lines: 1-18
|
||||
:emphasize-lines: 2-5
|
||||
:caption: Example Configuration
|
||||
56
docs/source/concepts/tech_overview/model_serving.rst
Normal file
56
docs/source/concepts/tech_overview/model_serving.rst
Normal file
|
|
@ -0,0 +1,56 @@
|
|||
.. _arch_model_serving:
|
||||
|
||||
Model Serving
|
||||
-------------
|
||||
|
||||
Arch is a set of **two** self-contained processes that are designed to run alongside your application
|
||||
servers (or on a separate host connected via a network). The first process is designated to manage low-level
|
||||
networking and HTTP related comcerns, and the other process is for **model serving**, which helps Arch make
|
||||
intelligent decisions about the incoming prompts. The model server is designed to call the purpose-built
|
||||
LLMs in Arch.
|
||||
|
||||
.. image:: /_static/img/arch-system-architecture.jpg
|
||||
:align: center
|
||||
:width: 50%
|
||||
|
||||
_____________________________________________________________________________________________________________
|
||||
|
||||
Arch' is designed to be deployed in your cloud VPC, on a on-premises host, and can work on devices that don't
|
||||
have a GPU. Note, GPU devices are need for fast and cost-efficient use, so that Arch (model server, specifically)
|
||||
can process prompts quickly and forward control back to the applicaton host. There are three modes in which Arch
|
||||
can be configured to run its **model server** subsystem:
|
||||
|
||||
Local Serving (CPU - Moderate)
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
The following bash commands enable you to configure the model server subsystem in Arch to run local on device
|
||||
and only use CPU devices. This will be the slowest option but can be useful in dev/test scenarios where GPUs
|
||||
might not be available.
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
$ archgw up --local-cpu
|
||||
|
||||
Local Serving (GPU- Fast)
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
The following bash commands enable you to configure the model server subsystem in Arch to run locally on the
|
||||
machine and utilize the GPU available for fast inference across all model use cases, including function calling
|
||||
guardails, etc.
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
$ archgw up --local
|
||||
|
||||
Cloud Serving (GPU - Blazing Fast)
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
The command below instructs Arch to intelligently use GPUs locally for fast intent detection, but default to
|
||||
cloud serving for function calling and guardails scenarios to dramatically improve the speed and overall performance
|
||||
of your applications.
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
$ archgw up
|
||||
|
||||
.. Note::
|
||||
Arch's model serving in the cloud is priced at $0.05M/token (156x cheaper than GPT-4o) with averlage latency
|
||||
of 200ms (10x faster than GPT-4o). Please refer to our :ref:`Get Started <quickstart>` to know
|
||||
how to generate API keys for model serving
|
||||
136
docs/source/concepts/tech_overview/prompt.rst
Normal file
136
docs/source/concepts/tech_overview/prompt.rst
Normal file
|
|
@ -0,0 +1,136 @@
|
|||
.. _arch_overview_prompt_handling:
|
||||
|
||||
Prompt
|
||||
=================
|
||||
|
||||
Arch's primary design point is to securely accept, process and handle prompts. To do that effectively,
|
||||
Arch relies on Envoy's HTTP `connection management <https://www.envoyproxy.io/docs/envoy/v1.31.2/intro/arch_overview/http/http_connection_management>`_,
|
||||
subsystem and its **prompt handler** subsystem engineered with purpose-built LLMs to
|
||||
implement critical functionality on behalf of developers so that you can stay focused on business logic.
|
||||
|
||||
.. Note::
|
||||
Arch's **prompt handler** subsystem interacts with the **model** subsytem through Envoy's cluster manager
|
||||
system to ensure robust, resilient and fault-tolerant experience in managing incoming prompts. Read more
|
||||
about the :ref:`model subsystem <arch_model_serving>` and how the LLMs are hosted in Arch.
|
||||
|
||||
Messages
|
||||
--------
|
||||
|
||||
Arch accepts messages directly from the body of the HTTP request in a format that follows the `Hugging Face Messages API <https://huggingface.co/docs/text-generation-inference/en/messages_api>`_.
|
||||
This design allows developers to pass a list of messages, where each message is represented as a dictionary
|
||||
containing two key-value pairs:
|
||||
|
||||
- **Role**: Defines the role of the message sender, such as "user" or "assistant".
|
||||
- **Content**: Contains the actual text of the message.
|
||||
|
||||
|
||||
Prompt Guardrails
|
||||
-----------------
|
||||
|
||||
Arch is engineered with :ref:`Arch-Guard <prompt_guard>`, an industry leading safety layer, powered by a
|
||||
compact and high-performimg LLM that monitors incoming prompts to detect and reject jailbreak attempts -
|
||||
ensuring that unauthorized or harmful behaviors are intercepted early in the process.
|
||||
|
||||
To add jailbreak guardrails, see example below:
|
||||
|
||||
.. literalinclude:: ../includes/arch_config.yaml
|
||||
:language: yaml
|
||||
:linenos:
|
||||
:lines: 1-45
|
||||
:emphasize-lines: 22-26
|
||||
:caption: Example Configuration
|
||||
|
||||
.. Note::
|
||||
As a roadmap item, Arch will expose the ability for developers to define custom guardrails via Arch-Guard-v2,
|
||||
and add support for additional safety checks defined by developers and hazardous categories like, violent crimes, privacy, hate,
|
||||
etc. To offer feedback on our roadmap, please visit our `github page <https://github.com/orgs/katanemo/projects/1>`_
|
||||
|
||||
|
||||
Prompt Targets
|
||||
--------------
|
||||
|
||||
Once a prompt passes any configured guardrail checks, Arch processes the contents of the incoming conversation
|
||||
and identifies where to forwad the conversation to via its ``prompt_targets`` primitve. Prompt targets are endpoints
|
||||
that receive prompts that are processed by Arch. For example, Arch enriches incoming prompts with metadata like knowing
|
||||
when a user's intent has changed so that you can build faster, more accurate RAG apps.
|
||||
|
||||
Configuring ``prompt_targets`` is simple. See example below:
|
||||
|
||||
.. literalinclude:: ../includes/arch_config.yaml
|
||||
:language: yaml
|
||||
:linenos:
|
||||
:emphasize-lines: 29-38
|
||||
:caption: Example Configuration
|
||||
|
||||
|
||||
Intent Detection and Prompt Matching:
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
Arch uses fast Natural Language Inference (NLI) and embedding approaches to first detect the intent of each
|
||||
incoming prompt. This intent detection phase analyzes the prompt's content and matches it against predefined
|
||||
prompt targets, ensuring that each prompt is forwarded to the most appropriate endpoint. Arch’s intent
|
||||
detection framework considers both the name and description of each prompt target, and uses a composite matching
|
||||
score between an NLI and cosine similarity to enchance accuracy in forwarding decisions.
|
||||
|
||||
- **Embeddings**: By embedding the prompt and comparing it to known target vectors, Arch effectively identifies
|
||||
the closest match, ensuring that the prompt is handled by the correct downstream service.
|
||||
|
||||
- **NLI**: NLI techniques further refine the matching process by evaluating the semantic alignment between the
|
||||
prompt and potential targets.
|
||||
|
||||
Agentic Apps via Prompt Targets
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
To support agentic apps, like scheduling travel plans or sharing comments on a document - via prompts, Arch uses
|
||||
its function calling abilities to extract critical information from the incoming prompt (or a set of prompts)
|
||||
needed by a downstream backend API or function call before calling it directly. For more details on how you can
|
||||
build agentic applications using Arch, see our full guide :ref:`here <arch_agent_guide>`:
|
||||
|
||||
.. Note::
|
||||
Arch :ref:`Arch-Function <function_calling>` is the dedicated agentic model engineered in Arch to extract information from
|
||||
a (set of) prompts and executes necessary backend API calls. This allows for efficient handling of agentic tasks,
|
||||
such as scheduling data retrieval, by dynamically interacting with backend services. Arch-Function is a flagship 1.3
|
||||
billion parameter model that matches performance with frontier models like Claude Sonnet 3.5 ang GPT-4, while
|
||||
being 100x cheaper ($0.05M/token hosted) and 10x faster (p50 latencies of 200ms).
|
||||
|
||||
Prompting LLMs
|
||||
--------------
|
||||
Arch is a single piece of software that is designed to manage both ingress and egress prompt traffic, drawing its
|
||||
distributed proxy nature from the robust `Envoy <https://envoyproxy.io>`_. This makes it extremely efficient and capable
|
||||
of handling upstream connections to LLMs. If your application is originating code to an API-based LLM, simply use
|
||||
the OpenAI client and configure it with Arch. By sending traffic through Arch, you can propagate traces, manage and monitor
|
||||
traffic, apply rate limits, and utilize a large set of traffic management capabilities in a centralized way.
|
||||
|
||||
.. Attention::
|
||||
When you start Arch, it automatically creates a listener port for egress calls to upstream LLMs. This is based on the
|
||||
``llm_providers`` configuration section in the ``prompt_config.yml`` file. Arch binds itself to a local address such as
|
||||
127.0.0.1:51001/v1.
|
||||
|
||||
|
||||
Example: Using OpenAI Client with Arch as an Egress Gateway
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
import openai
|
||||
|
||||
# Set the OpenAI API base URL to the Arch gateway endpoint
|
||||
openai.api_base = "http://127.0.0.1:51001/v1"
|
||||
|
||||
# No need to set openai.api_key since it's configured in Arch's gateway
|
||||
|
||||
# Use the OpenAI client as usual
|
||||
response = openai.Completion.create(
|
||||
model="text-davinci-003",
|
||||
prompt="What is the capital of France?"
|
||||
)
|
||||
|
||||
print("OpenAI Response:", response.choices[0].text.strip())
|
||||
|
||||
In these examples:
|
||||
|
||||
The OpenAI client is used to send traffic directly through the Arch egress proxy to the LLM of your choice, such as OpenAI.
|
||||
The OpenAI client is configured to route traffic via Arch by setting the proxy to 127.0.0.1:51001, assuming Arch is
|
||||
running locally and bound to that address and port.
|
||||
|
||||
This setup allows you to take advantage of Arch's advanced traffic management features while interacting with LLM APIs like OpenAI.
|
||||
176
docs/source/concepts/tech_overview/request_lifecycle.rst
Normal file
176
docs/source/concepts/tech_overview/request_lifecycle.rst
Normal file
|
|
@ -0,0 +1,176 @@
|
|||
.. _lifecycle_of_a_request:
|
||||
|
||||
Request Lifecycle
|
||||
=================
|
||||
|
||||
Below we describe the events in the lifecycle of a request passing through an Arch gateway instance. We first
|
||||
describe how Arch fits into the request path and then the internal events that take place following
|
||||
the arrival of a request at Arch from downtream clients. We follow the request until the corresponding
|
||||
dispatch upstream and the response path.
|
||||
|
||||
.. image:: /_static/img/network-topology-ingress-egress.jpg
|
||||
:width: 100%
|
||||
:align: center
|
||||
|
||||
Terminology
|
||||
-----------
|
||||
|
||||
We recommend that you get familiar with some of the :ref:`terminology <arch_terminology>` used in Arch
|
||||
before reading this section.
|
||||
|
||||
Network topology
|
||||
----------------
|
||||
|
||||
How a request flows through the components in a network (including Arch) depends on the network’s topology.
|
||||
Arch can be used in a wide variety of networking topologies. We focus on the inner operation of Arch below,
|
||||
but briefly we address how Arch relates to the rest of the network in this section.
|
||||
|
||||
- **Downstream(Ingress)** listeners take requests from upstream clients like a web UI or clients that forward
|
||||
prompts to you local application responses from the application flow back through Arch to the downstream.
|
||||
|
||||
- **Upstream(Egress)** listeners take requests from the application and forward them to LLMs.
|
||||
|
||||
.. image:: /_static/img/network-topology-ingress-egress.jpg
|
||||
:width: 100%
|
||||
:align: center
|
||||
|
||||
In practice, Arch can be deployed on the edge and as an internal load balancer between AI agents. A request path may
|
||||
traverse multiple Arch gateways:
|
||||
|
||||
.. image:: /_static/img/network-topology-agent.jpg
|
||||
:width: 100%
|
||||
:align: center
|
||||
|
||||
|
||||
High level architecture
|
||||
-----------------------
|
||||
Arch is a set of **two** self-contained processes that are designed to run alongside your application servers
|
||||
(or on a separate server connected to your application servers via a network). The first process is designated
|
||||
to manage HTTP-level networking and connection management concerns (protocol management, request id generation,
|
||||
header sanitization, etc.), and the other process is for **model serving**, which helps Arch make intelligent
|
||||
decisions about the incoming prompts. The model server hosts the purpose-built LLMs to
|
||||
manage several critical, but undifferentiated, prompt related tasks on behalf of developers.
|
||||
|
||||
|
||||
The request processing path in Arch has three main parts:
|
||||
|
||||
* :ref:`Listener subsystem <arch_overview_listeners>` which handles **downstream** and **upstream** request
|
||||
processing. It is responsible for managing the downstream (ingress) and the upstream (egress) request
|
||||
lifecycle. The downstream and upstream HTTP/2 codec lives here.
|
||||
* :ref:`Prompt handler subsystem <arch_overview_prompt_handling>` which is responsible for selecting and
|
||||
forwarding prompts ``prompt_targets`` and establishes the lifecycle of any **upstream** connection to a
|
||||
hosted endpoint that implements domain-specific business logic for incoming promots. This is where knowledge
|
||||
of targets and endpoint health, load balancing and connection pooling exists.
|
||||
* :ref:`Model serving subsystem <arch_model_serving>` which helps Arch make intelligent decisions about the
|
||||
incoming prompts. The model server is designed to call the purpose-built LLMs in Arch.
|
||||
|
||||
The three subsystems are bridged with either the HTTP router filter, and the cluster manager subsystems of Envoy.
|
||||
|
||||
Also, Arch utilizes `Envoy event-based thread model <https://blog.envoyproxy.io/envoy-threading-model-a8d44b922310>`_.
|
||||
A main thread is responsible forthe server lifecycle, configuration processing, stats, etc. and some number of
|
||||
:ref:`worker threads <arch_overview_threading>` process requests. All threads operate around an event loop (`libevent <https://libevent.org/>`_)
|
||||
and any given downstream TCP connection will be handled by exactly one worker thread for its lifetime. Each worker
|
||||
thread maintains its own pool of TCP connections to upstream endpoints.
|
||||
|
||||
Worker threads rarely share state and operate in a trivially parallel fashion. This threading model
|
||||
enables scaling to very high core count CPUs.
|
||||
|
||||
Configuration
|
||||
-------------
|
||||
|
||||
Today, only support a static bootstrap configuration file for simplicity today:
|
||||
|
||||
.. literalinclude:: ../includes/arch_config.yaml
|
||||
:language: yaml
|
||||
|
||||
|
||||
Request Flow (Ingress)
|
||||
----------------------
|
||||
|
||||
Overview
|
||||
^^^^^^^^
|
||||
A brief outline of the lifecycle of a request and response using the example configuration above:
|
||||
|
||||
1. **TCP Connection Establishment**:
|
||||
A TCP connection from downstream is accepted by an Arch listener running on a worker thread.
|
||||
The listener filter chain provides SNI and other pre-TLS information. The transport socket, typically TLS,
|
||||
decrypts incoming data for processing.
|
||||
|
||||
2. **Prompt Guardrails Check**:
|
||||
Arch first checks the incoming prompts for guardrails such as jailbreak attempts. This ensures
|
||||
that harmful or unwanted behaviors are detected early in the request processing pipeline.
|
||||
|
||||
3. **Intent Matching**:
|
||||
The decrypted data stream is deframed by the HTTP/2 codec in Arch's HTTP connection manager. Arch performs
|
||||
intent matching via is **prompt-handler** subsystem using the name and description of the defined prompt targets,
|
||||
determining which endpoint should handle the prompt.
|
||||
|
||||
4. **Parameter Gathering with Arch-FC**:
|
||||
If a prompt target requires specific parameters, Arch engages Arch-FC to extract the necessary details
|
||||
from the incoming prompt(s). This process gathers the critical information needed for downstream API calls.
|
||||
|
||||
5. **API Call Execution**:
|
||||
Arch routes the prompt to the appropriate backend API or function call. If an endpoint cluster is identified,
|
||||
load balancing is performed, circuit breakers are checked, and the request is proxied to the upstream endpoint.
|
||||
|
||||
6. **Default Summarization by Upstream LLM**:
|
||||
By default, if no specific endpoint processing is needed, the prompt is sent to an upstream LLM for summarization.
|
||||
This ensures that responses are concise and relevant, enhancing user experience in RAG (Retrieval-Augmented Generation)
|
||||
and agentic applications.
|
||||
|
||||
7. **Error Handling and Forwarding**:
|
||||
Errors encountered during processing, such as failed function calls or guardrail detections, are forwarded to
|
||||
designated error targets. Error details are communicated through specific headers to the application:
|
||||
|
||||
- ``X-Function-Error-Code``: Code indicating the type of function call error.
|
||||
- ``X-Prompt-Guard-Error-Code``: Code specifying violations detected by prompt guardrails.
|
||||
- Additional headers carry messages and timestamps to aid in debugging and logging.
|
||||
|
||||
8. **Response Handling**:
|
||||
The upstream endpoint’s TLS transport socket encrypts the response, which is then proxied back downstream.
|
||||
Responses pass through HTTP filters in reverse order, ensuring any necessary processing or modification before final delivery.
|
||||
|
||||
|
||||
Request Flow (Egress)
|
||||
---------------------
|
||||
|
||||
Overview
|
||||
--------
|
||||
|
||||
A brief outline of the lifecycle of a request and response in the context of egress traffic from an application
|
||||
to Large Language Models (LLMs) via Arch:
|
||||
|
||||
1. **HTTP Connection Establishment to LLM**:
|
||||
Arch initiates an HTTP connection to the upstream LLM service. This connection is handled by Arch’s egress listener
|
||||
running on a worker thread. The connection typically uses a secure transport protocol such as HTTPS, ensuring the
|
||||
prompt data is encrypted before being sent to the LLM service.
|
||||
|
||||
2. **Rate Limiting**:
|
||||
Before sending the request to the LLM, Arch applies rate-limiting policies to ensure that the upstream LLM service
|
||||
is not overwhelmed by excessive traffic. Rate limits are enforced per client or service, ensuring fair usage and
|
||||
preventing accidental or malicious overload. If the rate limit is exceeded, Arch may return an appropriate HTTP
|
||||
error (e.g., 429 Too Many Requests) without sending the prompt to the LLM.
|
||||
|
||||
3. **Load Balancing to (hosted) LLM Endpoints**:
|
||||
After passing the rate-limiting checks, Arch routes the prompt to the appropriate LLM endpoint.
|
||||
If multiple LLM providers instances are available, load balancing is performed to distribute traffic evenly
|
||||
across the instances. Arch checks the health of the LLM endpoints using circuit breakers and health checks,
|
||||
ensuring that the prompt is only routed to a healthy, responsive instance.
|
||||
|
||||
4. **Response Reception and Forwarding**:
|
||||
Once the LLM processes the prompt, Arch receives the response from the LLM service. The response is typically a
|
||||
generated text, completion, or summarization. Upon reception, Arch decrypts (if necessary) and handles the response,
|
||||
passing it through any egress processing pipeline defined by the application, such as logging or additional response filtering.
|
||||
|
||||
|
||||
Post-request processing
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
Once a request completes, the stream is destroyed. The following also takes places:
|
||||
|
||||
* The post-request :ref:`monitoring <monitoring>` are updated (e.g. timing, active requests, upgrades, health checks).
|
||||
Some statistics are updated earlier however, during request processing. Stats are batchedand written by the main
|
||||
thread periodically.
|
||||
* :ref:`Access logs <arch_access_logging>` are written to the access log
|
||||
* :ref:`Trace <arch_overview_tracing>` spans are finalized. If our example request was traced, a
|
||||
trace span, describing the duration and details of the request would be created by the HCM when
|
||||
processing request headers and then finalized by the HCM during post-request processing.
|
||||
14
docs/source/concepts/tech_overview/tech_overview.rst
Normal file
14
docs/source/concepts/tech_overview/tech_overview.rst
Normal file
|
|
@ -0,0 +1,14 @@
|
|||
.. _tech_overview:
|
||||
|
||||
Tech Overview
|
||||
=============
|
||||
|
||||
.. toctree::
|
||||
:maxdepth: 2
|
||||
|
||||
terminology
|
||||
threading_model
|
||||
listener
|
||||
model_serving
|
||||
prompt
|
||||
request_lifecycle
|
||||
46
docs/source/concepts/tech_overview/terminology.rst
Normal file
46
docs/source/concepts/tech_overview/terminology.rst
Normal file
|
|
@ -0,0 +1,46 @@
|
|||
.. _arch_terminology:
|
||||
|
||||
Terminology
|
||||
============
|
||||
|
||||
A few definitions before we dive into the main architecture documentation. Arch borrows from Envoy's terminology
|
||||
to keep things consistent in logs, traces and in code.
|
||||
|
||||
**Downstream(Ingress)**: An downstream client (web application, etc.) connects to Arch, sends prompts, and receives responses.
|
||||
|
||||
**Upstream(Egress)**: An upstream host that receives connections and prompts from Arch, and returns context or responses for a prompt
|
||||
|
||||
.. image:: /_static/img/network-topology-ingress-egress.jpg
|
||||
:width: 100%
|
||||
:align: center
|
||||
|
||||
**Listener**: A listener is a named network location (e.g., port, address, path etc.) that Arch listens on to process prompts
|
||||
before forwarding them to your application server endpoints. rch enables you to configure one listener for downstream connections
|
||||
(like port 80, 443) and creates a separate internal listener for calls that initiate from your application code to LLMs.
|
||||
|
||||
.. Note::
|
||||
|
||||
When you start Arch, you specify a listener address/port that you want to bind downstream. But, Arch uses are predefined port
|
||||
that you can use (``127.0.0.1:10000``) to proxy egress calls originating from your application to LLMs (API-based or hosted).
|
||||
For more details, check out :ref:`LLM providers <llm_provider>`
|
||||
|
||||
**Instance**: An instance of the Arch gateway. When you start Arch it creates at most two processes. One to handle Layer 7
|
||||
networking operations (auth, tls, observability, etc) and the second process to serve models that enable it to make smart
|
||||
decisions on how to accept, handle and forward prompts. The second process is optional, as the model serving sevice could be
|
||||
hosted on a different network (an API call). But these two processes are considered a single instance of Arch.
|
||||
|
||||
**Prompt Targets**: Arch offers a primitive called ``prompt_targets`` to help separate business logic from undifferentiated
|
||||
work in building generative AI apps. Prompt targets are endpoints that receive prompts that are processed by Arch.
|
||||
For example, Arch enriches incoming prompts with metadata like knowing when a request is a follow-up or clarifying prompt
|
||||
so that you can build faster, more accurate retrieval (RAG) apps. To support agentic apps, like scheduling travel plans or
|
||||
sharing comments on a document - via prompts, Bolt uses its function calling abilities to extract critical information from
|
||||
the incoming prompt (or a set of prompts) needed by a downstream backend API or function call before calling it directly.
|
||||
|
||||
**Error Targets**: Error targets are those endpoints that receive forwarded errors from Arch when issues arise,
|
||||
such as failing to properly call a function/API, detecting violations of guardrails, or encountering other processing errors.
|
||||
These errors are communicated to the application via headers (X-Arch-[ERROR-TYPE]), allowing it to handle the errors gracefully
|
||||
and take appropriate actions.
|
||||
|
||||
**Model Serving**: Arch is a set of **two** self-contained processes that are designed to run alongside your application servers
|
||||
(or on a separate hostconnected via a network).The **model serving** process helps Arch make intelligent decisions about the
|
||||
incoming prompts. The model server is designed to call the (fast) purpose-built LLMs in Arch.
|
||||
21
docs/source/concepts/tech_overview/threading_model.rst
Normal file
21
docs/source/concepts/tech_overview/threading_model.rst
Normal file
|
|
@ -0,0 +1,21 @@
|
|||
.. _arch_overview_threading:
|
||||
|
||||
Threading Model
|
||||
===============
|
||||
|
||||
Arch builds on top of Envoy's single process with multiple threads architecture.
|
||||
|
||||
A single *primary* thread controls various sporadic coordination tasks while some number of *worker*
|
||||
threads perform filtering, and forwarding.
|
||||
|
||||
Once a connection is accepted, the connection spends the rest of its lifetime bound to a single worker
|
||||
thread. All the functionality around prompt handling from a downstream client is handled in a separate worker thread.
|
||||
This allows the majority of Arch to be largely single threaded (embarrassingly parallel) with a small amount
|
||||
of more complex code handling coordination between the worker threads.
|
||||
|
||||
Generally Arch is written to be 100% non-blocking.
|
||||
|
||||
.. tip::
|
||||
|
||||
For most workloads we recommend configuring the number of worker threads to be equal to the number of
|
||||
hardware threads on the machine.
|
||||
Loading…
Add table
Add a link
Reference in a new issue