mirror of
https://github.com/katanemo/plano.git
synced 2026-05-02 20:32:42 +02:00
Salmanap/docs v1 push (#92)
* updated model serving, updated the config references, architecture docs and added the llm_provider section
* several documentation changes to improve sections like life_of_a_request, model serving subsystem

Co-authored-by: Salman Paracha <salmanparacha@MacBook-Pro-261.local>
This commit is contained in:
parent
8a4e11077c
commit
7168b14ed3
19 changed files with 375 additions and 119 deletions
@@ -8,3 +8,5 @@ Technical Architecture

   intro/threading_model
   listeners/listeners
   prompt_processing/prompt_processing
   listeners/llm_provider
   model_serving/model_serving
@@ -1,12 +1,14 @@

.. _arch_terminology:

Terminology
============

A few definitions before we dive into the main architecture documentation. Arch borrows from Envoy's terminology
to keep things consistent in logs, traces and in code.

**Downstream (Ingress)**: A downstream client (web application, etc.) connects to Arch, sends prompts, and receives responses.

**Upstream (Egress)**: An upstream host that receives connections and prompts from Arch, and returns context or responses for a prompt.

.. image:: /_static/img/network-topology-ingress-egress.jpg
   :width: 100%
@@ -18,27 +20,27 @@ before forwarding them to your application server endpoints. Arch enables you to

.. Note::

   When you start Arch, you specify a listener address/port that you want to bind downstream. But Arch uses a predefined port
   that you can use (``127.0.0.1:10000``) to proxy egress calls originating from your application to LLMs (API-based or hosted).
   For more details, check out :ref:`LLM providers <llm_providers>`.
**Instance**: An instance of the Arch gateway. When you start Arch, it creates at most two processes: one to handle Layer 7
networking operations (auth, TLS, observability, etc.) and a second process to serve the models that enable it to make smart
decisions on how to accept, handle and forward prompts. The second process is optional, as the model serving service could be
hosted on a different network (an API call), but these two processes are considered a single instance of Arch.
**System Prompt**: An initial text or message provided by the developer that Arch can use when calling a downstream LLM
to generate a response. The system prompt can be thought of as the input or query that the model
uses to generate its response. The quality and specificity of the system prompt can have a significant impact on the relevance
and accuracy of the model's response. Therefore, it is important to provide a clear and concise system prompt that accurately
conveys the user's intended message or question.
**Prompt Targets**: Arch offers a primitive called ``prompt_targets`` to help separate business logic from undifferentiated
work in building generative AI apps. Prompt targets are endpoints that receive prompts that are processed by Arch.
For example, Arch enriches incoming prompts with metadata like knowing when a request is a follow-up or clarifying prompt
so that you can build faster, more accurate retrieval (RAG) apps. To support agentic apps, like scheduling travel plans or
sharing comments on a document via prompts, Arch uses its function calling abilities to extract critical information from
the incoming prompt (or a set of prompts) needed by a downstream backend API or function call before calling it directly.
**Error Targets**: Error targets are endpoints that receive forwarded errors from Arch when issues arise,
such as failing to properly call a function/API, detecting violations of guardrails, or encountering other processing errors.
These errors are communicated to the application via headers (``X-Arch-[ERROR-TYPE]``), allowing it to handle the errors gracefully
and take appropriate actions.
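Since error details arrive as response headers, an application can branch on them. Below is a minimal, hypothetical sketch: the specific error-type name (``GUARDRAIL``) is illustrative, as the docs only specify the ``X-Arch-[ERROR-TYPE]`` header pattern.

```python
def arch_errors(headers: dict) -> dict:
    """Collect any X-Arch-* error headers so the app can handle them explicitly."""
    return {k: v for k, v in headers.items() if k.upper().startswith("X-ARCH-")}

resp_headers = {
    "Content-Type": "application/json",
    "X-Arch-GUARDRAIL": "jailbreak_detected",  # hypothetical error type and value
}
print(arch_errors(resp_headers))  # {'X-Arch-GUARDRAIL': 'jailbreak_detected'}
```

An application could, for example, show a friendly refusal message when a guardrail header is present instead of surfacing a raw error.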
**Model Serving**: Arch is a set of **two** self-contained processes that are designed to run alongside your application servers
(or on a separate host connected via a network). The **model serving** process helps Arch make intelligent decisions about
incoming prompts. The model server is designed to call the (fast) purpose-built :ref:`LLMs <llms_in_arch>` in Arch.
@@ -1,27 +1,37 @@

.. _arch_overview_listeners:

Listener
========

Listener is a top-level primitive in Arch, which simplifies the configuration required to bind incoming
connections from downstream clients, and for egress connections to LLMs (hosted or API-based).
Downstream (Ingress)
^^^^^^^^^^^^^^^^^^^^

Developers can configure Arch to accept connections from downstream clients. A downstream listener acts as the
primary entry point for incoming traffic, handling initial connection setup, including network filtering, guardrails,
and additional network security checks. For more details on prompt security and safety,
see :ref:`here <arch_overview_prompt_handling>`.
Upstream (Egress)
^^^^^^^^^^^^^^^^^

Arch automatically configures a listener to route requests from your application to upstream LLM API providers (or hosts).
When you start Arch, it creates a listener for egress traffic based on the presence of the ``llm_providers`` configuration
section in the ``prompt_config.yml`` file. Arch binds itself to a local address such as ``127.0.0.1:9000/v1`` or a DNS-based
address like ``arch.local:9000/v1`` for outgoing traffic. For more details on LLM providers, read :ref:`here <llm_providers>`.
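Because the egress listener sits on a fixed local address, an application only needs to build requests against that base URL. A minimal sketch, assuming the listener exposes a ``/completions`` route with ``llm_provider`` and ``prompt`` fields in the body (as the SDK examples elsewhere in these docs suggest):

```python
import json

# Local address Arch binds for egress traffic (from the paragraph above).
ARCH_EGRESS_BASE = "http://127.0.0.1:9000/v1"

def build_completion_request(llm_provider: str, prompt: str):
    """Build the URL and JSON body for a completion call routed through Arch.

    The route name and body fields are assumptions for illustration.
    """
    url = f"{ARCH_EGRESS_BASE}/completions"
    body = json.dumps({"llm_provider": llm_provider, "prompt": prompt})
    return url, body

url, body = build_completion_request("openai", "What is the capital of France?")
print(url)  # http://127.0.0.1:9000/v1/completions
```

Swapping LLM providers then amounts to changing the ``llm_provider`` value, with no per-provider SDK wiring in application code.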
Configure Listener
^^^^^^^^^^^^^^^^^^

To configure a Downstream (Ingress) listener, simply add the ``listener`` directive to your ``prompt_config.yml`` file:

.. literalinclude:: /_config/getting-started.yml
   :language: yaml
   :linenos:
   :lines: 1-18
   :emphasize-lines: 2-5
   :caption: :download:`arch-getting-started.yml </_config/getting-started.yml>`
52 docs/source/intro/architecture/listeners/llm_provider.rst Normal file
@@ -0,0 +1,52 @@

.. _llm_providers:

LLM Provider
------------

``llm_provider`` is a top-level primitive in Arch, helping developers centrally define, secure, observe,
and manage the usage of their LLMs. Arch builds on Envoy's reliable `cluster subsystem <https://www.envoyproxy.io/docs/envoy/v1.31.2/intro/arch_overview/upstream/cluster_manager>`_
to manage egress traffic to LLMs, which includes intelligent routing, retry and fail-over mechanisms,
ensuring high availability and fault tolerance. This abstraction also enables developers to seamlessly switch
between LLM providers or upgrade LLM versions, simplifying the integration and scaling of LLMs across
applications.
Below is an example of how you can configure ``llm_providers`` with an instance of an Arch gateway.

.. literalinclude:: /_config/getting-started.yml
   :language: yaml
   :linenos:
   :lines: 1-20
   :emphasize-lines: 11-18
   :caption: :download:`arch-getting-started.yml </_config/getting-started.yml>`

.. Note::
   When you start Arch, it creates a listener port for egress traffic based on the presence of the ``llm_providers``
   configuration section in the ``prompt_config.yml`` file. Arch binds itself to a local address such as
   ``127.0.0.1:9000/v1`` or a DNS-based address like ``arch.local:9000/v1`` for egress traffic.
Arch also offers vendor-agnostic SDKs and libraries to make LLM calls to API-based LLM providers (like OpenAI,
Anthropic, Mistral, Cohere, etc.) and supports calls to OSS LLMs that are hosted on your infrastructure. Arch
abstracts the complexities of integrating with different LLM providers, providing a unified interface for making
calls, handling retries, managing rate limits, and ensuring seamless integration with cloud-based and on-premise
LLMs. Simply configure the details of the LLMs your application will use, and Arch offers a unified interface to
make outbound LLM calls.
Example: Using the Arch Python SDK
----------------------------------

.. code-block:: python

    from arch_client import ArchClient

    # Initialize the Arch client
    client = ArchClient(base_url="http://127.0.0.1:9000/v1")

    # Define your LLM provider and prompt
    llm_provider = "openai"
    prompt = "What is the capital of France?"

    # Send the prompt to the LLM through Arch
    response = client.completions.create(llm_provider=llm_provider, prompt=prompt)

    # Print the response
    print("LLM Response:", response)
@@ -0,0 +1,56 @@

.. _arch_model_serving:

Model Serving
-------------

Arch is a set of **two** self-contained processes that are designed to run alongside your application
servers (or on a separate host connected via a network). The first process is designated to manage low-level
networking and HTTP-related concerns, and the other process is for **model serving**, which helps Arch make
intelligent decisions about the incoming prompts. The model server is designed to call the purpose-built
:ref:`LLMs <llms_in_arch>` in Arch.
.. image:: /_static/img/arch-system-architecture.jpg
   :align: center
   :width: 50%

_____________________________________________________________________________________________________________

Arch is designed to be deployed in your cloud VPC or on an on-premises host, and can work on devices that don't
have a GPU. Note: GPU devices are needed for fast and cost-efficient use, so that Arch (the model server, specifically)
can process prompts quickly and forward control back to the application host. There are three modes in which Arch
can be configured to run its **model server** subsystem:
Local Serving (CPU - Moderate)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The following bash command enables you to configure the model server subsystem in Arch to run locally on device
and only use CPU devices. This will be the slowest option but can be useful in dev/test scenarios where GPUs
might not be available.

.. code-block:: bash

    archgw up --local -cpu
Local Serving (GPU - Fast)
^^^^^^^^^^^^^^^^^^^^^^^^^^

The following bash command enables you to configure the model server subsystem in Arch to run locally on the
machine and utilize the available GPU for fast inference across all model use cases, including function calling,
guardrails, etc.

.. code-block:: bash

    archgw up --local
Cloud Serving (GPU - Blazing Fast)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The command below instructs Arch to intelligently use GPUs locally for fast intent detection, but default to
cloud serving for function calling and guardrail scenarios to dramatically improve the speed and overall performance
of your applications.

.. code-block:: bash

    archgw up

.. Note::
   Arch's model serving in the cloud is priced at $0.05 per 1M tokens (156x cheaper than GPT-4o) with average latency
   of 200ms (10x faster than GPT-4o). Please refer to our :ref:`getting started guide <getting_started>` to learn
   how to generate API keys for model serving.
@@ -1,23 +1,37 @@

.. _arch_overview_prompt_handling:

Prompts
-------

Arch's primary design point is to securely accept, process and handle prompts. To do that effectively,
Arch relies on Envoy's HTTP `connection management <https://www.envoyproxy.io/docs/envoy/v1.31.2/intro/arch_overview/http/http_connection_management>`_
subsystem and its **prompt handler** subsystem engineered with purpose-built :ref:`LLMs <llms_in_arch>` to
implement critical functionality on behalf of developers so that you can stay focused on business logic.
.. Note::
   Arch's **prompt handler** subsystem interacts with the **model** subsystem through Envoy's cluster manager
   system to ensure a robust, resilient and fault-tolerant experience in managing incoming prompts. Read more
   about the :ref:`model subsystem <arch_model_serving>` and how the LLMs are hosted in Arch.
Messages
--------

Arch accepts messages directly from the body of the HTTP request in a format that follows the `Hugging Face Messages API <https://huggingface.co/docs/text-generation-inference/en/messages_api>`_.
This design allows developers to pass a list of messages, where each message is represented as a dictionary
containing two key-value pairs:

- **Role**: Defines the role of the message sender, such as "user" or "assistant".
- **Content**: Contains the actual text of the message.
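A request body in this shape can be sketched as follows (the message text is illustrative):

```python
import json

# A messages list in the format described above: each entry is a
# dictionary with exactly a "role" and a "content" key.
messages = [
    {"role": "user", "content": "What's the weather in Seattle?"},
    {"role": "assistant", "content": "Let me check that for you."},
    {"role": "user", "content": "Thanks!"},
]

# The list travels as the JSON body of the HTTP request.
body = json.dumps({"messages": messages})
parsed = json.loads(body)
print(parsed["messages"][0]["role"])  # user
```

Multi-turn history is conveyed simply by appending more role/content entries to the list.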
Prompt Guardrails
-----------------

Arch is engineered with :ref:`Arch-Guard <llms_in_arch>`, an industry-leading safety layer, powered by a
compact and high-performing LLM that monitors incoming prompts to detect and reject jailbreak attempts,
ensuring that unauthorized or harmful behaviors are intercepted early in the process.

To add jailbreak guardrails, see the example below:

.. literalinclude:: /_config/getting-started.yml
   :language: yaml
@@ -26,9 +40,9 @@ To add prompt guardrails, see example below:

   :caption: :download:`arch-getting-started.yml </_config/getting-started.yml>`

.. Note::
   As a roadmap item, Arch will expose the ability for developers to define custom guardrails via Arch-Guard-v2,
   and add support for additional safety checks defined by developers and hazardous categories like violent crimes, privacy,
   hate, etc. To offer feedback on our roadmap, please visit our `github page <https://github.com/orgs/katanemo/projects/1>`_.
Prompt Targets
@@ -132,7 +146,6 @@ Example: Using OpenAI Client with Arch as an Egress Gateway

    print("OpenAI Response:", response.choices[0].text.strip())

In these examples, the ``ArchClient`` is used to send traffic directly through the Arch egress proxy to the LLM of your choice, such as OpenAI.