Salmanap/docs v1 push (#92)

* updated model serving, updated the config references, architecture docs and added the llm_provider section * several documentation changes to improve sections like life_of_a_request, model serving subsystem --------- Co-authored-by: Salman Paracha <salmanparacha@MacBook-Pro-261.local>
2026-05-05 05:42:49 +02:00 · 2024-09-27 15:37:49 -07:00 · 2024-09-27 15:37:49 -07:00 · 7168b14ed3
commit 7168b14ed3
parent 8a4e11077c
19 changed files with 375 additions and 119 deletions
--- a/docs/source/intro/architecture/model_serving/model_serving.rst
+++ b/docs/source/intro/architecture/model_serving/model_serving.rst
@ -0,0 +1,56 @@
+.. _arch_model_serving:
+
+Model Serving
+-------------
+
+Arch is a set of **two** self-contained processes that are designed to run alongside your application 
+servers (or on a separate host connected via a network). The first process is designated to manage low-level 
+networking and HTTP related comcerns, and the other process is for **model serving**, which helps Arch make 
+intelligent decisions about the incoming prompts. The model server is designed to call the purpose-built 
+:ref:`LLMs <llms_in_arch>` in Arch.
+
+.. image:: /_static/img/arch-system-architecture.jpg
+   :align: center
+   :width: 50%
+
+_____________________________________________________________________________________________________________
+
+Arch' is designed to be deployed in your cloud VPC, on a on-premises host, and can work on devices that don't 
+have a GPU. Note, GPU devices are need for fast and cost-efficient use, so that Arch (model server, specifically) 
+can process prompts quickly and forward control back to the applicaton host. There are three modes in which Arch 
+can be configured to run its **model server** subsystem:
+
+Local Serving (CPU - Moderate)
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+The following bash commands enable you to configure the model server subsystem in Arch to run local on device 
+and only use CPU devices. This will be the slowest option but can be useful in dev/test scenarios where GPUs 
+might not be available. 
+
+.. code-block:: bash
+
+    archgw up --local -cpu
+
+Local Serving (GPU- Fast)
+^^^^^^^^^^^^^^^^^^^^^^^^^
+The following bash commands enable you to configure the model server subsystem in Arch to run locally on the 
+machine and utilize the GPU available for fast inference across all model use cases, including function calling
+guardails, etc.
+
+.. code-block:: bash
+
+    archgw up --local 
+
+Cloud Serving (GPU - Blazing Fast)
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+The command below instructs Arch to intelligently use GPUs locally for fast intent detection, but default to 
+cloud serving for function calling and guardails scenarios to dramatically improve the speed and overall performance 
+of your applications. 
+
+.. code-block:: bash
+
+    archgw up 
+
+.. Note::
+    Arch's model serving in the cloud is priced at $0.05M/token (156x cheaper than GPT-4o) with averlage latency 
+    of 200ms (10x faster than GPT-4o). Please refer to our :ref:`getting started guide <getting_started>` to know 
+    how to generate API keys for model serving