diff --git a/README.md b/README.md index 876a237a..7b327f83 100644 --- a/README.md +++ b/README.md @@ -10,7 +10,7 @@ _Arch is a smart proxy server designed as a modular edge and AI gateway for agen [Quickstart](#Quickstart) • [Demos](#Demos) • -[Route LLMs](#Use-Arch-as-a-LLM-Router) • +[Route LLMs](#use-arch-as-a-llm-router) • [Build agentic apps with Arch](#Build-Agentic-Apps-with-Arch) • [Documentation](https://docs.archgw.com) • [Contact](#Contact) @@ -41,7 +41,7 @@ With Arch, you can move faster by focusing on higher-level objectives in a langu **Core Features**: - `🚦 Route to Agents`: Engineered with purpose-built [LLMs](https://huggingface.co/collections/katanemo/arch-function-66f209a693ea8df14317ad68) for fast (<100ms) agent routing and hand-off - - `🔗 Route to LLMs`: Unify access to LLMs with support for [dynamic routing](#Preference-based-Routing). Model aliases [coming soon](https://github.com/katanemo/archgw/issues/557) + - `🔗 Route to LLMs`: Unify access to LLMs with support for [three routing strategies](#use-arch-as-a-llm-router). - `⛨ Guardrails`: Centrally configure and prevent harmful outcomes and ensure safe user interactions - `⚡ Tools Use`: For common agentic scenarios let Arch instantly clarify and convert prompts to tools/API calls - `🕵 Observability`: W3C compatible request tracing and LLM metrics that instantly plugin with popular tools @@ -87,10 +87,10 @@ $ pip install archgw==0.3.12 ``` ### Use Arch as a LLM Router -Arch supports two primary routing strategies for LLMs: model-based routing and preference-based routing. +Arch supports three powerful routing strategies for LLMs: model-based routing, alias-based routing, and preference-based routing. Each strategy offers different levels of abstraction and control for managing your LLM infrastructure. #### Model-based Routing -Model-based routing allows you to configure static model names for routing. 
This is useful when you always want to use a specific model for certain tasks, or manually swap between models. Below an example configuration for model-based routing, and you can follow our [usage guide](demos/use_cases/README.md) on how to get working. +Model-based routing allows you to configure specific models with static routing. This is ideal when you need direct control over which models handle specific requests. Arch supports 11+ LLM providers including OpenAI, Anthropic, DeepSeek, Mistral, Groq, and more. ```yaml version: v0.1.0 @@ -103,16 +103,31 @@ listeners: timeout: 30s llm_providers: - - access_key: $OPENAI_API_KEY - model: openai/gpt-4o + - model: openai/gpt-4o + access_key: $OPENAI_API_KEY default: true - - access_key: $MISTRAL_API_KEY - model: mistral/mistral-3b-latest + - model: anthropic/claude-3-5-sonnet-20241022 + access_key: $ANTHROPIC_API_KEY + ``` -#### Preference-based Routing -Preference-based routing is designed for more dynamic and intelligent selection of models. Instead of static model names, you write plain-language routing policies that describe the type of task or preference — for example: +You can then route to specific models using any OpenAI-compatible client: + +```python +from openai import OpenAI + +client = OpenAI(base_url="http://127.0.0.1:12000/v1", api_key="test") + +# Route to specific model +response = client.chat.completions.create( + model="anthropic/claude-3-5-sonnet-20241022", + messages=[{"role": "user", "content": "Explain quantum computing"}] +) +``` + +#### Alias-based Routing +Alias-based routing lets you create semantic model names that map to underlying providers. This approach decouples your application code from specific model names, making it easy to experiment with different models or handle provider changes. 
```yaml version: v0.1.0 @@ -125,21 +140,68 @@ listeners: timeout: 30s llm_providers: - - model: openai/gpt-4.1 + - model: openai/gpt-4o access_key: $OPENAI_API_KEY - default: true - routing_preferences: - - name: code generation - description: generating new code snippets, functions, or boilerplate based on user prompts or requirements - - model: openai/gpt-4o-mini - access_key: $OPENAI_API_KEY - routing_preferences: - - name: code understanding - description: understand and explain existing code snippets, functions, or libraries + - model: anthropic/claude-3-5-sonnet-20241022 + access_key: $ANTHROPIC_API_KEY + +model_aliases: + # Model aliases - friendly names that map to actual model names + fast-model: + target: gpt-4o-mini + + reasoning-model: + target: gpt-4o + + creative-model: + target: claude-3-5-sonnet-20241022 ``` -Arch uses a lightweight 1.5B autoregressive model to map prompts (and conversation context) to these policies. This approach adapts to intent drift, supports multi-turn conversations, and avoids the brittleness of embedding-based classifiers or manual if/else chains. No retraining is required when adding new models or updating policies — routing is governed entirely by human-readable rules. You can learn more about the design, benchmarks, and methodology behind preference-based routing in our paper: +Use semantic aliases in your application code: + +```python +# Your code uses semantic names instead of provider-specific ones +response = client.chat.completions.create( + model="reasoning-model", # Routes to best available reasoning model + messages=[{"role": "user", "content": "Solve this complex problem..."}] +) +``` + +#### Preference-aligned Routing +Preference-aligned routing provides intelligent, dynamic model selection based on natural language descriptions of tasks and preferences. Instead of hardcoded routing logic, you describe what each model is good at using plain English. 
+ +```yaml +version: v0.1.0 + +listeners: + egress_traffic: + address: 0.0.0.0 + port: 12000 + message_format: openai + timeout: 30s + +llm_providers: + - model: openai/gpt-4o + access_key: $OPENAI_API_KEY + routing_preferences: + - name: complex_reasoning + description: deep analysis, mathematical problem solving, and logical reasoning + - name: creative_writing + description: storytelling, creative content, and artistic writing + + - model: deepseek/deepseek-coder + access_key: $DEEPSEEK_API_KEY + routing_preferences: + - name: code_generation + description: generating new code, writing functions, and creating scripts + - name: code_review + description: analyzing existing code for bugs, improvements, and optimization +``` + +Arch uses a lightweight 1.5B autoregressive model to intelligently map user prompts to these preferences, automatically selecting the best model for each request. This approach adapts to intent drift, supports multi-turn conversations, and avoids brittle embedding-based classifiers or manual if/else chains. No retraining required when adding models or updating policies — routing is governed entirely by human-readable rules. + +**Learn More**: Check our [documentation](https://docs.archgw.com/concepts/llm_providers/llm_providers.html) for comprehensive provider setup guides and routing strategies. You can learn more about the design, benchmarks, and methodology behind preference-based routing in our paper:
diff --git a/arch/tools/cli/config_generator.py b/arch/tools/cli/config_generator.py index 01a85095..8f0dcefd 100644 --- a/arch/tools/cli/config_generator.py +++ b/arch/tools/cli/config_generator.py @@ -17,6 +17,7 @@ SUPPORTED_PROVIDERS = [ "together_ai", "azure_openai", "xai", + "ollama", ] @@ -124,10 +125,12 @@ def validate_and_render_schema(): f"Invalid model name {model_name}. Please provide model name in the format /." ) provider = model_name_tokens[0] - # Validate azure_openai provider requires base_url - if provider == "azure_openai" and llm_provider.get("base_url") is None: + # Validate azure_openai and ollama provider requires base_url + if (provider == "azure_openai" or provider == "ollama") and llm_provider.get( + "base_url" + ) is None: raise Exception( - f"Provider 'azure_openai' requires 'base_url' to be set for model {model_name}" + f"Provider '{provider}' requires 'base_url' to be set for model {model_name}" ) model_id = "/".join(model_name_tokens[1:]) diff --git a/crates/common/src/configuration.rs b/crates/common/src/configuration.rs index 034e9148..9ad2dc0a 100644 --- a/crates/common/src/configuration.rs +++ b/crates/common/src/configuration.rs @@ -173,6 +173,8 @@ pub enum LlmProviderType { TogetherAI, #[serde(rename = "azure_openai")] AzureOpenAI, + #[serde(rename = "ollama")] + Ollama, } impl Display for LlmProviderType { @@ -188,6 +190,7 @@ impl Display for LlmProviderType { LlmProviderType::XAI => write!(f, "xai"), LlmProviderType::TogetherAI => write!(f, "together_ai"), LlmProviderType::AzureOpenAI => write!(f, "azure_openai"), + LlmProviderType::Ollama => write!(f, "ollama"), } } } diff --git a/crates/hermesllm/src/providers/id.rs b/crates/hermesllm/src/providers/id.rs index 13ef4c6e..649e730c 100644 --- a/crates/hermesllm/src/providers/id.rs +++ b/crates/hermesllm/src/providers/id.rs @@ -16,6 +16,7 @@ pub enum ProviderId { AzureOpenAI, XAI, TogetherAI, + Ollama, } impl From<&str> for ProviderId { @@ -32,6 +33,7 @@ impl From<&str> for 
ProviderId { "azure_openai" => ProviderId::AzureOpenAI, "xai" => ProviderId::XAI, "together_ai" => ProviderId::TogetherAI, + "ollama" => ProviderId::Ollama, _ => panic!("Unknown provider: {}", value), } } @@ -55,7 +57,8 @@ impl ProviderId { | ProviderId::GitHub | ProviderId::AzureOpenAI | ProviderId::XAI - | ProviderId::TogetherAI, + | ProviderId::TogetherAI + | ProviderId::Ollama, SupportedAPIs::AnthropicMessagesAPI(_)) => SupportedAPIs::OpenAIChatCompletions(OpenAIApi::ChatCompletions), (ProviderId::OpenAI @@ -67,7 +70,8 @@ impl ProviderId { | ProviderId::GitHub | ProviderId::AzureOpenAI | ProviderId::XAI - | ProviderId::TogetherAI, + | ProviderId::TogetherAI + | ProviderId::Ollama, SupportedAPIs::OpenAIChatCompletions(_)) => SupportedAPIs::OpenAIChatCompletions(OpenAIApi::ChatCompletions), } } @@ -87,6 +91,7 @@ impl Display for ProviderId { ProviderId::AzureOpenAI => write!(f, "azure_openai"), ProviderId::XAI => write!(f, "xai"), ProviderId::TogetherAI => write!(f, "together_ai"), + ProviderId::Ollama => write!(f, "ollama"), } } } diff --git a/demos/use_cases/model_alias_routing/arch_config_with_aliases.yaml b/demos/use_cases/model_alias_routing/arch_config_with_aliases.yaml index d42583e4..49245d55 100644 --- a/demos/use_cases/model_alias_routing/arch_config_with_aliases.yaml +++ b/demos/use_cases/model_alias_routing/arch_config_with_aliases.yaml @@ -26,12 +26,16 @@ llm_providers: - model: anthropic/claude-3-haiku-20240307 access_key: $ANTHROPIC_API_KEY - # Azure OpenAI Models + # Azure OpenAI Models - model: azure_openai/gpt-5-mini access_key: $AZURE_API_KEY base_url: https://katanemo.openai.azure.com + # Ollama Models + - model: ollama/llama3.1 + base_url: http://host.docker.internal:11434 + # Model aliases - friendly names that map to actual provider names model_aliases: @@ -60,7 +64,7 @@ model_aliases: target: gpt-4o-mini chat-model: - target: gpt-4o + target: llama3.1 creative-model: target: claude-3-5-sonnet-20241022 diff --git 
a/docs/source/concepts/llm_provider.rst b/docs/source/concepts/llm_provider.rst deleted file mode 100644 index eabdaa96..00000000 --- a/docs/source/concepts/llm_provider.rst +++ /dev/null @@ -1,76 +0,0 @@ -.. _llm_provider: - -LLM Provider -============ - -**LLM provider** is a top-level primitive in Arch, helping developers centrally define, secure, observe, -and manage the usage of their LLMs. Arch builds on Envoy's reliable `cluster subsystem `_ -to manage egress traffic to LLMs, which includes intelligent routing, retry and fail-over mechanisms, -ensuring high availability and fault tolerance. This abstraction also enables developers to seamlessly -switching between LLM providers or upgrade LLM versions, simplifying the integration and scaling of LLMs -across applications. - - -Below is an example of how you can configure ``llm_providers`` with an instance of an Arch gateway. - -.. literalinclude:: includes/arch_config.yaml - :language: yaml - :linenos: - :lines: 1-20 - :emphasize-lines: 10-16 - :caption: Example Configuration - -.. Note:: - When you start Arch, it creates a listener port for egress traffic based on the presence of ``llm_providers`` - configuration section in the ``arch_config.yml`` file. Arch binds itself to a local address such as - ``127.0.0.1:12000``. - -Arch also offers vendor-agnostic SDKs and libraries to make LLM calls to API-based LLM providers (like OpenAI, -Anthropic, Mistral, Cohere, etc.) and supports calls to OSS LLMs that are hosted on your infrastructure. Arch -abstracts the complexities of integrating with different LLM providers, providing a unified interface for making -calls, handling retries, managing rate limits, and ensuring seamless integration with cloud-based and on-premise -LLMs. Simply configure the details of the LLMs your application will use, and Arch offers a unified interface to -make outbound LLM calls. 
- -Adding custom LLM Provider --------------------------- - -We support any OpenAI compliant LLM for example mistral, openai, ollama etc. We also offer first class support for OpenAI, Anthropic, DeepSeek, Mistral, Groq, and Ollama based models. -You can easily configure an LLM that communicates over the OpenAI API interface, by following the below guide. - -For example following code block shows you how to add an ollama-supported LLM in the ``arch_config.yaml`` file. - -.. code-block:: yaml - - llm_providers: - - model: some_custom_llm_provider/llama3.2 - provider_interface: openai - base_url: http://host.docker.internal:11434 - -And in the following code block shows you how to add mistral llm provider in the ``arch_config.yaml`` file. - -.. code-block:: yaml - - llm_providers: - - name: mistral/ministral-3b-latest - access_key: $MISTRAL_API_KEY - -Example: Using the OpenAI Python SDK ------------------------------------- - -.. code-block:: python - - from openai import OpenAI - - # Initialize the Arch client - client = OpenAI(base_url="http://127.0.0.1:2000/") - - # Define your model and messages - model = "llama3.2" - messages = [{"role": "user", "content": "What is the capital of France?"}] - - # Send the messages to the LLM through Arch - response = client.chat.completions.create(model=model, messages=messages) - - # Print the response - print("LLM Response:", response.choices[0].message.content) diff --git a/docs/source/concepts/llm_providers/client_libraries.rst b/docs/source/concepts/llm_providers/client_libraries.rst new file mode 100644 index 00000000..5c3aa0b4 --- /dev/null +++ b/docs/source/concepts/llm_providers/client_libraries.rst @@ -0,0 +1,420 @@ +.. _client_libraries: + +Client Libraries +================ + +Arch provides a unified interface that works seamlessly with multiple client libraries and tools. You can use your preferred client library without changing your existing code - just point it to Arch's gateway endpoints. 
+ +Supported Clients +------------------ + +- **OpenAI SDK** - Full compatibility with OpenAI's official client +- **Anthropic SDK** - Native support for Anthropic's client library +- **cURL** - Direct HTTP requests for any programming language +- **Custom HTTP Clients** - Any HTTP client that supports REST APIs + +Gateway Endpoints +----------------- + +Arch exposes two main endpoints: + +.. list-table:: + :header-rows: 1 + :widths: 40 60 + + * - Endpoint + - Purpose + * - ``http://127.0.0.1:12000/v1/chat/completions`` + - OpenAI-compatible chat completions (LLM Gateway) + * - ``http://127.0.0.1:12000/v1/messages`` + - Anthropic-compatible messages (LLM Gateway) + +OpenAI (Python) SDK +------------------- + +The OpenAI SDK works with any provider through Arch's OpenAI-compatible endpoint. + +**Installation:** + +.. code-block:: bash + + pip install openai + +**Basic Usage:** + +.. code-block:: python + + from openai import OpenAI + + # Point to Arch's LLM Gateway + client = OpenAI( + api_key="test-key", # Can be any value for local testing + base_url="http://127.0.0.1:12000/v1" + ) + + # Use any model configured in your arch_config.yaml + completion = client.chat.completions.create( + model="gpt-4o-mini", # Or use :ref:`model aliases <model_aliases>` like "fast-model" + max_tokens=50, + messages=[ + { + "role": "user", + "content": "Hello, how are you?" + } + ] + ) + + print(completion.choices[0].message.content) + +**Streaming Responses:** + +.. 
code-block:: python + + from openai import OpenAI + + client = OpenAI( + api_key="test-key", + base_url="http://127.0.0.1:12000/v1" + ) + + stream = client.chat.completions.create( + model="gpt-4o-mini", + max_tokens=50, + messages=[ + { + "role": "user", + "content": "Tell me a short story" + } + ], + stream=True + ) + + # Collect streaming chunks + for chunk in stream: + if chunk.choices[0].delta.content: + print(chunk.choices[0].delta.content, end="") + +**Using with Non-OpenAI Models:** + +The OpenAI SDK can be used with any provider configured in Arch: + +.. code-block:: python + + # Using Claude model through OpenAI SDK + completion = client.chat.completions.create( + model="claude-3-5-sonnet-20241022", + max_tokens=50, + messages=[ + { + "role": "user", + "content": "Explain quantum computing briefly" + } + ] + ) + + # Using Ollama model through OpenAI SDK + completion = client.chat.completions.create( + model="llama3.1", + max_tokens=50, + messages=[ + { + "role": "user", + "content": "What's the capital of France?" + } + ] + ) + +Anthropic (Python) SDK +---------------------- + +The Anthropic SDK works with any provider through Arch's Anthropic-compatible endpoint. + +**Installation:** + +.. code-block:: bash + + pip install anthropic + +**Basic Usage:** + +.. code-block:: python + + import anthropic + + # Point to Arch's LLM Gateway + client = anthropic.Anthropic( + api_key="test-key", # Can be any value for local testing + base_url="http://127.0.0.1:12000" + ) + + # Use any model configured in your arch_config.yaml + message = client.messages.create( + model="claude-3-5-sonnet-20241022", + max_tokens=50, + messages=[ + { + "role": "user", + "content": "Hello, please respond briefly!" + } + ] + ) + + print(message.content[0].text) + +**Streaming Responses:** + +.. 
code-block:: python + + import anthropic + + client = anthropic.Anthropic( + api_key="test-key", + base_url="http://127.0.0.1:12000" + ) + + with client.messages.stream( + model="claude-3-5-sonnet-20241022", + max_tokens=50, + messages=[ + { + "role": "user", + "content": "Tell me about artificial intelligence" + } + ] + ) as stream: + # Collect text deltas + for text in stream.text_stream: + print(text, end="") + + # Get final assembled message + final_message = stream.get_final_message() + final_text = "".join(block.text for block in final_message.content if block.type == "text") + +**Using with Non-Anthropic Models:** + +The Anthropic SDK can be used with any provider configured in Arch: + +.. code-block:: python + + # Using OpenAI model through Anthropic SDK + message = client.messages.create( + model="gpt-4o-mini", + max_tokens=50, + messages=[ + { + "role": "user", + "content": "Explain machine learning in simple terms" + } + ] + ) + + # Using Ollama model through Anthropic SDK + message = client.messages.create( + model="llama3.1", + max_tokens=50, + messages=[ + { + "role": "user", + "content": "What is Python programming?" + } + ] + ) + +cURL Examples +------------- + +For direct HTTP requests or integration with any programming language: + +**OpenAI-Compatible Endpoint:** + +.. 
code-block:: bash + + # Basic request + curl -X POST http://127.0.0.1:12000/v1/chat/completions \ + -H "Content-Type: application/json" \ + -H "Authorization: Bearer test-key" \ + -d '{ + "model": "gpt-4o-mini", + "messages": [ + {"role": "user", "content": "Hello!"} + ], + "max_tokens": 50 + }' + + # Using :ref:`model aliases <model_aliases>` + curl -X POST http://127.0.0.1:12000/v1/chat/completions \ + -H "Content-Type: application/json" \ + -d '{ + "model": "fast-model", + "messages": [ + {"role": "user", "content": "Summarize this text..."} + ], + "max_tokens": 100 + }' + + # Streaming request + curl -X POST http://127.0.0.1:12000/v1/chat/completions \ + -H "Content-Type: application/json" \ + -d '{ + "model": "gpt-4o-mini", + "messages": [ + {"role": "user", "content": "Tell me a story"} + ], + "stream": true, + "max_tokens": 200 + }' + +**Anthropic-Compatible Endpoint:** + +.. code-block:: bash + + # Basic request + curl -X POST http://127.0.0.1:12000/v1/messages \ + -H "Content-Type: application/json" \ + -H "x-api-key: test-key" \ + -H "anthropic-version: 2023-06-01" \ + -d '{ + "model": "claude-3-5-sonnet-20241022", + "max_tokens": 50, + "messages": [ + {"role": "user", "content": "Hello Claude!"} + ] + }' + +Cross-Client Compatibility +-------------------------- + +One of Arch's key features is cross-client compatibility. You can: + +**Use OpenAI SDK with Claude Models:** + +.. code-block:: python + + # OpenAI client calling Claude model + from openai import OpenAI + + client = OpenAI(base_url="http://127.0.0.1:12000/v1", api_key="test") + + response = client.chat.completions.create( + model="claude-3-5-sonnet-20241022", # Claude model + messages=[{"role": "user", "content": "Hello"}] + ) + +**Use Anthropic SDK with OpenAI Models:** + +.. 
code-block:: python + + # Anthropic client calling OpenAI model + import anthropic + + client = anthropic.Anthropic(base_url="http://127.0.0.1:12000", api_key="test") + + response = client.messages.create( + model="gpt-4o-mini", # OpenAI model + max_tokens=50, + messages=[{"role": "user", "content": "Hello"}] + ) + +**Mix and Match with** :ref:`Model Aliases <model_aliases>`: + +.. code-block:: python + + # Same code works with different underlying models + def ask_question(client, question): + return client.chat.completions.create( + model="reasoning-model", # Alias could point to any provider + messages=[{"role": "user", "content": question}] + ) + + # Works regardless of what "reasoning-model" actually points to + openai_client = OpenAI(base_url="http://127.0.0.1:12000/v1", api_key="test") + response = ask_question(openai_client, "Solve this math problem...") + +Error Handling +-------------- + +**OpenAI SDK Error Handling:** + +.. code-block:: python + + from openai import OpenAI + import openai + + client = OpenAI(base_url="http://127.0.0.1:12000/v1", api_key="test") + + try: + completion = client.chat.completions.create( + model="nonexistent-model", + messages=[{"role": "user", "content": "Hello"}] + ) + except openai.NotFoundError as e: + print(f"Model not found: {e}") + except openai.APIError as e: + print(f"API error: {e}") + +**Anthropic SDK Error Handling:** + +.. code-block:: python + + import anthropic + + client = anthropic.Anthropic(base_url="http://127.0.0.1:12000", api_key="test") + + try: + message = client.messages.create( + model="nonexistent-model", + max_tokens=50, + messages=[{"role": "user", "content": "Hello"}] + ) + except anthropic.NotFoundError as e: + print(f"Model not found: {e}") + except anthropic.APIError as e: + print(f"API error: {e}") + +Best Practices +-------------- + +**Use** :ref:`Model Aliases <model_aliases>`: +Instead of hardcoding provider-specific model names, use semantic aliases: + +.. 
code-block:: python + + # Good - uses semantic alias + model = "fast-model" + + # Less ideal - hardcoded provider model + model = "openai/gpt-4o-mini" + +**Environment-Based Configuration:** +Use different :ref:`model aliases <model_aliases>` for different environments: + +.. code-block:: python + + import os + + # Development uses cheaper/faster models + model = os.getenv("MODEL_ALIAS", "dev.chat.v1") + + response = client.chat.completions.create( + model=model, + messages=[{"role": "user", "content": "Hello"}] + ) + +**Graceful Fallbacks:** +Implement fallback logic for better reliability: + +.. code-block:: python + + def chat_with_fallback(client, messages, primary_model="smart-model", fallback_model="fast-model"): + try: + return client.chat.completions.create(model=primary_model, messages=messages) + except Exception as e: + print(f"Primary model failed, trying fallback: {e}") + return client.chat.completions.create(model=fallback_model, messages=messages) + +See Also +-------- + +- :ref:`supported_providers` - Configure your providers and see available models +- :ref:`model_aliases` - Create semantic model names +- :ref:`llm_router` - Intelligent routing capabilities diff --git a/docs/source/concepts/llm_providers/llm_providers.rst b/docs/source/concepts/llm_providers/llm_providers.rst new file mode 100644 index 00000000..782a7163 --- /dev/null +++ b/docs/source/concepts/llm_providers/llm_providers.rst @@ -0,0 +1,80 @@ +.. _llm_providers: + +LLM Providers +============= +**LLM Providers** are a top-level primitive in Arch, helping developers centrally define, secure, observe, +and manage the usage of their LLMs. Arch builds on Envoy's reliable `cluster subsystem `_ +to manage egress traffic to LLMs, which includes intelligent routing, retry and fail-over mechanisms, +ensuring high availability and fault tolerance. 
This abstraction also enables developers to seamlessly +switch between LLM providers or upgrade LLM versions, simplifying the integration and scaling of LLMs +across applications. + +Today, we are enabling you to connect to 11+ different AI providers through a unified interface with advanced routing and management capabilities. +Whether you're using OpenAI, Anthropic, Azure OpenAI, local Ollama models, or any OpenAI-compatible provider, Arch provides seamless integration with enterprise-grade features. + +Core Capabilities +----------------- + +**Multi-Provider Support** +Connect to any combination of providers simultaneously (see :ref:`supported_providers` for full details): + +- **First-Class Providers**: Native integrations with OpenAI, Anthropic, DeepSeek, Mistral, Groq, Google Gemini, Together AI, xAI, Azure OpenAI, and Ollama +- **OpenAI-Compatible Providers**: Any provider implementing the OpenAI Chat Completions API standard + +**Intelligent Routing** +Three powerful routing approaches to optimize model selection: + +- **Model-based Routing**: Direct routing to specific models using provider/model names (see :ref:`supported_providers`) +- **Alias-based Routing**: Semantic routing using custom aliases (see :ref:`model_aliases`) +- **Preference-aligned Routing**: Intelligent routing using the Arch-Router model (see :ref:`preference_aligned_routing`) + +**Unified Client Interface** +Use your preferred client library without changing existing code (see :ref:`client_libraries` for details): + +- **OpenAI Python SDK**: Full compatibility with all providers +- **Anthropic Python SDK**: Native support with cross-provider capabilities +- **cURL & HTTP Clients**: Direct REST API access for any programming language +- **Custom Integrations**: Standard HTTP interfaces for seamless integration + +Key Benefits +------------ + +- **Provider Flexibility**: Switch between providers without changing client code +- **Three Routing Methods**: Choose from model-based, 
alias-based, or preference-aligned routing (using `Arch-Router-1.5B `_) strategies +- **Cost Optimization**: Route requests to cost-effective models based on complexity +- **Performance Optimization**: Use fast models for simple tasks, powerful models for complex reasoning +- **Environment Management**: Configure different models for different environments +- **Future-Proof**: Easy to add new providers and upgrade models + +Common Use Cases +---------------- + +**Development Teams** +- Use aliases like ``dev.chat.v1`` and ``prod.chat.v1`` for environment-specific models +- Route simple queries to fast/cheap models, complex tasks to powerful models +- Test new models safely using canary deployments (coming soon) + +**Production Applications** +- Implement fallback strategies across multiple providers for reliability +- Use intelligent routing to optimize cost and performance automatically +- Monitor usage patterns and model performance across providers + +**Enterprise Deployments** +- Connect to both cloud providers and on-premises models (Ollama, custom deployments) +- Apply consistent security and governance policies across all providers +- Scale across regions using different provider endpoints + +Advanced Features +----------------- +- :ref:`preference_aligned_routing` - Learn about preference-aligned dynamic routing and intelligent model selection + +Getting Started +--------------- +Dive into specific areas based on your needs: + +.. toctree:: + :maxdepth: 2 + + supported_providers + client_libraries + model_aliases diff --git a/docs/source/concepts/llm_providers/model_aliases.rst b/docs/source/concepts/llm_providers/model_aliases.rst new file mode 100644 index 00000000..2d29be93 --- /dev/null +++ b/docs/source/concepts/llm_providers/model_aliases.rst @@ -0,0 +1,254 @@ +.. 
_model_aliases: + +Model Aliases +============= + +Model aliases provide semantic, version-controlled names for your models, enabling cleaner client code, easier model management, and advanced routing capabilities. Instead of using provider-specific model names like ``gpt-4o-mini`` or ``claude-3-5-sonnet-20241022``, you can create meaningful aliases like ``fast-model`` or ``arch.summarize.v1``. + +**Benefits of Model Aliases:** + +- **Semantic Naming**: Use descriptive names that reflect the model's purpose +- **Version Control**: Implement versioning schemes (e.g., ``v1``, ``v2``) for model upgrades +- **Environment Management**: Different aliases can point to different models across environments +- **Client Simplification**: Clients use consistent, meaningful names regardless of underlying provider +- **Advanced Routing (Coming Soon)**: Enable guardrails, fallbacks, and traffic splitting at the alias level + +Basic Configuration +------------------- + +**Simple Alias Mapping** + +.. code-block:: yaml + :caption: Basic Model Aliases + + llm_providers: + - model: openai/gpt-4o-mini + access_key: $OPENAI_API_KEY + + - model: openai/gpt-4o + access_key: $OPENAI_API_KEY + + - model: anthropic/claude-3-5-sonnet-20241022 + access_key: $ANTHROPIC_API_KEY + + - model: ollama/llama3.1 + base_url: http://host.docker.internal:11434 + + # Define aliases that map to the models above + model_aliases: + # Semantic versioning approach + arch.summarize.v1: + target: gpt-4o-mini + + arch.reasoning.v1: + target: gpt-4o + + arch.creative.v1: + target: claude-3-5-sonnet-20241022 + + # Functional aliases + fast-model: + target: gpt-4o-mini + + smart-model: + target: gpt-4o + + creative-model: + target: claude-3-5-sonnet-20241022 + + # Local model alias + local-chat: + target: llama3.1 + +Using Aliases +------------- + +**Client Code Examples** + +Once aliases are configured, clients can use semantic names instead of provider-specific model names: + +.. 
code-block:: python + :caption: Python Client Usage + + from openai import OpenAI + + client = OpenAI(base_url="http://127.0.0.1:12000/v1", api_key="test") + + # Use semantic alias instead of provider model name + response = client.chat.completions.create( + model="arch.summarize.v1", # Points to gpt-4o-mini + messages=[{"role": "user", "content": "Summarize this document..."}] + ) + + # Switch to a different capability + response = client.chat.completions.create( + model="arch.reasoning.v1", # Points to gpt-4o + messages=[{"role": "user", "content": "Solve this complex problem..."}] + ) + +.. code-block:: bash + :caption: cURL Example + + curl -X POST http://127.0.0.1:12000/v1/chat/completions \ + -H "Content-Type: application/json" \ + -d '{ + "model": "fast-model", + "messages": [{"role": "user", "content": "Hello!"}] + }' + +Naming Best Practices +--------------------- + +**Semantic Versioning** + +Use version numbers for backward compatibility and gradual model upgrades: + +.. code-block:: yaml + + model_aliases: + # Current production version + arch.summarize.v1: + target: gpt-4o-mini + + # Beta version for testing + arch.summarize.v2: + target: gpt-4o + + # Stable alias that always points to latest + arch.summarize.latest: + target: gpt-4o-mini + +**Purpose-Based Naming** + +Create aliases that reflect the intended use case: + +.. code-block:: yaml + + model_aliases: + # Task-specific + code-reviewer: + target: gpt-4o + + document-summarizer: + target: gpt-4o-mini + + creative-writer: + target: claude-3-5-sonnet-20241022 + + data-analyst: + target: gpt-4o + +**Environment-Specific Aliases** + +Different environments can use different underlying models: + +.. 
code-block:: yaml + + model_aliases: + # Development environment - use faster/cheaper models + dev.chat.v1: + target: gpt-4o-mini + + # Production environment - use more capable models + prod.chat.v1: + target: gpt-4o + + # Staging environment - test new models + staging.chat.v1: + target: claude-3-5-sonnet-20241022 + +Advanced Features (Coming Soon) +-------------------------------- + +The following features are planned for future releases of model aliases: + +**Guardrails Integration** + +Apply safety, cost, or latency rules at the alias level: + +.. code-block:: yaml + :caption: Future Feature - Guardrails + + model_aliases: + arch.reasoning.v1: + target: gpt-oss-120b + guardrails: + max_latency: 5s + max_cost_per_request: 0.10 + block_categories: ["jailbreak", "PII"] + content_filters: + - type: "profanity" + - type: "sensitive_data" + +**Fallback Chains** + +Provide a chain of models if the primary target fails or hits quota limits: + +.. code-block:: yaml + :caption: Future Feature - Fallbacks + + model_aliases: + arch.summarize.v1: + target: gpt-4o-mini + fallbacks: + - target: llama3.1 + conditions: ["quota_exceeded", "timeout"] + - target: claude-3-haiku-20240307 + conditions: ["primary_and_first_fallback_failed"] + +**Traffic Splitting & Canary Deployments** + +Distribute traffic across multiple models for A/B testing or gradual rollouts: + +.. code-block:: yaml + :caption: Future Feature - Traffic Splitting + + model_aliases: + arch.v1: + targets: + - model: llama3.1 + weight: 80 + - model: gpt-4o-mini + weight: 20 + + # Canary deployment + arch.experimental.v1: + targets: + - model: gpt-4o # Current stable + weight: 95 + - model: o1-preview # New model being tested + weight: 5 + +**Load Balancing** + +Distribute requests across multiple instances of the same model: + +.. 
code-block:: yaml + :caption: Future Feature - Load Balancing + + model_aliases: + high-throughput-chat: + load_balance: + algorithm: "round_robin" # or "least_connections", "weighted" + targets: + - model: gpt-4o-mini + endpoint: "https://api-1.example.com" + - model: gpt-4o-mini + endpoint: "https://api-2.example.com" + - model: gpt-4o-mini + endpoint: "https://api-3.example.com" + + +Validation Rules +---------------- + +- Alias names must be valid identifiers (alphanumeric, dots, hyphens, underscores) +- Target models must be defined in the ``llm_providers`` section +- Circular references between aliases are not allowed +- Weights in traffic splitting must sum to 100 + +See Also +-------- + +- :ref:`llm_providers` - Learn about configuring LLM providers +- :ref:`llm_router` - Understand how aliases work with intelligent routing diff --git a/docs/source/concepts/llm_providers/supported_providers.rst b/docs/source/concepts/llm_providers/supported_providers.rst new file mode 100644 index 00000000..4c32b40b --- /dev/null +++ b/docs/source/concepts/llm_providers/supported_providers.rst @@ -0,0 +1,551 @@ +.. _supported_providers: + +Supported Providers & Configuration +=================================== + +Arch provides first-class support for multiple LLM providers through native integrations and OpenAI-compatible interfaces. This comprehensive guide covers all supported providers, their available chat models, and detailed configuration instructions. + +.. note:: + **Model Support:** Arch supports all chat models from each provider, not just the examples shown in this guide. The configurations below demonstrate common models for reference, but you can use any chat model available from your chosen provider. + +Configuration Structure +----------------------- + +All providers are configured in the ``llm_providers`` section of your ``arch_config.yaml`` file: + +.. 
code-block:: yaml + + version: v0.1 + + listeners: + egress_traffic: + address: 0.0.0.0 + port: 12000 + message_format: openai + timeout: 30s + + llm_providers: + # Provider configurations go here + - model: provider/model-name + access_key: $API_KEY + # Additional provider-specific options + +**Common Configuration Fields:** + +- ``model``: Provider prefix and model name (format: ``provider/model-name``) +- ``access_key``: API key for authentication (supports environment variables) +- ``default``: Mark a model as the default (optional, boolean) +- ``name``: Custom name for the provider instance (optional) +- ``base_url``: Custom endpoint URL (required for some providers) + +Provider Categories +------------------- + +**First-Class Providers** +Native integrations with built-in support for provider-specific features and authentication. + +**OpenAI-Compatible Providers** +Any provider that implements the OpenAI API interface can be configured using custom endpoints. + +Supported API Endpoints +------------------------ + +Arch supports the following standardized endpoints across providers: + +.. list-table:: + :header-rows: 1 + :widths: 30 30 40 + + * - Endpoint + - Purpose + - Supported Clients + * - ``/v1/chat/completions`` + - OpenAI-style chat completions + - OpenAI SDK, cURL, custom clients + * - ``/v1/messages`` + - Anthropic-style messages + - Anthropic SDK, cURL, custom clients + +First-Class Providers +--------------------- + +OpenAI +~~~~~~ + +**Provider Prefix:** ``openai/`` + +**API Endpoint:** ``/v1/chat/completions`` + +**Authentication:** API Key - Get your OpenAI API key from `OpenAI Platform `_. + +**Supported Chat Models:** All OpenAI chat models including GPT-5, GPT-4o, GPT-4, GPT-3.5-turbo, and all future releases. + +.. 
list-table:: + :header-rows: 1 + :widths: 30 20 50 + + * - Model Name + - Model ID for Config + - Description + * - GPT-5 + - ``openai/gpt-5`` + - Next-generation model (use any model name from OpenAI's API) + * - GPT-4o + - ``openai/gpt-4o`` + - Latest multimodal model + * - GPT-4o mini + - ``openai/gpt-4o-mini`` + - Fast, cost-effective model + * - GPT-4 + - ``openai/gpt-4`` + - High-capability reasoning model + * - GPT-3.5 Turbo + - ``openai/gpt-3.5-turbo`` + - Balanced performance and cost + * - o3-mini + - ``openai/o3-mini`` + - Reasoning-focused model (preview) + * - o3 + - ``openai/o3`` + - Advanced reasoning model (preview) + +**Configuration Examples:** + +.. code-block:: yaml + + llm_providers: + # Latest models (examples - use any OpenAI chat model) + - model: openai/gpt-4o-mini + access_key: $OPENAI_API_KEY + default: true + + - model: openai/gpt-4o + access_key: $OPENAI_API_KEY + + # Use any model name from OpenAI's API + - model: openai/gpt-5 + access_key: $OPENAI_API_KEY + +Anthropic +~~~~~~~~~ + +**Provider Prefix:** ``anthropic/`` + +**API Endpoint:** ``/v1/messages`` + +**Authentication:** API Key - Get your Anthropic API key from `Anthropic Console `_. + +**Supported Chat Models:** All Anthropic Claude models including Claude Sonnet 4, Claude 3.5 Sonnet, Claude 3.5 Haiku, Claude 3 Opus, and all future releases. + +.. 
list-table:: + :header-rows: 1 + :widths: 30 20 50 + + * - Model Name + - Model ID for Config + - Description + * - Claude Sonnet 4 + - ``anthropic/claude-sonnet-4`` + - Next-generation model (use any model name from Anthropic's API) + * - Claude 3.5 Sonnet + - ``anthropic/claude-3-5-sonnet-20241022`` + - Latest high-performance model + * - Claude 3.5 Haiku + - ``anthropic/claude-3-5-haiku-20241022`` + - Fast and efficient model + * - Claude 3 Opus + - ``anthropic/claude-3-opus-20240229`` + - Most capable model for complex tasks + * - Claude 3 Sonnet + - ``anthropic/claude-3-sonnet-20240229`` + - Balanced performance model + * - Claude 3 Haiku + - ``anthropic/claude-3-haiku-20240307`` + - Fastest model + +**Configuration Examples:** + +.. code-block:: yaml + + llm_providers: + # Latest models (examples - use any Anthropic chat model) + - model: anthropic/claude-3-5-sonnet-20241022 + access_key: $ANTHROPIC_API_KEY + + - model: anthropic/claude-3-5-haiku-20241022 + access_key: $ANTHROPIC_API_KEY + + # Use any model name from Anthropic's API + - model: anthropic/claude-sonnet-4 + access_key: $ANTHROPIC_API_KEY + +DeepSeek +~~~~~~~~ + +**Provider Prefix:** ``deepseek/`` + +**API Endpoint:** ``/v1/chat/completions`` + +**Authentication:** API Key - Get your DeepSeek API key from `DeepSeek Platform `_. + +**Supported Chat Models:** All DeepSeek chat models including DeepSeek-Chat, DeepSeek-Coder, and all future releases. + +.. list-table:: + :header-rows: 1 + :widths: 30 20 50 + + * - Model Name + - Model ID for Config + - Description + * - DeepSeek Chat + - ``deepseek/deepseek-chat`` + - General purpose chat model + * - DeepSeek Coder + - ``deepseek/deepseek-coder`` + - Code-specialized model + +**Configuration Examples:** + +.. 
code-block:: yaml + + llm_providers: + - model: deepseek/deepseek-chat + access_key: $DEEPSEEK_API_KEY + + - model: deepseek/deepseek-coder + access_key: $DEEPSEEK_API_KEY + +Mistral AI +~~~~~~~~~~ + +**Provider Prefix:** ``mistral/`` + +**API Endpoint:** ``/v1/chat/completions`` + +**Authentication:** API Key - Get your Mistral API key from `Mistral AI Console `_. + +**Supported Chat Models:** All Mistral chat models including Mistral Large, Mistral Small, Ministral, and all future releases. + +.. list-table:: + :header-rows: 1 + :widths: 30 20 50 + + * - Model Name + - Model ID for Config + - Description + * - Mistral Large + - ``mistral/mistral-large-latest`` + - Most capable model + * - Mistral Medium + - ``mistral/mistral-medium-latest`` + - Balanced performance + * - Mistral Small + - ``mistral/mistral-small-latest`` + - Fast and efficient + * - Ministral 3B + - ``mistral/ministral-3b-latest`` + - Compact model + +**Configuration Examples:** + +.. code-block:: yaml + + llm_providers: + - model: mistral/mistral-large-latest + access_key: $MISTRAL_API_KEY + + - model: mistral/mistral-small-latest + access_key: $MISTRAL_API_KEY + +Groq +~~~~ + +**Provider Prefix:** ``groq/`` + +**API Endpoint:** ``/openai/v1/chat/completions`` (transformed internally) + +**Authentication:** API Key - Get your Groq API key from `Groq Console `_. + +**Supported Chat Models:** All Groq chat models including Llama 3, Mixtral, Gemma, and all future releases. + +.. list-table:: + :header-rows: 1 + :widths: 30 20 50 + + * - Model Name + - Model ID for Config + - Description + * - Llama 3 8B + - ``groq/llama3-8b-8192`` + - Fast inference Llama model + * - Llama 3 70B + - ``groq/llama3-70b-8192`` + - Larger Llama model + * - Mixtral 8x7B + - ``groq/mixtral-8x7b-32768`` + - Mixture of experts model + +**Configuration Examples:** + +.. 
code-block:: yaml + + llm_providers: + - model: groq/llama3-8b-8192 + access_key: $GROQ_API_KEY + + - model: groq/mixtral-8x7b-32768 + access_key: $GROQ_API_KEY + +Google Gemini +~~~~~~~~~~~~~ + +**Provider Prefix:** ``gemini/`` + +**API Endpoint:** ``/v1beta/openai/chat/completions`` (transformed internally) + +**Authentication:** API Key - Get your Google AI API key from `Google AI Studio `_. + +**Supported Chat Models:** All Google Gemini chat models including Gemini 1.5 Pro, Gemini 1.5 Flash, and all future releases. + +.. list-table:: + :header-rows: 1 + :widths: 30 20 50 + + * - Model Name + - Model ID for Config + - Description + * - Gemini 1.5 Pro + - ``gemini/gemini-1.5-pro`` + - Advanced reasoning and creativity + * - Gemini 1.5 Flash + - ``gemini/gemini-1.5-flash`` + - Fast and efficient model + +**Configuration Examples:** + +.. code-block:: yaml + + llm_providers: + - model: gemini/gemini-1.5-pro + access_key: $GOOGLE_API_KEY + + - model: gemini/gemini-1.5-flash + access_key: $GOOGLE_API_KEY + +Together AI +~~~~~~~~~~~ + +**Provider Prefix:** ``together_ai/`` + +**API Endpoint:** ``/v1/chat/completions`` + +**Authentication:** API Key - Get your Together AI API key from `Together AI Settings `_. + +**Supported Chat Models:** All Together AI chat models including Llama, CodeLlama, Mixtral, Qwen, and hundreds of other open-source models. + +.. list-table:: + :header-rows: 1 + :widths: 30 20 50 + + * - Model Name + - Model ID for Config + - Description + * - Meta Llama 2 7B + - ``together_ai/meta-llama/Llama-2-7b-chat-hf`` + - Open source chat model + * - Meta Llama 2 13B + - ``together_ai/meta-llama/Llama-2-13b-chat-hf`` + - Larger open source model + * - Code Llama 34B + - ``together_ai/codellama/CodeLlama-34b-Instruct-hf`` + - Code-specialized model + +**Configuration Examples:** + +.. 
code-block:: yaml + + llm_providers: + - model: together_ai/meta-llama/Llama-2-7b-chat-hf + access_key: $TOGETHER_API_KEY + + - model: together_ai/codellama/CodeLlama-34b-Instruct-hf + access_key: $TOGETHER_API_KEY + +xAI +~~~ + +**Provider Prefix:** ``xai/`` + +**API Endpoint:** ``/v1/chat/completions`` + +**Authentication:** API Key - Get your xAI API key from `xAI Console `_. + +**Supported Chat Models:** All xAI chat models including Grok Beta and all future releases. + +.. list-table:: + :header-rows: 1 + :widths: 30 20 50 + + * - Model Name + - Model ID for Config + - Description + * - Grok Beta + - ``xai/grok-beta`` + - Conversational AI model + +**Configuration Examples:** + +.. code-block:: yaml + + llm_providers: + - model: xai/grok-beta + access_key: $XAI_API_KEY + +Providers Requiring Base URL +---------------------------- + +Azure OpenAI +~~~~~~~~~~~~ + +**Provider Prefix:** ``azure_openai/`` + +**API Endpoint:** ``/openai/deployments/{deployment-name}/chat/completions`` (constructed automatically) + +**Authentication:** API Key + Base URL - Get your Azure OpenAI API key from `Azure Portal `_ → Your OpenAI Resource → Keys and Endpoint. + +**Supported Chat Models:** All Azure OpenAI chat models including GPT-4o, GPT-4, GPT-3.5-turbo deployed in your Azure subscription. + +.. code-block:: yaml + + llm_providers: + # Single deployment + - model: azure_openai/gpt-4o + access_key: $AZURE_OPENAI_API_KEY + base_url: https://your-resource.openai.azure.com + + # Multiple deployments + - model: azure_openai/gpt-4o-mini + access_key: $AZURE_OPENAI_API_KEY + base_url: https://your-resource.openai.azure.com + +Ollama +~~~~~~ + +**Provider Prefix:** ``ollama/`` + +**API Endpoint:** ``/v1/chat/completions`` (Ollama's OpenAI-compatible endpoint) + +**Authentication:** None (Base URL only) - Install Ollama from `Ollama.com `_ and pull your desired models. + +**Supported Chat Models:** All chat models available in your local Ollama installation. 
Use ``ollama list`` to see installed models. + +.. code-block:: yaml + + llm_providers: + # Local Ollama installation + - model: ollama/llama3.1 + base_url: http://localhost:11434 + + # Ollama in Docker (from host) + - model: ollama/codellama + base_url: http://host.docker.internal:11434 + +OpenAI-Compatible Providers +~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +**Supported Models:** Any chat models from providers that implement the OpenAI Chat Completions API standard. + +For providers that implement the OpenAI API but aren't natively supported: + +.. code-block:: yaml + + llm_providers: + # Generic OpenAI-compatible provider + - model: custom-provider/custom-model + base_url: https://api.customprovider.com + provider_interface: openai + access_key: $CUSTOM_API_KEY + + # Local deployment + - model: local/llama2-7b + base_url: http://localhost:8000 + provider_interface: openai + +Advanced Configuration +---------------------- + +Multiple Provider Instances +~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Configure multiple instances of the same provider: + +.. code-block:: yaml + + llm_providers: + # Production OpenAI + - model: openai/gpt-4o + access_key: $OPENAI_PROD_KEY + name: openai-prod + + # Development OpenAI (different key/quota) + - model: openai/gpt-4o-mini + access_key: $OPENAI_DEV_KEY + name: openai-dev + +Default Model Configuration +~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Mark one model as the default for fallback scenarios: + +.. code-block:: yaml + + llm_providers: + - model: openai/gpt-4o-mini + access_key: $OPENAI_API_KEY + default: true # Used when no specific model is requested + +Routing Preferences +~~~~~~~~~~~~~~~~~~~ + +Configure routing preferences for dynamic model selection: + +.. 
code-block:: yaml + + llm_providers: + - model: openai/gpt-4o + access_key: $OPENAI_API_KEY + routing_preferences: + - name: complex_reasoning + description: deep analysis, mathematical problem solving, and logical reasoning + - name: code_review + description: reviewing and analyzing existing code for bugs and improvements + + - model: anthropic/claude-3-5-sonnet-20241022 + access_key: $ANTHROPIC_API_KEY + routing_preferences: + - name: creative_writing + description: creative content generation, storytelling, and writing assistance + +Model Selection Guidelines +-------------------------- + +**For Production Applications:** + +- **High Performance**: OpenAI GPT-4o, Anthropic Claude 3.5 Sonnet +- **Cost-Effective**: OpenAI GPT-4o mini, Anthropic Claude 3.5 Haiku +- **Code Tasks**: DeepSeek Coder, Together AI Code Llama +- **Local Deployment**: Ollama with Llama 3.1 or Code Llama + +**For Development/Testing:** + +- **Fast Iteration**: Groq models (optimized inference) +- **Local Testing**: Ollama models +- **Cost Control**: Smaller models like GPT-4o mini or Mistral Small + +See Also +-------- + +- :ref:`client_libraries` - Using different client libraries with providers +- :ref:`model_aliases` - Creating semantic model names +- :ref:`llm_router` - Setting up intelligent routing diff --git a/docs/source/concepts/tech_overview/listener.rst b/docs/source/concepts/tech_overview/listener.rst index b6795ce6..dd486986 100644 --- a/docs/source/concepts/tech_overview/listener.rst +++ b/docs/source/concepts/tech_overview/listener.rst @@ -22,7 +22,7 @@ Upstream (Egress) Arch automatically configures a listener to route requests from your application to upstream LLM API providers (or hosts). When you start Arch, it creates a listener for egress traffic based on the presence of the ``listener`` configuration section in the configuration file. 
Arch binds itself to a local address such as ``127.0.0.1:12000/v1`` or a DNS-based -address like ``arch.local:12000/v1`` for outgoing traffic. For more details on LLM providers, read :ref:`here `. +address like ``arch.local:12000/v1`` for outgoing traffic. For more details on LLM providers, read :ref:`here `. Configure Listener ^^^^^^^^^^^^^^^^^^ diff --git a/docs/source/concepts/tech_overview/terminology.rst b/docs/source/concepts/tech_overview/terminology.rst index 0184ed40..4257a20f 100644 --- a/docs/source/concepts/tech_overview/terminology.rst +++ b/docs/source/concepts/tech_overview/terminology.rst @@ -31,7 +31,7 @@ code to LLMs. When you start Arch, you specify a listener address/port that you want to bind downstream. But, Arch uses are predefined port that you can use (``127.0.0.1:12000``) to proxy egress calls originating from your application to LLMs (API-based or hosted). - For more details, check out :ref:`LLM provider `. + For more details, check out :ref:`LLM providers `. **Prompt Target**: Arch offers a primitive called :ref:`prompt target ` to help separate business logic from undifferentiated work in building generative AI apps. Prompt targets are endpoints that receive prompts that are processed by Arch. diff --git a/docs/source/get_started/overview.rst b/docs/source/get_started/overview.rst index ac769cc2..8a9a43a7 100644 --- a/docs/source/get_started/overview.rst +++ b/docs/source/get_started/overview.rst @@ -3,10 +3,9 @@ Overview ============ -`Arch `_ is a smart edge and AI gateway for AI-native apps - one that is natively designed to handle and process prompts, not just network traffic. - -Built by contributors to the widely adopted `Envoy Proxy `_, Arch handles the *pesky low-level work* in building agentic apps — like applying guardrails, clarifying vague user input, routing prompts to the right agent, and unifying access to any LLM. 
It’s a language and framework friendly infrastructure layer designed to help you build and ship agentic apps faster. +`Arch `_ is a smart edge and AI gateway for AI agents - one that is natively designed to handle and process prompts, not just network traffic. +Built by contributors to the widely adopted `Envoy Proxy `_, Arch handles the *pesky low-level work* in building agentic apps — like applying guardrails, clarifying vague user input, routing prompts to the right agent, and unifying access to any LLM. It’s a protocol-friendly and framework-agnostic infrastructure layer designed to help you build and ship agentic apps faster. In this documentation, you will learn how to quickly set up Arch to trigger API calls via prompts, apply prompt guardrails without writing any application-level logic, simplify the interaction with upstream LLMs, and improve observability all while simplifying your application development process. @@ -53,8 +52,8 @@ Deep dive into essential ideas and mechanisms behind Arch: Learn about the technology stack - .. grid-item-card:: :octicon:`webhook` LLM Provider - :link: ../concepts/llm_provider.html + .. grid-item-card:: :octicon:`webhook` LLM Providers + :link: ../concepts/llm_providers/llm_providers.html Explore Arch’s LLM integration options diff --git a/docs/source/guides/function_calling.rst b/docs/source/guides/function_calling.rst index 072571d2..54113dd0 100644 --- a/docs/source/guides/function_calling.rst +++ b/docs/source/guides/function_calling.rst @@ -134,7 +134,7 @@ It will automatically validate parameters, and ensure that the required paramete Once a downstream function (API) is called, Arch Gateway takes the response and sends it an upstream LLM to complete the request (for summarization, Q/A, text generation tasks). -For more details on how Arch Gateway enables you to centralize usage of LLMs, please read :ref:`LLM providers `. 
+For more details on how Arch Gateway enables you to centralize usage of LLMs, please read :ref:`LLM providers `. By completing these steps, you enable Arch to manage the process from validation to response, ensuring users receive consistent, reliable results - and that you are focused on the stuff that matters most. diff --git a/docs/source/guides/llm_router.rst b/docs/source/guides/llm_router.rst index f999860c..963df0f0 100644 --- a/docs/source/guides/llm_router.rst +++ b/docs/source/guides/llm_router.rst @@ -5,18 +5,113 @@ LLM Routing With the rapid proliferation of large language models (LLM) — each optimized for different strengths, style, or latency/cost profile — routing has become an essential technique to operationalize the use of different models. -Arch Router is an intelligent routing system that automatically selects the most appropriate LLM for each user request based on user-defined usage preferences. Specifically Arch-Router guides model selection by matching queries to user-defined domains (e.g., finance and healthcare) and action types (e.g., code generation, image editing, etc.). -Our preference-aligned approach matches practical definitions of performance in the real world and makes routing decisions more transparent and adaptable. +Arch provides three distinct routing approaches to meet different use cases: + +1. **Model-based Routing**: Direct routing to specific models using provider/model names +2. **Alias-based Routing**: Semantic routing using custom aliases that map to underlying models +3. **Preference-aligned Routing**: Intelligent routing using the Arch-Router model based on context and user-defined preferences This enables optimal performance, cost efficiency, and response quality by matching requests with the most suitable model from your available LLM fleet. 
-Routing Workflow -------------------------- +Routing Methods +--------------- + +Model-based Routing +~~~~~~~~~~~~~~~~~~~ + +Direct routing allows you to specify exact provider and model combinations using the format ``provider/model-name``: + +- Use provider-specific names like ``openai/gpt-4o`` or ``anthropic/claude-3-5-sonnet-20241022`` +- Provides full control and transparency over which model handles each request +- Ideal for production workloads where you want predictable routing behavior + +Alias-based Routing +~~~~~~~~~~~~~~~~~~~ + +Alias-based routing lets you create semantic model names that decouple your application from specific providers: + +- Use meaningful names like ``fast-model``, ``reasoning-model``, or ``arch.summarize.v1`` (see :ref:`model_aliases`) +- Maps semantic names to underlying provider models for easier experimentation and provider switching +- Ideal for applications that want abstraction from specific model names while maintaining control + +.. _preference_aligned_routing: + +Preference-aligned Routing (Arch-Router) +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Traditional LLM routing approaches face significant limitations: they evaluate performance using benchmarks that often fail to capture human preferences, select from fixed model pools, and operate as "black boxes" without practical mechanisms for encoding user preferences. + +Arch's preference-aligned routing addresses these challenges by applying a fundamental engineering principle: decoupling. The framework separates route selection (matching queries to human-readable policies) from model assignment (mapping policies to specific LLMs). This separation allows you to define routing policies using descriptive labels like ``Domain: 'finance', Action: 'analyze_earnings_report'`` rather than cryptic identifiers, while independently configuring which models handle each policy. 
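The separation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the word-overlap matcher is a hypothetical stand-in for the Arch-Router model and is not how Arch selects routes, and the names ``RoutePolicy``, ``select_route``, ``assign_model``, ``MODEL_FOR_POLICY``, and ``DEFAULT_MODEL`` are invented for this example rather than taken from Arch. The policy names and model IDs mirror the configuration examples in this guide.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RoutePolicy:
    """A human-readable routing policy (hypothetical type, not an Arch API)."""
    name: str
    description: str


# Step 1 input: policies the route selector matches prompts against.
POLICIES = [
    RoutePolicy("code understanding", "understand and explain existing code snippets"),
    RoutePolicy("creative writing", "creative content generation and storytelling"),
]

# Step 2 input: policy -> model assignment, editable independently of step 1.
MODEL_FOR_POLICY = {
    "code understanding": "openai/gpt-4o",
    "creative writing": "anthropic/claude-3-5-sonnet-20241022",
}
DEFAULT_MODEL = "openai/gpt-4o-mini"


def select_route(prompt: str) -> Optional[str]:
    """Route selection: pick the policy whose description shares the most words
    with the prompt. ASSUMPTION: a naive stand-in for the Arch-Router model."""
    words = set(prompt.lower().split())
    best_name, best_score = None, 0
    for policy in POLICIES:
        score = len(words & set(policy.description.lower().split()))
        if score > best_score:
            best_name, best_score = policy.name, score
    return best_name


def assign_model(route: Optional[str]) -> str:
    """Model assignment: map the selected policy to a model, else the default."""
    return MODEL_FOR_POLICY.get(route, DEFAULT_MODEL)


if __name__ == "__main__":
    for prompt in ("please explain existing code in this function", "hello there"):
        route = select_route(prompt)
        print(prompt, "->", route, "->", assign_model(route))
```

Because the policy-to-model table is separate from route selection, changing the model behind ``creative writing`` is a one-line edit to ``MODEL_FOR_POLICY`` and does not touch ``select_route``.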
+ +The `Arch-Router `_ model automatically selects the most appropriate LLM based on: + +- Domain Analysis: Identifies the subject matter (e.g., legal, healthcare, programming) +- Action Classification: Determines the type of operation (e.g., summarization, code generation, translation) +- User-Defined Preferences: Maps domains and actions to preferred models using transparent, configurable routing decisions +- Human Preference Alignment: Uses domain-action mappings that capture subjective evaluation criteria, ensuring routing aligns with real-world user needs rather than just benchmark scores + +This approach supports seamlessly adding new models without retraining and is ideal for dynamic, context-aware routing that adapts to request content and intent. + + +Model-based Routing Workflow +---------------------------- + +For direct model routing, the process is straightforward: + +#. **Client Request** + + The client specifies the exact model using provider/model format (for example, ``openai/gpt-4o``). + +#. **Provider Validation** + + Arch validates that the specified provider and model are configured and available. + +#. **Direct Routing** + + The request is sent directly to the specified model without analysis or decision-making. + +#. **Response Handling** + + The response is returned to the client with optional metadata about the routing decision. + + +Alias-based Routing Workflow +----------------------------- + +For alias-based routing, the process includes name resolution: + +#. **Client Request** + + The client specifies a semantic alias name (for example, ``reasoning-model``). + +#. **Alias Resolution** + + Arch resolves the alias to the actual provider/model name based on configuration. + +#. **Model Selection** + + If the alias maps to multiple models (multi-target aliases such as traffic splitting and load balancing are planned features; see :ref:`model_aliases`), Arch selects one based on availability and load balancing. + +#. 
**Request Forwarding** + + The request is forwarded to the resolved model. + +#. 
**Response Handling** + + The response is returned with optional metadata about the alias resolution. + + +.. _preference_aligned_routing_workflow: + +Preference-aligned Routing Workflow (Arch-Router) +------------------------------------------------- + +For preference-aligned dynamic routing, the process involves intelligent analysis: #. **Prompt Analysis** - When a user submits a prompt, the Router analyzes it to determine the domain (subject matter) or action (type of operation requested). + When a user submits a prompt without specifying a model, the Arch-Router analyzes it to determine the domain (subject matter) and action (type of operation requested). #. **Model Selection** @@ -32,7 +127,18 @@ Routing Workflow Arch-Router ------------------------- -The `Arch-Router `_ is a state-of-the-art **preference-based routing model** specifically designed for intelligent LLM selection. This model delivers production-ready performance with low latency and high accuracy. +The `Arch-Router `_ is a state-of-the-art **preference-based routing model** specifically designed to address the limitations of traditional LLM routing. This compact 1.5B model delivers production-ready performance with low latency and high accuracy while solving key routing challenges. + +**Addressing Traditional Routing Limitations:** + +**Human Preference Alignment** +Unlike benchmark-driven approaches, Arch-Router learns to match queries with human preferences by using domain-action mappings that capture subjective evaluation criteria, ensuring routing decisions align with real-world user needs. + +**Flexible Model Integration** +The system supports seamlessly adding new models for routing without requiring retraining or architectural modifications, enabling dynamic adaptation to evolving model landscapes. 
+ +**Preference-Encoded Routing** +Provides a practical mechanism to encode user preferences through domain-action mappings, offering transparent and controllable routing decisions that can be customized for specific use cases. To support effective routing, Arch-Router introduces two key concepts: @@ -53,51 +159,186 @@ In summary, Arch-Router demonstrates: - **Production-Ready Performance**: Optimized for low-latency, high-throughput applications in multi-model environments. -Implementing LLM Routing ------------------------------ +Implementing Routing +-------------------- -To configure LLM routing in our gateway, you need to define a prompt target configuration that specifies the routing model and the LLM providers. This configuration will allow Arch Gateway to route incoming prompts to the appropriate model based on the defined routes. - -Below is an example to show how to set up a prompt target for the Arch Router: - -- **Step 1: Define the routing model in the `routing` section**. You can use the `archgw-v1-router-model` as the katanemo routing model or any other routing model you prefer. - -- **Step 2: Define the listeners in the `listeners` section**. This is where you specify the address and port for incoming traffic, as well as the message format (e.g., OpenAI). - -- **Step 3: Define the LLM providers in the `llm_providers` section**. This is where you specify the routing model, and any other models you want to use for specific tasks and their route usage descriptions (e.g., code generation, code understanding). - -.. Note:: - Make sure you define a model for default usage, such as `gpt-4o`, which will be used when no specific route is matched for an user prompt. +**Model-based Routing** +For direct model routing, configure your LLM providers with specific provider/model names: .. 
code-block:: yaml - :caption: Route Config Example - + :caption: Model-based Routing Configuration listeners: - egress_traffic: + egress_traffic: address: 0.0.0.0 port: 12000 message_format: openai timeout: 30s llm_providers: + - model: openai/gpt-4o-mini + access_key: $OPENAI_API_KEY + default: true - - model: openai/gpt-4o-mini - access_key: $OPENAI_API_KEY - default: true + - model: openai/gpt-4o + access_key: $OPENAI_API_KEY - - model: openai/gpt-4o - access_key: $OPENAI_API_KEY - routing_preferences: - - name: code understanding - description: understand and explain existing code snippets, functions, or libraries + - model: anthropic/claude-3-5-sonnet-20241022 + access_key: $ANTHROPIC_API_KEY - - model: openai/gpt-4.1 - access_key: $OPENAI_API_KEY - routing_preferences: - - name: code generation - description: generating new code snippets, functions, or boilerplate based on user prompts or requirements +Clients specify exact models: + +.. code-block:: python + + # Direct provider/model specification + response = client.chat.completions.create( + model="openai/gpt-4o-mini", + messages=[{"role": "user", "content": "Hello!"}] + ) + + response = client.chat.completions.create( + model="anthropic/claude-3-5-sonnet-20241022", + messages=[{"role": "user", "content": "Write a story"}] + ) + +**Alias-based Routing** + +Configure semantic aliases that map to underlying models: + +.. 
code-block:: yaml + :caption: Alias-based Routing Configuration + + listeners: + egress_traffic: + address: 0.0.0.0 + port: 12000 + message_format: openai + timeout: 30s + + llm_providers: + - model: openai/gpt-4o-mini + access_key: $OPENAI_API_KEY + + - model: openai/gpt-4o + access_key: $OPENAI_API_KEY + + - model: anthropic/claude-3-5-sonnet-20241022 + access_key: $ANTHROPIC_API_KEY + + model_aliases: + # Model aliases - friendly names that map to actual provider names + fast-model: + target: gpt-4o-mini + + reasoning-model: + target: gpt-4o + + creative-model: + target: claude-3-5-sonnet-20241022 + +Clients use semantic names: + +.. code-block:: python + + # Using semantic aliases + response = client.chat.completions.create( + model="fast-model", # Routes to best available fast model + messages=[{"role": "user", "content": "Quick summary please"}] + ) + + response = client.chat.completions.create( + model="reasoning-model", # Routes to best reasoning model + messages=[{"role": "user", "content": "Solve this complex problem"}] + ) + +**Preference-aligned Routing (Arch-Router)** + +To configure preference-aligned dynamic routing, you need to define routing preferences that map domains and actions to specific models: + +.. 
code-block:: yaml + :caption: Preference-Aligned Dynamic Routing Configuration + + listeners: + egress_traffic: + address: 0.0.0.0 + port: 12000 + message_format: openai + timeout: 30s + + llm_providers: + - model: openai/gpt-4o-mini + access_key: $OPENAI_API_KEY + default: true + + - model: openai/gpt-4o + access_key: $OPENAI_API_KEY + routing_preferences: + - name: code understanding + description: understand and explain existing code snippets, functions, or libraries + - name: complex reasoning + description: deep analysis, mathematical problem solving, and logical reasoning + + - model: anthropic/claude-3-5-sonnet-20241022 + access_key: $ANTHROPIC_API_KEY + routing_preferences: + - name: creative writing + description: creative content generation, storytelling, and writing assistance + - name: code generation + description: generating new code snippets, functions, or boilerplate based on user prompts + +Clients can let the router decide or use aliases: + +.. code-block:: python + + # Let Arch-Router choose based on content + response = client.chat.completions.create( + messages=[{"role": "user", "content": "Write a creative story about space exploration"}] + # No model specified - router will analyze and choose claude-3-5-sonnet-20241022 + ) + + +Combining Routing Methods +------------------------- + +You can combine static model selection with dynamic routing preferences for maximum flexibility: + +.. 
code-block:: yaml + :caption: Hybrid Routing Configuration + + llm_providers: + - model: openai/gpt-4o-mini + access_key: $OPENAI_API_KEY + default: true + + - model: openai/gpt-4o + access_key: $OPENAI_API_KEY + routing_preferences: + - name: complex_reasoning + description: deep analysis and complex problem solving + + - model: anthropic/claude-3-5-sonnet-20241022 + access_key: $ANTHROPIC_API_KEY + routing_preferences: + - name: creative_tasks + description: creative writing and content generation + + model_aliases: + # Model aliases - friendly names that map to actual provider names + fast-model: + target: gpt-4o-mini + + reasoning-model: + target: gpt-4o + + # Aliases that can also participate in dynamic routing + creative-model: + target: claude-3-5-sonnet-20241022 + +This configuration allows clients to: + +1. **Use direct model selection**: ``model="fast-model"`` +2. **Let the router decide**: No model specified, router analyzes content Example Use Cases ------------------------- @@ -112,7 +353,7 @@ Here are common scenarios where Arch-Router excels: - **Conversational Routing**: Track conversation context to identify when topics shift between domains or when the type of assistance needed changes mid-conversation. -Best practice +Best practices ------------------------- - **💡Consistent Naming:** Route names should align with their descriptions. diff --git a/docs/source/index.rst b/docs/source/index.rst index bd724eef..9d5a554c 100644 --- a/docs/source/index.rst +++ b/docs/source/index.rst @@ -39,7 +39,7 @@ Built by contributors to the widely adopted `Envoy Proxy