mirror of
https://github.com/katanemo/plano.git
synced 2026-06-17 15:25:17 +02:00
fixed docs and added ollama as a first-class LLM provider
parent
8d0b468345
commit
f7c9d04da9
16 changed files with 1612 additions and 149 deletions
|
|
@ -1,76 +0,0 @@
|
|||
.. _llm_provider:
|
||||
|
||||
LLM Provider
|
||||
============
|
||||
|
||||
**LLM provider** is a top-level primitive in Arch, helping developers centrally define, secure, observe,
|
||||
and manage the usage of their LLMs. Arch builds on Envoy's reliable `cluster subsystem <https://www.envoyproxy.io/docs/envoy/v1.31.2/intro/arch_overview/upstream/cluster_manager>`_
|
||||
to manage egress traffic to LLMs, which includes intelligent routing, retry and fail-over mechanisms,
|
||||
ensuring high availability and fault tolerance. This abstraction also enables developers to seamlessly
|
||||
switch between LLM providers or upgrade LLM versions, simplifying the integration and scaling of LLMs
|
||||
across applications.
|
||||
|
||||
|
||||
Below is an example of how you can configure ``llm_providers`` with an instance of an Arch gateway.
|
||||
|
||||
.. literalinclude:: includes/arch_config.yaml
|
||||
:language: yaml
|
||||
:linenos:
|
||||
:lines: 1-20
|
||||
:emphasize-lines: 10-16
|
||||
:caption: Example Configuration
|
||||
|
||||
.. Note::
|
||||
When you start Arch, it creates a listener port for egress traffic based on the presence of ``llm_providers``
|
||||
configuration section in the ``arch_config.yaml`` file. Arch binds itself to a local address such as
|
||||
``127.0.0.1:12000``.
|
||||
|
||||
Arch also offers vendor-agnostic SDKs and libraries to make LLM calls to API-based LLM providers (like OpenAI,
|
||||
Anthropic, Mistral, Cohere, etc.) and supports calls to OSS LLMs that are hosted on your infrastructure. Arch
|
||||
abstracts the complexities of integrating with different LLM providers, providing a unified interface for making
|
||||
calls, handling retries, managing rate limits, and ensuring seamless integration with cloud-based and on-premise
|
||||
LLMs. Simply configure the details of the LLMs your application will use, and Arch offers a unified interface to
|
||||
make outbound LLM calls.
|
||||
|
||||
Adding custom LLM Provider
|
||||
--------------------------
|
||||
|
||||
We support any OpenAI-compatible LLM, for example Mistral, OpenAI, or Ollama. We also offer first-class support for OpenAI, Anthropic, DeepSeek, Mistral, Groq, and Ollama-based models.
|
||||
To configure an LLM that communicates over the OpenAI API interface, follow the guide below.
|
||||
|
||||
For example, the following code block shows how to add an Ollama-served LLM in the ``arch_config.yaml`` file.
|
||||
|
||||
.. code-block:: yaml
|
||||
|
||||
llm_providers:
|
||||
- model: some_custom_llm_provider/llama3.2
|
||||
provider_interface: openai
|
||||
base_url: http://host.docker.internal:11434
|
||||
|
||||
The following code block shows how to add a Mistral LLM provider in the ``arch_config.yaml`` file.
|
||||
|
||||
.. code-block:: yaml
|
||||
|
||||
llm_providers:
|
||||
- model: mistral/ministral-3b-latest
|
||||
access_key: $MISTRAL_API_KEY
|
||||
|
||||
Example: Using the OpenAI Python SDK
|
||||
------------------------------------
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
from openai import OpenAI
|
||||
|
||||
# Initialize the Arch client
|
||||
client = OpenAI(base_url="http://127.0.0.1:12000/v1", api_key="test-key")
|
||||
|
||||
# Define your model and messages
|
||||
model = "llama3.2"
|
||||
messages = [{"role": "user", "content": "What is the capital of France?"}]
|
||||
|
||||
# Send the messages to the LLM through Arch
|
||||
response = client.chat.completions.create(model=model, messages=messages)
|
||||
|
||||
# Print the response
|
||||
print("LLM Response:", response.choices[0].message.content)
|
||||
420
docs/source/concepts/llm_providers/client_libraries.rst
Normal file
|
|
@ -0,0 +1,420 @@
|
|||
.. _client_libraries:
|
||||
|
||||
Client Libraries
|
||||
================
|
||||
|
||||
Arch provides a unified interface that works with multiple client libraries and tools. You can use your preferred client library without changing your existing code; just point it at Arch's gateway endpoints.
|
||||
|
||||
Supported Clients
|
||||
------------------
|
||||
|
||||
- **OpenAI SDK** - Full compatibility with OpenAI's official client
|
||||
- **Anthropic SDK** - Native support for Anthropic's client library
|
||||
- **cURL** - Direct HTTP requests for any programming language
|
||||
- **Custom HTTP Clients** - Any HTTP client that supports REST APIs
|
||||
|
||||
Gateway Endpoints
|
||||
-----------------
|
||||
|
||||
Arch exposes two main endpoints:
|
||||
|
||||
.. list-table::
|
||||
:header-rows: 1
|
||||
:widths: 40 60
|
||||
|
||||
* - Endpoint
|
||||
- Purpose
|
||||
* - ``http://127.0.0.1:12000/v1/chat/completions``
|
||||
- OpenAI-compatible chat completions (LLM Gateway)
|
||||
* - ``http://127.0.0.1:12000/v1/messages``
|
||||
- Anthropic-compatible messages (LLM Gateway)
|
||||
|
||||
OpenAI (Python) SDK
|
||||
-------------------
|
||||
|
||||
The OpenAI SDK works with any provider through Arch's OpenAI-compatible endpoint.
|
||||
|
||||
**Installation:**
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
pip install openai
|
||||
|
||||
**Basic Usage:**
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
from openai import OpenAI
|
||||
|
||||
# Point to Arch's LLM Gateway
|
||||
client = OpenAI(
|
||||
api_key="test-key", # Can be any value for local testing
|
||||
base_url="http://127.0.0.1:12000/v1"
|
||||
)
|
||||
|
||||
# Use any model configured in your arch_config.yaml
|
||||
completion = client.chat.completions.create(
|
||||
model="gpt-4o-mini",  # Or use a model alias such as "fast-model"
|
||||
max_tokens=50,
|
||||
messages=[
|
||||
{
|
||||
"role": "user",
|
||||
"content": "Hello, how are you?"
|
||||
}
|
||||
]
|
||||
)
|
||||
|
||||
print(completion.choices[0].message.content)
|
||||
|
||||
**Streaming Responses:**
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
from openai import OpenAI
|
||||
|
||||
client = OpenAI(
|
||||
api_key="test-key",
|
||||
base_url="http://127.0.0.1:12000/v1"
|
||||
)
|
||||
|
||||
stream = client.chat.completions.create(
|
||||
model="gpt-4o-mini",
|
||||
max_tokens=50,
|
||||
messages=[
|
||||
{
|
||||
"role": "user",
|
||||
"content": "Tell me a short story"
|
||||
}
|
||||
],
|
||||
stream=True
|
||||
)
|
||||
|
||||
# Collect streaming chunks
|
||||
for chunk in stream:
|
||||
if chunk.choices[0].delta.content:
|
||||
print(chunk.choices[0].delta.content, end="")
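
If you need the whole reply as one string instead of printing it incrementally, the deltas can be accumulated. The sketch below is only an illustration: ``collect_stream_text`` is a hypothetical helper name (not part of the OpenAI SDK or Arch), and it assumes each chunk exposes ``choices[0].delta.content`` as in the loop above.

.. code-block:: python

    from typing import Iterable


    def collect_stream_text(chunks: Iterable) -> str:
        """Join the text deltas of a chat-completions stream into one string."""
        parts = []
        for chunk in chunks:
            # Some chunks may carry no choices or an empty delta; skip them.
            if not chunk.choices:
                continue
            content = chunk.choices[0].delta.content
            if content:
                parts.append(content)
        return "".join(parts)

With the ``stream`` object created above, ``full_text = collect_stream_text(stream)`` would return the assembled reply.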
|
||||
|
||||
**Using with Non-OpenAI Models:**
|
||||
|
||||
The OpenAI SDK can be used with any provider configured in Arch:
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
# Using Claude model through OpenAI SDK
|
||||
completion = client.chat.completions.create(
|
||||
model="claude-3-5-sonnet-20241022",
|
||||
max_tokens=50,
|
||||
messages=[
|
||||
{
|
||||
"role": "user",
|
||||
"content": "Explain quantum computing briefly"
|
||||
}
|
||||
]
|
||||
)
|
||||
|
||||
# Using Ollama model through OpenAI SDK
|
||||
completion = client.chat.completions.create(
|
||||
model="llama3.1",
|
||||
max_tokens=50,
|
||||
messages=[
|
||||
{
|
||||
"role": "user",
|
||||
"content": "What's the capital of France?"
|
||||
}
|
||||
]
|
||||
)
|
||||
|
||||
Anthropic (Python) SDK
|
||||
----------------------
|
||||
|
||||
The Anthropic SDK works with any provider through Arch's Anthropic-compatible endpoint.
|
||||
|
||||
**Installation:**
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
pip install anthropic
|
||||
|
||||
**Basic Usage:**
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
import anthropic
|
||||
|
||||
# Point to Arch's LLM Gateway
|
||||
client = anthropic.Anthropic(
|
||||
api_key="test-key", # Can be any value for local testing
|
||||
base_url="http://127.0.0.1:12000"
|
||||
)
|
||||
|
||||
# Use any model configured in your arch_config.yaml
|
||||
message = client.messages.create(
|
||||
model="claude-3-5-sonnet-20241022",
|
||||
max_tokens=50,
|
||||
messages=[
|
||||
{
|
||||
"role": "user",
|
||||
"content": "Hello, please respond briefly!"
|
||||
}
|
||||
]
|
||||
)
|
||||
|
||||
print(message.content[0].text)
|
||||
|
||||
**Streaming Responses:**
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
import anthropic
|
||||
|
||||
client = anthropic.Anthropic(
|
||||
api_key="test-key",
|
||||
base_url="http://127.0.0.1:12000"
|
||||
)
|
||||
|
||||
with client.messages.stream(
|
||||
model="claude-3-5-sonnet-20241022",
|
||||
max_tokens=50,
|
||||
messages=[
|
||||
{
|
||||
"role": "user",
|
||||
"content": "Tell me about artificial intelligence"
|
||||
}
|
||||
]
|
||||
) as stream:
|
||||
# Collect text deltas
|
||||
for text in stream.text_stream:
|
||||
print(text, end="")
|
||||
|
||||
# Get final assembled message
|
||||
final_message = stream.get_final_message()
|
||||
final_text = "".join(block.text for block in final_message.content if block.type == "text")
|
||||
|
||||
**Using with Non-Anthropic Models:**
|
||||
|
||||
The Anthropic SDK can be used with any provider configured in Arch:
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
# Using OpenAI model through Anthropic SDK
|
||||
message = client.messages.create(
|
||||
model="gpt-4o-mini",
|
||||
max_tokens=50,
|
||||
messages=[
|
||||
{
|
||||
"role": "user",
|
||||
"content": "Explain machine learning in simple terms"
|
||||
}
|
||||
]
|
||||
)
|
||||
|
||||
# Using Ollama model through Anthropic SDK
|
||||
message = client.messages.create(
|
||||
model="llama3.1",
|
||||
max_tokens=50,
|
||||
messages=[
|
||||
{
|
||||
"role": "user",
|
||||
"content": "What is Python programming?"
|
||||
}
|
||||
]
|
||||
)
|
||||
|
||||
cURL Examples
|
||||
-------------
|
||||
|
||||
For direct HTTP requests or integration with any programming language:
|
||||
|
||||
**OpenAI-Compatible Endpoint:**
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
# Basic request
|
||||
curl -X POST http://127.0.0.1:12000/v1/chat/completions \
|
||||
-H "Content-Type: application/json" \
|
||||
-H "Authorization: Bearer test-key" \
|
||||
-d '{
|
||||
"model": "gpt-4o-mini",
|
||||
"messages": [
|
||||
{"role": "user", "content": "Hello!"}
|
||||
],
|
||||
"max_tokens": 50
|
||||
}'
|
||||
|
||||
# Using a model alias
|
||||
curl -X POST http://127.0.0.1:12000/v1/chat/completions \
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{
|
||||
"model": "fast-model",
|
||||
"messages": [
|
||||
{"role": "user", "content": "Summarize this text..."}
|
||||
],
|
||||
"max_tokens": 100
|
||||
}'
|
||||
|
||||
# Streaming request
|
||||
curl -X POST http://127.0.0.1:12000/v1/chat/completions \
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{
|
||||
"model": "gpt-4o-mini",
|
||||
"messages": [
|
||||
{"role": "user", "content": "Tell me a story"}
|
||||
],
|
||||
"stream": true,
|
||||
"max_tokens": 200
|
||||
}'
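
A streaming request to an OpenAI-style API returns server-sent events: lines of the form ``data: {...}``, ending with ``data: [DONE]``. The sketch below shows one way a custom HTTP client might extract the text from such lines. It assumes the gateway relays that framing unchanged, and ``iter_sse_text`` is a hypothetical helper name introduced only for this example.

.. code-block:: python

    import json
    from typing import Iterable, Iterator


    def iter_sse_text(lines: Iterable[str]) -> Iterator[str]:
        """Yield text deltas from OpenAI-style server-sent event lines."""
        for raw in lines:
            line = raw.strip()
            # Only "data:" lines carry payloads; skip blank lines and comments.
            if not line.startswith("data:"):
                continue
            payload = line[len("data:"):].strip()
            if payload == "[DONE]":  # sentinel that ends the stream
                break
            event = json.loads(payload)
            for choice in event.get("choices", []):
                text = (choice.get("delta") or {}).get("content")
                if text:
                    yield text

The function takes any iterable of decoded lines, so it can be fed from whichever HTTP client you use to read the response body.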
|
||||
|
||||
**Anthropic-Compatible Endpoint:**
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
# Basic request
|
||||
curl -X POST http://127.0.0.1:12000/v1/messages \
|
||||
-H "Content-Type: application/json" \
|
||||
-H "x-api-key: test-key" \
|
||||
-H "anthropic-version: 2023-06-01" \
|
||||
-d '{
|
||||
"model": "claude-3-5-sonnet-20241022",
|
||||
"max_tokens": 50,
|
||||
"messages": [
|
||||
{"role": "user", "content": "Hello Claude!"}
|
||||
]
|
||||
}'
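
The same request can be assembled without any SDK. The following sketch mirrors the cURL call above using only the Python standard library. ``build_messages_request`` and ``GATEWAY`` are names made up for this example, and the request is built but not sent, so it runs without a gateway.

.. code-block:: python

    import json
    import urllib.request

    GATEWAY = "http://127.0.0.1:12000"  # local gateway address used on this page


    def build_messages_request(model, prompt, max_tokens=50, api_key="test-key"):
        """Build (without sending) a POST for the Anthropic-compatible endpoint."""
        body = {
            "model": model,
            "max_tokens": max_tokens,
            "messages": [{"role": "user", "content": prompt}],
        }
        headers = {
            "Content-Type": "application/json",
            "x-api-key": api_key,
            "anthropic-version": "2023-06-01",
        }
        return urllib.request.Request(
            f"{GATEWAY}/v1/messages",
            data=json.dumps(body).encode("utf-8"),
            headers=headers,
            method="POST",
        )

Passing the returned object to ``urllib.request.urlopen`` sends it; that step needs a running gateway.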
|
||||
|
||||
Cross-Client Compatibility
|
||||
--------------------------
|
||||
|
||||
One of Arch's key features is cross-client compatibility. You can:
|
||||
|
||||
**Use OpenAI SDK with Claude Models:**
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
# OpenAI client calling Claude model
|
||||
from openai import OpenAI
|
||||
|
||||
client = OpenAI(base_url="http://127.0.0.1:12000/v1", api_key="test")
|
||||
|
||||
response = client.chat.completions.create(
|
||||
model="claude-3-5-sonnet-20241022", # Claude model
|
||||
messages=[{"role": "user", "content": "Hello"}]
|
||||
)
|
||||
|
||||
**Use Anthropic SDK with OpenAI Models:**
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
# Anthropic client calling OpenAI model
|
||||
import anthropic
|
||||
|
||||
client = anthropic.Anthropic(base_url="http://127.0.0.1:12000", api_key="test")
|
||||
|
||||
response = client.messages.create(
|
||||
model="gpt-4o-mini", # OpenAI model
|
||||
max_tokens=50,
|
||||
messages=[{"role": "user", "content": "Hello"}]
|
||||
)
|
||||
|
||||
**Mix and Match with** :ref:`Model Aliases <model_aliases>`:
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
from openai import OpenAI

# Same code works with different underlying models
|
||||
def ask_question(client, question):
|
||||
return client.chat.completions.create(
|
||||
model="reasoning-model", # Alias could point to any provider
|
||||
messages=[{"role": "user", "content": question}]
|
||||
)
|
||||
|
||||
# Works regardless of what "reasoning-model" actually points to
|
||||
openai_client = OpenAI(base_url="http://127.0.0.1:12000/v1", api_key="test")
|
||||
response = ask_question(openai_client, "Solve this math problem...")
|
||||
|
||||
Error Handling
|
||||
--------------
|
||||
|
||||
**OpenAI SDK Error Handling:**
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
from openai import OpenAI
|
||||
import openai
|
||||
|
||||
client = OpenAI(base_url="http://127.0.0.1:12000/v1", api_key="test")
|
||||
|
||||
try:
|
||||
completion = client.chat.completions.create(
|
||||
model="nonexistent-model",
|
||||
messages=[{"role": "user", "content": "Hello"}]
|
||||
)
|
||||
except openai.NotFoundError as e:
|
||||
print(f"Model not found: {e}")
|
||||
except openai.APIError as e:
|
||||
print(f"API error: {e}")
|
||||
|
||||
**Anthropic SDK Error Handling:**
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
import anthropic
|
||||
|
||||
client = anthropic.Anthropic(base_url="http://127.0.0.1:12000", api_key="test")
|
||||
|
||||
try:
|
||||
message = client.messages.create(
|
||||
model="nonexistent-model",
|
||||
max_tokens=50,
|
||||
messages=[{"role": "user", "content": "Hello"}]
|
||||
)
|
||||
except anthropic.NotFoundError as e:
|
||||
print(f"Model not found: {e}")
|
||||
except anthropic.APIError as e:
|
||||
print(f"API error: {e}")
|
||||
|
||||
Best Practices
|
||||
--------------
|
||||
|
||||
**Use** :ref:`Model Aliases <model_aliases>`:
|
||||
Instead of hardcoding provider-specific model names, use semantic aliases:
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
# Good - uses semantic alias
|
||||
model = "fast-model"
|
||||
|
||||
# Less ideal - hardcoded provider model
|
||||
model = "openai/gpt-4o-mini"
|
||||
|
||||
**Environment-Based Configuration:**
|
||||
Use different :ref:`model aliases <model_aliases>` for different environments:
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
import os
|
||||
|
||||
# Development uses cheaper/faster models
|
||||
model = os.getenv("MODEL_ALIAS", "dev.chat.v1")
|
||||
|
||||
response = client.chat.completions.create(
|
||||
model=model,
|
||||
messages=[{"role": "user", "content": "Hello"}]
|
||||
)
|
||||
|
||||
**Graceful Fallbacks:**
|
||||
Implement fallback logic for better reliability:
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
def chat_with_fallback(client, messages, primary_model="smart-model", fallback_model="fast-model"):
|
||||
try:
|
||||
return client.chat.completions.create(model=primary_model, messages=messages)
|
||||
except Exception as e:
|
||||
print(f"Primary model failed, trying fallback: {e}")
|
||||
return client.chat.completions.create(model=fallback_model, messages=messages)
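
The two-model fallback above can be generalized to an ordered list of models. This is a minimal sketch, not a feature of Arch or of either SDK: ``first_successful`` is a hypothetical helper, and it catches ``Exception`` broadly only to stay short.

.. code-block:: python

    from typing import Callable, Sequence


    def first_successful(models: Sequence[str], call: Callable[[str], object]):
        """Try each model in order and return the first successful result."""
        last_error = None
        for model in models:
            try:
                return call(model)
            except Exception as exc:  # narrow to SDK error types in real code
                last_error = exc
        if last_error is None:
            raise ValueError("models must not be empty")
        # Every model failed: surface the most recent error to the caller.
        raise last_error

A call might look like ``first_successful(["smart-model", "fast-model"], lambda m: client.chat.completions.create(model=m, messages=messages))``.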
|
||||
|
||||
See Also
|
||||
--------
|
||||
|
||||
- :ref:`supported_providers` - Configure your providers and see available models
|
||||
- :ref:`model_aliases` - Create semantic model names
|
||||
- :ref:`llm_router` - Intelligent routing capabilities
|
||||
94
docs/source/concepts/llm_providers/llm_providers.rst
Normal file
|
|
@ -0,0 +1,94 @@
|
|||
.. _llm_providers:
|
||||
|
||||
LLM Providers
|
||||
=============
|
||||
**LLM Providers** are a top-level primitive in Arch, helping developers centrally define, secure, observe,
|
||||
and manage the usage of their LLMs. Arch builds on Envoy's reliable `cluster subsystem <https://www.envoyproxy.io/docs/envoy/v1.31.2/intro/arch_overview/upstream/cluster_manager>`_
|
||||
to manage egress traffic to LLMs, which includes intelligent routing, retry and fail-over mechanisms,
|
||||
ensuring high availability and fault tolerance. This abstraction also enables developers to seamlessly
|
||||
switch between LLM providers or upgrade LLM versions, simplifying the integration and scaling of LLMs
|
||||
across applications.
|
||||
|
||||
Arch lets you connect to 11+ AI providers through a unified interface with advanced routing and management capabilities.
|
||||
Whether you're using OpenAI, Anthropic, Azure OpenAI, local Ollama models, or any OpenAI-compatible provider, Arch provides seamless integration with enterprise-grade features.
|
||||
|
||||
Core Capabilities
|
||||
-----------------
|
||||
|
||||
**Multi-Provider Support**
|
||||
Connect to any combination of providers simultaneously:
|
||||
|
||||
- **First-Class Providers**: Native integrations with OpenAI, Anthropic, DeepSeek, Mistral, Groq, Google Gemini, Together AI, xAI, Azure OpenAI, and Ollama
|
||||
- **OpenAI-Compatible Providers**: Support for any provider implementing OpenAI's API interface
|
||||
|
||||
**Intelligent Routing**
|
||||
Two powerful routing approaches to optimize model selection:
|
||||
|
||||
- **Static Model Selection**: Direct routing using provider names or semantic model aliases
|
||||
- **Preference-Aligned Dynamic Routing**: Intelligent, context-aware routing using the Arch-Router model that analyzes prompts and selects optimal models based on domain and action preferences
|
||||
|
||||
**Model Aliases & Management**
|
||||
Create semantic, version-controlled names for simplified model management:
|
||||
|
||||
- **Semantic Naming**: Use descriptive names like ``fast-model``, ``reasoning-model``, or ``arch.summarize.v1``
|
||||
- **Environment Management**: Different aliases for dev/staging/production environments
|
||||
- **Version Control**: Implement versioning schemes for gradual model upgrades
|
||||
- **Future Features**: Planned support for guardrails, fallback chains, and traffic splitting
|
||||
|
||||
**Unified Client Interface**
|
||||
Use your preferred client library without changing existing code:
|
||||
|
||||
- **OpenAI Python SDK**: Full compatibility with all providers
|
||||
- **Anthropic Python SDK**: Native support with cross-provider capabilities
|
||||
- **cURL & HTTP Clients**: Direct REST API access for any programming language
|
||||
- **Custom Integrations**: Standard HTTP interfaces for seamless integration
|
||||
|
||||
Key Benefits
|
||||
------------
|
||||
|
||||
- **Provider Flexibility**: Switch between providers without changing client code
|
||||
- **Intelligent Routing**: Automatically select the best model for each request
|
||||
- **Cost Optimization**: Route requests to cost-effective models based on complexity
|
||||
- **Performance Optimization**: Use fast models for simple tasks, powerful models for complex reasoning
|
||||
- **Environment Management**: Configure different models for different environments
|
||||
- **Future-Proof**: Easy to add new providers and upgrade models
|
||||
|
||||
Getting Started
|
||||
---------------
|
||||
Dive into specific areas based on your needs:
|
||||
|
||||
.. toctree::
|
||||
:maxdepth: 2
|
||||
|
||||
supported_providers
|
||||
client_libraries
|
||||
model_aliases
|
||||
|
||||
**Advanced Features**
|
||||
- :ref:`llm_router`: Learn about preference-aligned dynamic routing and intelligent model selection
|
||||
|
||||
Common Use Cases
|
||||
----------------
|
||||
|
||||
**Development Teams**
|
||||
- Use aliases like ``dev.chat.v1`` and ``prod.chat.v1`` for environment-specific models
|
||||
- Route simple queries to fast/cheap models, complex tasks to powerful models
|
||||
- Test new models safely using canary deployments (coming soon)
|
||||
|
||||
**Production Applications**
|
||||
- Implement fallback strategies across multiple providers for reliability
|
||||
- Use intelligent routing to optimize cost and performance automatically
|
||||
- Monitor usage patterns and model performance across providers
|
||||
|
||||
**Enterprise Deployments**
|
||||
- Connect to both cloud providers and on-premises models (Ollama, custom deployments)
|
||||
- Apply consistent security and governance policies across all providers
|
||||
- Scale across regions using different provider endpoints
|
||||
|
||||
Next Steps
|
||||
----------
|
||||
|
||||
1. :ref:`supported_providers` - See all supported providers, models, and configuration examples
|
||||
2. :ref:`client_libraries` - Start using Arch with your preferred client
|
||||
3. :ref:`model_aliases` - Create semantic model names
|
||||
4. :ref:`llm_router` - Set up intelligent routing
|
||||
254
docs/source/concepts/llm_providers/model_aliases.rst
Normal file
|
|
@ -0,0 +1,254 @@
|
|||
.. _model_aliases:
|
||||
|
||||
Model Aliases
|
||||
=============
|
||||
|
||||
Model aliases provide semantic, version-controlled names for your models, enabling cleaner client code, easier model management, and advanced routing capabilities. Instead of using provider-specific model names like ``gpt-4o-mini`` or ``claude-3-5-sonnet-20241022``, you can create meaningful aliases like ``fast-model`` or ``arch.summarize.v1``.
|
||||
|
||||
**Benefits of Model Aliases:**
|
||||
|
||||
- **Semantic Naming**: Use descriptive names that reflect the model's purpose
|
||||
- **Version Control**: Implement versioning schemes (e.g., ``v1``, ``v2``) for model upgrades
|
||||
- **Environment Management**: Different aliases can point to different models across environments
|
||||
- **Client Simplification**: Clients use consistent, meaningful names regardless of underlying provider
|
||||
- **Advanced Routing (Coming Soon)**: Enable guardrails, fallbacks, and traffic splitting at the alias level
|
||||
|
||||
Basic Configuration
|
||||
-------------------
|
||||
|
||||
**Simple Alias Mapping**
|
||||
|
||||
.. code-block:: yaml
|
||||
:caption: Basic Model Aliases
|
||||
|
||||
llm_providers:
|
||||
- model: openai/gpt-4o-mini
|
||||
access_key: $OPENAI_API_KEY
|
||||
|
||||
- model: openai/gpt-4o
|
||||
access_key: $OPENAI_API_KEY
|
||||
|
||||
- model: anthropic/claude-3-5-sonnet-20241022
|
||||
access_key: $ANTHROPIC_API_KEY
|
||||
|
||||
- model: ollama/llama3.1
|
||||
base_url: http://host.docker.internal:11434
|
||||
|
||||
# Define aliases that map to the models above
|
||||
model_aliases:
|
||||
# Semantic versioning approach
|
||||
arch.summarize.v1:
|
||||
target: gpt-4o-mini
|
||||
|
||||
arch.reasoning.v1:
|
||||
target: gpt-4o
|
||||
|
||||
arch.creative.v1:
|
||||
target: claude-3-5-sonnet-20241022
|
||||
|
||||
# Functional aliases
|
||||
fast-model:
|
||||
target: gpt-4o-mini
|
||||
|
||||
smart-model:
|
||||
target: gpt-4o
|
||||
|
||||
creative-model:
|
||||
target: claude-3-5-sonnet-20241022
|
||||
|
||||
# Local model alias
|
||||
local-chat:
|
||||
target: llama3.1
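
Conceptually, an alias is a lookup from a semantic name to a target model. The sketch below illustrates that idea with a few entries from the configuration above. It is a mental model only, not Arch's implementation; ``MODEL_ALIASES`` and ``resolve_model`` are hypothetical names.

.. code-block:: python

    # A subset of the alias -> target pairs from the configuration above.
    MODEL_ALIASES = {
        "arch.summarize.v1": "gpt-4o-mini",
        "arch.reasoning.v1": "gpt-4o",
        "fast-model": "gpt-4o-mini",
        "local-chat": "llama3.1",
    }


    def resolve_model(name: str, aliases: dict = MODEL_ALIASES) -> str:
        """Return the alias target, or the name itself if it is not an alias."""
        return aliases.get(name, name)

Under this model, ``resolve_model("fast-model")`` gives ``"gpt-4o-mini"``, while a plain model name such as ``"gpt-4o"`` passes through unchanged.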
|
||||
|
||||
Using Aliases
|
||||
-------------
|
||||
|
||||
**Client Code Examples**
|
||||
|
||||
Once aliases are configured, clients can use semantic names instead of provider-specific model names:
|
||||
|
||||
.. code-block:: python
|
||||
:caption: Python Client Usage
|
||||
|
||||
from openai import OpenAI
|
||||
|
||||
client = OpenAI(base_url="http://127.0.0.1:12000/v1", api_key="test-key")
|
||||
|
||||
# Use semantic alias instead of provider model name
|
||||
response = client.chat.completions.create(
|
||||
model="arch.summarize.v1", # Points to gpt-4o-mini
|
||||
messages=[{"role": "user", "content": "Summarize this document..."}]
|
||||
)
|
||||
|
||||
# Switch to a different capability
|
||||
response = client.chat.completions.create(
|
||||
model="arch.reasoning.v1", # Points to gpt-4o
|
||||
messages=[{"role": "user", "content": "Solve this complex problem..."}]
|
||||
)
|
||||
|
||||
.. code-block:: bash
|
||||
:caption: cURL Example
|
||||
|
||||
curl -X POST http://127.0.0.1:12000/v1/chat/completions \
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{
|
||||
"model": "fast-model",
|
||||
"messages": [{"role": "user", "content": "Hello!"}]
|
||||
}'
|
||||
|
||||
Naming Best Practices
|
||||
---------------------
|
||||
|
||||
**Semantic Versioning**
|
||||
|
||||
Use version numbers for backward compatibility and gradual model upgrades:
|
||||
|
||||
.. code-block:: yaml
|
||||
|
||||
model_aliases:
|
||||
# Current production version
|
||||
arch.summarize.v1:
|
||||
target: gpt-4o-mini
|
||||
|
||||
# Beta version for testing
|
||||
arch.summarize.v2:
|
||||
target: gpt-4o
|
||||
|
||||
# Stable alias that tracks the current production version
|
||||
arch.summarize.latest:
|
||||
target: gpt-4o-mini
|
||||
|
||||
**Purpose-Based Naming**
|
||||
|
||||
Create aliases that reflect the intended use case:
|
||||
|
||||
.. code-block:: yaml
|
||||
|
||||
model_aliases:
|
||||
# Task-specific
|
||||
code-reviewer:
|
||||
target: gpt-4o
|
||||
|
||||
document-summarizer:
|
||||
target: gpt-4o-mini
|
||||
|
||||
creative-writer:
|
||||
target: claude-3-5-sonnet-20241022
|
||||
|
||||
data-analyst:
|
||||
target: gpt-4o
|
||||
|
||||
**Environment-Specific Aliases**
|
||||
|
||||
Different environments can use different underlying models:
|
||||
|
||||
.. code-block:: yaml
|
||||
|
||||
model_aliases:
|
||||
# Development environment - use faster/cheaper models
|
||||
dev.chat.v1:
|
||||
target: gpt-4o-mini
|
||||
|
||||
# Production environment - use more capable models
|
||||
prod.chat.v1:
|
||||
target: gpt-4o
|
||||
|
||||
# Staging environment - test new models
|
||||
staging.chat.v1:
|
||||
target: claude-3-5-sonnet-20241022
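
On the client side, the alias can then be derived from the deployment environment. The following sketch assumes an environment variable named ``APP_ENV`` (a hypothetical name chosen for this example) and the ``<env>.chat.v1`` aliases defined above.

.. code-block:: python

    import os


    def chat_alias_for_env(default_env: str = "dev") -> str:
        """Build an alias such as 'dev.chat.v1' from the APP_ENV variable."""
        env = os.getenv("APP_ENV", default_env)
        # Reject values that have no matching alias in the configuration.
        if env not in {"dev", "staging", "prod"}:
            raise ValueError(f"unknown environment: {env!r}")
        return f"{env}.chat.v1"

The returned string can be passed as the ``model`` argument of any client call, so the same application code runs unchanged in every environment.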
|
||||
|
||||
Advanced Features (Coming Soon)
|
||||
--------------------------------
|
||||
|
||||
The following features are planned for future releases of model aliases:
|
||||
|
||||
**Guardrails Integration**
|
||||
|
||||
Apply safety, cost, or latency rules at the alias level:
|
||||
|
||||
.. code-block:: yaml
|
||||
:caption: Future Feature - Guardrails
|
||||
|
||||
model_aliases:
|
||||
arch.reasoning.v1:
|
||||
target: gpt-oss-120b
|
||||
guardrails:
|
||||
max_latency: 5s
|
||||
max_cost_per_request: 0.10
|
||||
block_categories: ["jailbreak", "PII"]
|
||||
content_filters:
|
||||
- type: "profanity"
|
||||
- type: "sensitive_data"
|
||||
|
||||
**Fallback Chains**
|
||||
|
||||
Provide a chain of models if the primary target fails or hits quota limits:
|
||||
|
||||
.. code-block:: yaml
|
||||
:caption: Future Feature - Fallbacks
|
||||
|
||||
model_aliases:
|
||||
arch.summarize.v1:
|
||||
target: gpt-4o-mini
|
||||
fallbacks:
|
||||
- target: llama3.1
|
||||
conditions: ["quota_exceeded", "timeout"]
|
||||
- target: claude-3-haiku-20240307
|
||||
conditions: ["primary_and_first_fallback_failed"]
|
||||
|
||||
**Traffic Splitting & Canary Deployments**
|
||||
|
||||
Distribute traffic across multiple models for A/B testing or gradual rollouts:
|
||||
|
||||
.. code-block:: yaml
|
||||
:caption: Future Feature - Traffic Splitting
|
||||
|
||||
model_aliases:
|
||||
arch.v1:
|
||||
targets:
|
||||
- model: llama3.1
|
||||
weight: 80
|
||||
- model: gpt-4o-mini
|
||||
weight: 20
|
||||
|
||||
# Canary deployment
|
||||
arch.experimental.v1:
|
||||
targets:
|
||||
- model: gpt-4o # Current stable
|
||||
weight: 95
|
||||
- model: o1-preview # New model being tested
|
||||
weight: 5
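
To make the meaning of ``weight`` concrete, the sketch below picks a target with probability proportional to its weight. It only illustrates weighted selection in general. Traffic splitting is a planned feature, so nothing here describes how Arch will implement it, and ``TARGETS`` and ``pick_target`` are hypothetical names.

.. code-block:: python

    import random

    # (model, weight) pairs taken from the canary example above.
    TARGETS = [("gpt-4o", 95), ("o1-preview", 5)]


    def pick_target(targets, rng=random):
        """Pick one model name with probability proportional to its weight."""
        models = [model for model, _ in targets]
        weights = [weight for _, weight in targets]
        return rng.choices(models, weights=weights, k=1)[0]

Over many requests, roughly 95 out of every 100 calls to ``pick_target(TARGETS)`` would select ``gpt-4o``.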
|
||||
|
||||
**Load Balancing**
|
||||
|
||||
Distribute requests across multiple instances of the same model:
|
||||
|
||||
.. code-block:: yaml
|
||||
:caption: Future Feature - Load Balancing
|
||||
|
||||
model_aliases:
|
||||
high-throughput-chat:
|
||||
load_balance:
|
||||
algorithm: "round_robin" # or "least_connections", "weighted"
|
||||
targets:
|
||||
- model: gpt-4o-mini
|
||||
endpoint: "https://api-1.example.com"
|
||||
- model: gpt-4o-mini
|
||||
endpoint: "https://api-2.example.com"
|
||||
- model: gpt-4o-mini
|
||||
endpoint: "https://api-3.example.com"
|
||||
|
||||
|
||||
Validation Rules
|
||||
----------------
|
||||
|
||||
- Alias names must be valid identifiers (alphanumeric, dots, hyphens, underscores)
|
||||
- Target models must be defined in the ``llm_providers`` section
|
||||
- Circular references between aliases are not allowed
|
||||
- Weights in traffic splitting must sum to 100
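
The rules above can be checked before deploying a configuration. The sketch below is one possible pre-flight check written for this page, not Arch's own validator. It makes two assumptions: targets are compared against model names without the provider prefix (as in ``target: gpt-4o-mini``), and an alias may point at another alias as long as no cycle results.

.. code-block:: python

    import re

    ALIAS_NAME = re.compile(r"[A-Za-z0-9._-]+")


    def _targets(spec):
        """Return the target names of one alias entry."""
        if "targets" in spec:  # traffic-splitting form (planned feature)
            return [t["model"] for t in spec["targets"]]
        return [spec["target"]]


    def _reaches(aliases, start, goal):
        """Follow alias-to-alias links from start; True if goal is reached."""
        seen, stack = set(), [start]
        while stack:
            current = stack.pop()
            if current == goal:
                return True
            if current in seen or current not in aliases:
                continue
            seen.add(current)
            stack.extend(_targets(aliases[current]))
        return False


    def validate_aliases(aliases, provider_models):
        """Return a list of problems; an empty list means the config is valid."""
        problems = []
        for name, spec in aliases.items():
            if not ALIAS_NAME.fullmatch(name):
                problems.append(f"{name}: invalid alias name")
            if "targets" in spec:
                total = sum(t["weight"] for t in spec["targets"])
                if total != 100:
                    problems.append(f"{name}: weights sum to {total}, expected 100")
            for target in _targets(spec):
                if _reaches(aliases, target, name):
                    problems.append(f"{name}: circular reference via {target}")
                elif target not in aliases and target not in provider_models:
                    problems.append(f"{name}: unknown target {target}")
        return problems

For example, ``validate_aliases({"fast-model": {"target": "gpt-4o-mini"}}, {"gpt-4o-mini"})`` returns an empty list.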
|
||||
|
||||
See Also
|
||||
--------
|
||||
|
||||
- :ref:`llm_providers` - Learn about configuring LLM providers
|
||||
- :ref:`llm_router` - Understand how aliases work with intelligent routing
|
||||
551
docs/source/concepts/llm_providers/supported_providers.rst
Normal file
|
|
@ -0,0 +1,551 @@
|
|||
.. _supported_providers:
|
||||
|
||||
Supported Providers & Configuration
|
||||
===================================
|
||||
|
||||
Arch provides first-class support for multiple LLM providers through native integrations and OpenAI-compatible interfaces. This comprehensive guide covers all supported providers, their available chat models, and detailed configuration instructions.
|
||||
|
||||
.. note::
|
||||
**Model Support:** Arch supports all chat models from each provider, not just the examples shown in this guide. The configurations below demonstrate common models for reference, but you can use any chat model available from your chosen provider.
|
||||
|
||||
Configuration Structure
|
||||
-----------------------
|
||||
|
||||
All providers are configured in the ``llm_providers`` section of your ``arch_config.yaml`` file:
|
||||
|
||||
.. code-block:: yaml
|
||||
|
||||
version: v0.1
|
||||
|
||||
listeners:
|
||||
egress_traffic:
|
||||
address: 0.0.0.0
|
||||
port: 12000
|
||||
message_format: openai
|
||||
timeout: 30s
|
||||
|
||||
llm_providers:
|
||||
# Provider configurations go here
|
||||
- model: provider/model-name
|
||||
access_key: $API_KEY
|
||||
# Additional provider-specific options
|
||||
|
||||
**Common Configuration Fields:**
|
||||
|
||||
- ``model``: Provider prefix and model name (format: ``provider/model-name``)
|
||||
- ``access_key``: API key for authentication (supports environment variables)
|
||||
- ``default``: Mark a model as the default (optional, boolean)
|
||||
- ``name``: Custom name for the provider instance (optional)
|
||||
- ``base_url``: Custom endpoint URL (required for some providers)
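
The ``provider/model-name`` format can be split at the first slash. The sketch below shows that split in Python. It illustrates the naming convention only, is not code taken from Arch, and ``split_model_id`` is a hypothetical helper name.

.. code-block:: python

    def split_model_id(model_id: str) -> tuple:
        """Split 'provider/model-name' into (provider, model_name)."""
        # Split at the first slash only, so model names that themselves
        # contain a slash stay intact.
        provider, sep, model_name = model_id.partition("/")
        if not sep or not provider or not model_name:
            raise ValueError(f"expected 'provider/model-name', got {model_id!r}")
        return provider, model_name

For instance, ``split_model_id("openai/gpt-4o-mini")`` yields the provider ``openai`` and the model name ``gpt-4o-mini``.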
|
||||
|
||||
Provider Categories
|
||||
-------------------
|
||||
|
||||
**First-Class Providers**
|
||||
Native integrations with built-in support for provider-specific features and authentication.
|
||||
|
||||
**OpenAI-Compatible Providers**
|
||||
Any provider that implements the OpenAI API interface can be configured using custom endpoints.
|
||||
|
||||
Supported API Endpoints
|
||||
------------------------
|
||||
|
||||
Arch supports the following standardized endpoints across providers:
|
||||
|
||||
.. list-table::
|
||||
:header-rows: 1
|
||||
:widths: 30 30 40
|
||||
|
||||
* - Endpoint
|
||||
- Purpose
|
||||
- Supported Clients
|
||||
* - ``/v1/chat/completions``
|
||||
- OpenAI-style chat completions
|
||||
- OpenAI SDK, cURL, custom clients
|
||||
* - ``/v1/messages``
|
||||
- Anthropic-style messages
|
||||
- Anthropic SDK, cURL, custom clients
|
||||
|
||||
First-Class Providers
|
||||
---------------------
|
||||
|
||||
OpenAI
|
||||
~~~~~~
|
||||
|
||||
**Provider Prefix:** ``openai/``
|
||||
|
||||
**API Endpoint:** ``/v1/chat/completions``
|
||||
|
||||
**Authentication:** API Key - Get your OpenAI API key from `OpenAI Platform <https://platform.openai.com/api-keys>`_.
|
||||
|
||||
**Supported Chat Models:** All OpenAI chat models including GPT-5, GPT-4o, GPT-4, GPT-3.5-turbo, and all future releases.
|
||||
|
||||
.. list-table::
|
||||
:header-rows: 1
|
||||
:widths: 30 20 50
|
||||
|
||||
* - Model Name
|
||||
- Model ID for Config
|
||||
- Description
|
||||
* - GPT-5
|
||||
- ``openai/gpt-5``
|
||||
- Next-generation model (use any model name from OpenAI's API)
|
||||
* - GPT-4o
|
||||
- ``openai/gpt-4o``
|
||||
- Latest multimodal model
|
||||
* - GPT-4o mini
|
||||
- ``openai/gpt-4o-mini``
|
||||
- Fast, cost-effective model
|
||||
* - GPT-4
|
||||
- ``openai/gpt-4``
|
||||
- High-capability reasoning model
|
||||
* - GPT-3.5 Turbo
|
||||
- ``openai/gpt-3.5-turbo``
|
||||
- Balanced performance and cost
|
||||
* - o3-mini
|
||||
- ``openai/o3-mini``
|
||||
- Reasoning-focused model (preview)
|
||||
* - o3
|
||||
- ``openai/o3``
|
||||
- Advanced reasoning model (preview)
|
||||
|
||||
**Configuration Examples:**
|
||||
|
||||
.. code-block:: yaml
|
||||
|
||||
llm_providers:
|
||||
# Latest models (examples - use any OpenAI chat model)
|
||||
- model: openai/gpt-4o-mini
|
||||
access_key: $OPENAI_API_KEY
|
||||
default: true
|
||||
|
||||
- model: openai/gpt-4o
|
||||
access_key: $OPENAI_API_KEY
|
||||
|
||||
# Use any model name from OpenAI's API
|
||||
- model: openai/gpt-5
|
||||
access_key: $OPENAI_API_KEY
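
Because ``access_key`` values such as ``$OPENAI_API_KEY`` refer to environment variables, it can help to check that they are set before starting the gateway. The following is a rough pre-flight sketch, not how Arch itself resolves keys. The helper ``missing_access_keys`` and the ``$NAME`` matching rule are assumptions made for illustration.

.. code-block:: python

    import os
    import re

    # Matches values that consist solely of a "$NAME" reference.
    ENV_REF = re.compile(r"^\$([A-Za-z_][A-Za-z0-9_]*)$")


    def missing_access_keys(providers: list[dict], env=None) -> list[str]:
        """Return environment variable names referenced by access_key but unset."""
        env = os.environ if env is None else env
        missing: list[str] = []
        for provider in providers:
            match = ENV_REF.match(str(provider.get("access_key", "")))
            if match is None:
                continue  # literal key or no key at all
            name = match.group(1)
            if not env.get(name) and name not in missing:
                missing.append(name)
        return missing


    providers = [
        {"model": "openai/gpt-4o-mini", "access_key": "$OPENAI_API_KEY"},
        {"model": "openai/gpt-4o", "access_key": "$OPENAI_API_KEY"},
    ]
    print(missing_access_keys(providers, env={}))
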
|
||||
|
||||
Anthropic
|
||||
~~~~~~~~~
|
||||
|
||||
**Provider Prefix:** ``anthropic/``
|
||||
|
||||
**API Endpoint:** ``/v1/messages``
|
||||
|
||||
**Authentication:** API Key - Get your Anthropic API key from `Anthropic Console <https://console.anthropic.com/settings/keys>`_.
|
||||
|
||||
**Supported Chat Models:** All Anthropic Claude models including Claude Sonnet 4, Claude 3.5 Sonnet, Claude 3.5 Haiku, Claude 3 Opus, and all future releases.
|
||||
|
||||
.. list-table::
|
||||
:header-rows: 1
|
||||
:widths: 30 20 50
|
||||
|
||||
* - Model Name
|
||||
- Model ID for Config
|
||||
- Description
|
||||
* - Claude Sonnet 4
|
||||
- ``anthropic/claude-sonnet-4``
|
||||
- Next-generation model (use any model name from Anthropic's API)
|
||||
* - Claude 3.5 Sonnet
|
||||
- ``anthropic/claude-3-5-sonnet-20241022``
|
||||
- Latest high-performance model
|
||||
* - Claude 3.5 Haiku
|
||||
- ``anthropic/claude-3-5-haiku-20241022``
|
||||
- Fast and efficient model
|
||||
* - Claude 3 Opus
|
||||
- ``anthropic/claude-3-opus-20240229``
|
||||
- Most capable model for complex tasks
|
||||
* - Claude 3 Sonnet
|
||||
- ``anthropic/claude-3-sonnet-20240229``
|
||||
- Balanced performance model
|
||||
* - Claude 3 Haiku
|
||||
- ``anthropic/claude-3-haiku-20240307``
|
||||
- Fastest model
|
||||
|
||||
**Configuration Examples:**
|
||||
|
||||
.. code-block:: yaml
|
||||
|
||||
llm_providers:
|
||||
# Latest models (examples - use any Anthropic chat model)
|
||||
- model: anthropic/claude-3-5-sonnet-20241022
|
||||
access_key: $ANTHROPIC_API_KEY
|
||||
|
||||
- model: anthropic/claude-3-5-haiku-20241022
|
||||
access_key: $ANTHROPIC_API_KEY
|
||||
|
||||
# Use any model name from Anthropic's API
|
||||
- model: anthropic/claude-sonnet-4
|
||||
access_key: $ANTHROPIC_API_KEY
|
||||
|
||||
DeepSeek
|
||||
~~~~~~~~
|
||||
|
||||
**Provider Prefix:** ``deepseek/``
|
||||
|
||||
**API Endpoint:** ``/v1/chat/completions``
|
||||
|
||||
**Authentication:** API Key - Get your DeepSeek API key from `DeepSeek Platform <https://platform.deepseek.com/api_keys>`_.
|
||||
|
||||
**Supported Chat Models:** All DeepSeek chat models including DeepSeek-Chat, DeepSeek-Coder, and all future releases.
|
||||
|
||||
.. list-table::
|
||||
:header-rows: 1
|
||||
:widths: 30 20 50
|
||||
|
||||
* - Model Name
|
||||
- Model ID for Config
|
||||
- Description
|
||||
* - DeepSeek Chat
|
||||
- ``deepseek/deepseek-chat``
|
||||
- General purpose chat model
|
||||
* - DeepSeek Coder
|
||||
- ``deepseek/deepseek-coder``
|
||||
- Code-specialized model
|
||||
|
||||
**Configuration Examples:**
|
||||
|
||||
.. code-block:: yaml
|
||||
|
||||
llm_providers:
|
||||
- model: deepseek/deepseek-chat
|
||||
access_key: $DEEPSEEK_API_KEY
|
||||
|
||||
- model: deepseek/deepseek-coder
|
||||
access_key: $DEEPSEEK_API_KEY
|
||||
|
||||
Mistral AI
|
||||
~~~~~~~~~~
|
||||
|
||||
**Provider Prefix:** ``mistral/``
|
||||
|
||||
**API Endpoint:** ``/v1/chat/completions``
|
||||
|
||||
**Authentication:** API Key - Get your Mistral API key from `Mistral AI Console <https://console.mistral.ai/api-keys/>`_.
|
||||
|
||||
**Supported Chat Models:** All Mistral chat models including Mistral Large, Mistral Small, Ministral, and all future releases.
|
||||
|
||||
.. list-table::
|
||||
:header-rows: 1
|
||||
:widths: 30 20 50
|
||||
|
||||
* - Model Name
|
||||
- Model ID for Config
|
||||
- Description
|
||||
* - Mistral Large
|
||||
- ``mistral/mistral-large-latest``
|
||||
- Most capable model
|
||||
* - Mistral Medium
|
||||
- ``mistral/mistral-medium-latest``
|
||||
- Balanced performance
|
||||
* - Mistral Small
|
||||
- ``mistral/mistral-small-latest``
|
||||
- Fast and efficient
|
||||
* - Ministral 3B
|
||||
- ``mistral/ministral-3b-latest``
|
||||
- Compact model
|
||||
|
||||
**Configuration Examples:**
|
||||
|
||||
.. code-block:: yaml
|
||||
|
||||
llm_providers:
|
||||
- model: mistral/mistral-large-latest
|
||||
access_key: $MISTRAL_API_KEY
|
||||
|
||||
- model: mistral/mistral-small-latest
|
||||
access_key: $MISTRAL_API_KEY
|
||||
|
||||
Groq
|
||||
~~~~
|
||||
|
||||
**Provider Prefix:** ``groq/``
|
||||
|
||||
**API Endpoint:** ``/openai/v1/chat/completions`` (transformed internally)
|
||||
|
||||
**Authentication:** API Key - Get your Groq API key from `Groq Console <https://console.groq.com/keys>`_.
|
||||
|
||||
**Supported Chat Models:** All Groq chat models including Llama 3, Mixtral, Gemma, and all future releases.
|
||||
|
||||
.. list-table::
   :header-rows: 1
   :widths: 30 20 50

   * - Model Name
     - Model ID for Config
     - Description
   * - Llama 3 8B
     - ``groq/llama3-8b-8192``
     - Fast inference Llama model
   * - Llama 3 70B
     - ``groq/llama3-70b-8192``
     - Larger Llama model
   * - Mixtral 8x7B
     - ``groq/mixtral-8x7b-32768``
     - Mixture of experts model
|
||||
|
||||
**Configuration Examples:**
|
||||
|
||||
.. code-block:: yaml
|
||||
|
||||
llm_providers:
|
||||
- model: groq/llama3-8b-8192
|
||||
access_key: $GROQ_API_KEY
|
||||
|
||||
- model: groq/mixtral-8x7b-32768
|
||||
access_key: $GROQ_API_KEY
|
||||
|
||||
Google Gemini
|
||||
~~~~~~~~~~~~~
|
||||
|
||||
**Provider Prefix:** ``gemini/``
|
||||
|
||||
**API Endpoint:** ``/v1beta/openai/chat/completions`` (transformed internally)
|
||||
|
||||
**Authentication:** API Key - Get your Google AI API key from `Google AI Studio <https://aistudio.google.com/app/apikey>`_.
|
||||
|
||||
**Supported Chat Models:** All Google Gemini chat models including Gemini 1.5 Pro, Gemini 1.5 Flash, and all future releases.
|
||||
|
||||
.. list-table::
|
||||
:header-rows: 1
|
||||
:widths: 30 20 50
|
||||
|
||||
* - Model Name
|
||||
- Model ID for Config
|
||||
- Description
|
||||
* - Gemini 1.5 Pro
|
||||
- ``gemini/gemini-1.5-pro``
|
||||
- Advanced reasoning and creativity
|
||||
* - Gemini 1.5 Flash
|
||||
- ``gemini/gemini-1.5-flash``
|
||||
- Fast and efficient model
|
||||
|
||||
**Configuration Examples:**
|
||||
|
||||
.. code-block:: yaml
|
||||
|
||||
llm_providers:
|
||||
- model: gemini/gemini-1.5-pro
|
||||
access_key: $GOOGLE_API_KEY
|
||||
|
||||
- model: gemini/gemini-1.5-flash
|
||||
access_key: $GOOGLE_API_KEY
|
||||
|
||||
Together AI
|
||||
~~~~~~~~~~~
|
||||
|
||||
**Provider Prefix:** ``together_ai/``
|
||||
|
||||
**API Endpoint:** ``/v1/chat/completions``
|
||||
|
||||
**Authentication:** API Key - Get your Together AI API key from `Together AI Settings <https://api.together.xyz/settings/api-keys>`_.
|
||||
|
||||
**Supported Chat Models:** All Together AI chat models including Llama, CodeLlama, Mixtral, Qwen, and hundreds of other open-source models.
|
||||
|
||||
.. list-table::
|
||||
:header-rows: 1
|
||||
:widths: 30 20 50
|
||||
|
||||
* - Model Name
|
||||
- Model ID for Config
|
||||
- Description
|
||||
* - Meta Llama 2 7B
|
||||
- ``together_ai/meta-llama/Llama-2-7b-chat-hf``
|
||||
- Open source chat model
|
||||
* - Meta Llama 2 13B
|
||||
- ``together_ai/meta-llama/Llama-2-13b-chat-hf``
|
||||
- Larger open source model
|
||||
* - Code Llama 34B
|
||||
- ``together_ai/codellama/CodeLlama-34b-Instruct-hf``
|
||||
- Code-specialized model
|
||||
|
||||
**Configuration Examples:**
|
||||
|
||||
.. code-block:: yaml
|
||||
|
||||
llm_providers:
|
||||
- model: together_ai/meta-llama/Llama-2-7b-chat-hf
|
||||
access_key: $TOGETHER_API_KEY
|
||||
|
||||
- model: together_ai/codellama/CodeLlama-34b-Instruct-hf
|
||||
access_key: $TOGETHER_API_KEY
|
||||
|
||||
xAI
|
||||
~~~
|
||||
|
||||
**Provider Prefix:** ``xai/``
|
||||
|
||||
**API Endpoint:** ``/v1/chat/completions``
|
||||
|
||||
**Authentication:** API Key - Get your xAI API key from `xAI Console <https://console.x.ai/>`_.
|
||||
|
||||
**Supported Chat Models:** All xAI chat models including Grok Beta and all future releases.
|
||||
|
||||
.. list-table::
|
||||
:header-rows: 1
|
||||
:widths: 30 20 50
|
||||
|
||||
* - Model Name
|
||||
- Model ID for Config
|
||||
- Description
|
||||
* - Grok Beta
|
||||
- ``xai/grok-beta``
|
||||
- Conversational AI model
|
||||
|
||||
**Configuration Examples:**
|
||||
|
||||
.. code-block:: yaml
|
||||
|
||||
llm_providers:
|
||||
- model: xai/grok-beta
|
||||
access_key: $XAI_API_KEY
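
The provider sections above each list an upstream API endpoint. As a compact summary, the sketch below collects those paths in a lookup table keyed by provider prefix. The table and the helper ``upstream_path`` are illustrative only and say nothing about how Arch performs the transformation internally.

.. code-block:: python

    # Upstream chat paths as listed in the provider sections of this page.
    UPSTREAM_CHAT_PATHS = {
        "openai": "/v1/chat/completions",
        "anthropic": "/v1/messages",
        "deepseek": "/v1/chat/completions",
        "mistral": "/v1/chat/completions",
        "groq": "/openai/v1/chat/completions",
        "gemini": "/v1beta/openai/chat/completions",
        "together_ai": "/v1/chat/completions",
        "xai": "/v1/chat/completions",
    }


    def upstream_path(model_id: str) -> str:
        """Look up the documented upstream path for a 'provider/model' ID."""
        provider = model_id.split("/", 1)[0]
        if provider not in UPSTREAM_CHAT_PATHS:
            raise ValueError(f"no documented path for provider {provider!r}")
        return UPSTREAM_CHAT_PATHS[provider]


    print(upstream_path("groq/llama3-8b-8192"))
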
|
||||
|
||||
Providers Requiring Base URL
|
||||
----------------------------
|
||||
|
||||
Azure OpenAI
|
||||
~~~~~~~~~~~~
|
||||
|
||||
**Provider Prefix:** ``azure_openai/``
|
||||
|
||||
**API Endpoint:** ``/openai/deployments/{deployment-name}/chat/completions`` (constructed automatically)
|
||||
|
||||
**Authentication:** API Key + Base URL - Get your Azure OpenAI API key from `Azure Portal <https://portal.azure.com/>`_ → Your OpenAI Resource → Keys and Endpoint.
|
||||
|
||||
**Supported Chat Models:** All Azure OpenAI chat models, including GPT-4o, GPT-4, and GPT-3.5-turbo, that are deployed in your Azure subscription.
|
||||
|
||||
.. code-block:: yaml
|
||||
|
||||
llm_providers:
|
||||
# Single deployment
|
||||
- model: azure_openai/gpt-4o
|
||||
access_key: $AZURE_OPENAI_API_KEY
|
||||
base_url: https://your-resource.openai.azure.com
|
||||
|
||||
# Multiple deployments
|
||||
- model: azure_openai/gpt-4o-mini
|
||||
access_key: $AZURE_OPENAI_API_KEY
|
||||
base_url: https://your-resource.openai.azure.com
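
To illustrate the endpoint pattern ``/openai/deployments/{deployment-name}/chat/completions``, the sketch below joins a ``base_url`` with a deployment name. It assumes the deployment name is the part of the model ID after the ``azure_openai/`` prefix, which this page does not state explicitly. The helper ``azure_chat_url`` is hypothetical. Azure's REST API also typically expects an ``api-version`` query parameter, which this sketch leaves out.

.. code-block:: python

    from urllib.parse import quote


    def azure_chat_url(base_url: str, deployment_name: str) -> str:
        """Build the chat completions URL for one Azure OpenAI deployment."""
        # Percent-encode the deployment name so it is safe as a path segment.
        path = f"/openai/deployments/{quote(deployment_name, safe='')}/chat/completions"
        return base_url.rstrip("/") + path


    print(azure_chat_url("https://your-resource.openai.azure.com", "gpt-4o"))
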
|
||||
|
||||
Ollama
|
||||
~~~~~~
|
||||
|
||||
**Provider Prefix:** ``ollama/``
|
||||
|
||||
**API Endpoint:** ``/v1/chat/completions`` (Ollama's OpenAI-compatible endpoint)
|
||||
|
||||
**Authentication:** None (Base URL only) - Install Ollama from `Ollama.com <https://ollama.com/>`_ and pull your desired models.
|
||||
|
||||
**Supported Chat Models:** All chat models available in your local Ollama installation. Use ``ollama list`` to see installed models.
|
||||
|
||||
.. code-block:: yaml
|
||||
|
||||
llm_providers:
|
||||
# Local Ollama installation
|
||||
- model: ollama/llama3.1
|
||||
base_url: http://localhost:11434
|
||||
|
||||
# Ollama in Docker (from host)
|
||||
- model: ollama/codellama
|
||||
base_url: http://host.docker.internal:11434
|
||||
|
||||
OpenAI-Compatible Providers
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
**Supported Models:** Any chat models from providers that implement the OpenAI Chat Completions API standard.
|
||||
|
||||
For providers that implement the OpenAI API but aren't natively supported:
|
||||
|
||||
.. code-block:: yaml
|
||||
|
||||
llm_providers:
|
||||
# Generic OpenAI-compatible provider
|
||||
- model: custom-provider/custom-model
|
||||
base_url: https://api.customprovider.com
|
||||
provider_interface: openai
|
||||
access_key: $CUSTOM_API_KEY
|
||||
|
||||
# Local deployment
|
||||
- model: local/llama2-7b
|
||||
base_url: http://localhost:8000
|
||||
provider_interface: openai
|
||||
|
||||
Advanced Configuration
|
||||
----------------------
|
||||
|
||||
Multiple Provider Instances
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
Configure multiple instances of the same provider:
|
||||
|
||||
.. code-block:: yaml
|
||||
|
||||
llm_providers:
|
||||
# Production OpenAI
|
||||
- model: openai/gpt-4o
|
||||
access_key: $OPENAI_PROD_KEY
|
||||
name: openai-prod
|
||||
|
||||
# Development OpenAI (different key/quota)
|
||||
- model: openai/gpt-4o-mini
|
||||
access_key: $OPENAI_DEV_KEY
|
||||
name: openai-dev
|
||||
|
||||
Default Model Configuration
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
Mark one model as the default for fallback scenarios:
|
||||
|
||||
.. code-block:: yaml
|
||||
|
||||
llm_providers:
|
||||
- model: openai/gpt-4o-mini
|
||||
access_key: $OPENAI_API_KEY
|
||||
default: true # Used when no specific model is requested
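
A small sketch of the fallback idea: given the parsed ``llm_providers`` list, find the entry marked ``default: true``. The helper ``pick_default`` is hypothetical, and treating more than one default as an error is an assumption rather than documented Arch behavior.

.. code-block:: python

    from typing import Optional


    def pick_default(providers: list[dict]) -> Optional[str]:
        """Return the model ID marked as default, or None if there is none."""
        defaults = [p["model"] for p in providers if p.get("default") is True]
        if len(defaults) > 1:
            # Assumption: at most one provider should be the default.
            raise ValueError(f"multiple default models: {defaults}")
        return defaults[0] if defaults else None


    providers = [
        {"model": "openai/gpt-4o-mini", "default": True},
        {"model": "openai/gpt-4o"},
    ]
    print(pick_default(providers))
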
|
||||
|
||||
Routing Preferences
|
||||
~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
Configure routing preferences for dynamic model selection:
|
||||
|
||||
.. code-block:: yaml

    llm_providers:
      - model: openai/gpt-4o
        access_key: $OPENAI_API_KEY
        routing_preferences:
          - name: complex_reasoning
            description: deep analysis, mathematical problem solving, and logical reasoning
          - name: code_review
            description: reviewing and analyzing existing code for bugs and improvements

      - model: anthropic/claude-3-5-sonnet-20241022
        access_key: $ANTHROPIC_API_KEY
        routing_preferences:
          - name: creative_writing
            description: creative content generation, storytelling, and writing assistance
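
One way to read a ``routing_preferences`` configuration is as a mapping from route name to model. The sketch below builds that mapping from a parsed config. It only illustrates the data shape: the actual choice of route is made by the Arch-Router model, and both the helper ``preference_index`` and the requirement that route names be unique are assumptions.

.. code-block:: python

    def preference_index(providers: list[dict]) -> dict[str, str]:
        """Map each routing preference name to the model that declares it."""
        index: dict[str, str] = {}
        for provider in providers:
            for pref in provider.get("routing_preferences", []):
                name = pref["name"]
                if name in index:
                    # Assumption: a route name belongs to exactly one model.
                    raise ValueError(f"duplicate route name: {name}")
                index[name] = provider["model"]
        return index


    PROVIDERS = [
        {
            "model": "openai/gpt-4o",
            "routing_preferences": [
                {"name": "complex_reasoning", "description": "deep analysis"},
                {"name": "code_review", "description": "reviewing existing code"},
            ],
        },
        {
            "model": "anthropic/claude-3-5-sonnet-20241022",
            "routing_preferences": [
                {"name": "creative_writing", "description": "storytelling"},
            ],
        },
    ]

    index = preference_index(PROVIDERS)
    print(index["creative_writing"])
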
|
||||
|
||||
Model Selection Guidelines
|
||||
--------------------------
|
||||
|
||||
**For Production Applications:**
|
||||
- **High Performance**: OpenAI GPT-4o, Anthropic Claude 3.5 Sonnet
|
||||
- **Cost-Effective**: OpenAI GPT-4o mini, Anthropic Claude 3.5 Haiku
|
||||
- **Code Tasks**: DeepSeek Coder, Together AI Code Llama
|
||||
- **Local Deployment**: Ollama with Llama 3.1 or Code Llama
|
||||
|
||||
**For Development/Testing:**
|
||||
- **Fast Iteration**: Groq models (optimized inference)
|
||||
- **Local Testing**: Ollama models
|
||||
- **Cost Control**: Smaller models like GPT-4o mini or Mistral Small
|
||||
|
||||
See Also
|
||||
--------
|
||||
|
||||
- :ref:`client_libraries` - Using different client libraries with providers
|
||||
- :ref:`model_aliases` - Creating semantic model names
|
||||
- :ref:`llm_router` - Setting up intelligent routing
|
||||
|
|
@ -22,7 +22,7 @@ Upstream (Egress)
|
|||
Arch automatically configures a listener to route requests from your application to upstream LLM API providers (or hosts).
|
||||
When you start Arch, it creates a listener for egress traffic based on the presence of the ``listener`` configuration
|
||||
section in the configuration file. Arch binds itself to a local address such as ``127.0.0.1:12000/v1`` or a DNS-based
|
||||
address like ``arch.local:12000/v1`` for outgoing traffic. For more details on LLM providers, read :ref:`here <llm_provider>`.
|
||||
address like ``arch.local:12000/v1`` for outgoing traffic. For more details on LLM providers, read :ref:`here <llm_providers>`.
|
||||
|
||||
Configure Listener
|
||||
^^^^^^^^^^^^^^^^^^
|
||||
|
|
|
|||
|
|
@ -31,7 +31,7 @@ code to LLMs.
|
|||
|
||||
When you start Arch, you specify a listener address/port that you want to bind downstream. However, Arch uses a predefined port
|
||||
that you can use (``127.0.0.1:12000``) to proxy egress calls originating from your application to LLMs (API-based or hosted).
|
||||
For more details, check out :ref:`LLM provider <llm_provider>`.
|
||||
For more details, check out :ref:`LLM providers <llm_providers>`.
|
||||
|
||||
**Prompt Target**: Arch offers a primitive called :ref:`prompt target <prompt_target>` to help separate business logic from
|
||||
undifferentiated work in building generative AI apps. Prompt targets are endpoints that receive prompts that are processed by Arch.
|
||||
|
|
|
|||
|
|
@ -3,10 +3,9 @@
|
|||
|
||||
Overview
|
||||
============
|
||||
`Arch <https://github.com/katanemo/arch>`_ is a smart edge and AI gateway for AI-native apps - one that is natively designed to handle and process prompts, not just network traffic.
|
||||
|
||||
Built by contributors to the widely adopted `Envoy Proxy <https://www.envoyproxy.io/>`_, Arch handles the *pesky low-level work* in building agentic apps — like applying guardrails, clarifying vague user input, routing prompts to the right agent, and unifying access to any LLM. It’s a language and framework friendly infrastructure layer designed to help you build and ship agentic apps faster.
|
||||
`Arch <https://github.com/katanemo/arch>`_ is a smart edge and AI gateway for AI agents - one that is natively designed to handle and process prompts, not just network traffic.
|
||||
|
||||
Built by contributors to the widely adopted `Envoy Proxy <https://www.envoyproxy.io/>`_, Arch handles the *pesky low-level work* in building agentic apps — like applying guardrails, clarifying vague user input, routing prompts to the right agent, and unifying access to any LLM. It’s a protocol-friendly and framework-agnostic infrastructure layer designed to help you build and ship agentic apps faster.
|
||||
|
||||
In this documentation, you will learn how to quickly set up Arch to trigger API calls via prompts, apply prompt guardrails without writing any application-level logic,
|
||||
simplify the interaction with upstream LLMs, and improve observability all while simplifying your application development process.
|
||||
|
|
@ -53,8 +52,8 @@ Deep dive into essential ideas and mechanisms behind Arch:
|
|||
|
||||
Learn about the technology stack
|
||||
|
||||
.. grid-item-card:: :octicon:`webhook` LLM Provider
|
||||
:link: ../concepts/llm_provider.html
|
||||
.. grid-item-card:: :octicon:`webhook` LLM Providers
|
||||
:link: ../concepts/llm_providers/llm_providers.html
|
||||
|
||||
Explore Arch’s LLM integration options
|
||||
|
||||
|
|
|
|||
|
|
@ -134,7 +134,7 @@ It will automatically validate parameters, and ensure that the required paramete
|
|||
|
||||
|
||||
Once a downstream function (API) is called, Arch Gateway takes the response and sends it to an upstream LLM to complete the request (for summarization, Q/A, text generation tasks).
|
||||
For more details on how Arch Gateway enables you to centralize usage of LLMs, please read :ref:`LLM providers <llm_provider>`.
|
||||
For more details on how Arch Gateway enables you to centralize usage of LLMs, please read :ref:`LLM providers <llm_providers>`.
|
||||
|
||||
By completing these steps, you enable Arch to manage the process from validation to response, ensuring users receive consistent, reliable results - and that you are focused
|
||||
on the stuff that matters most.
|
||||
|
|
|
|||
|
|
@ -5,18 +5,67 @@ LLM Routing
|
|||
|
||||
With the rapid proliferation of large language models (LLMs), each optimized for different strengths, styles, or latency/cost profiles, routing has become an essential technique to operationalize the use of different models.
|
||||
|
||||
Arch Router is an intelligent routing system that automatically selects the most appropriate LLM for each user request based on user-defined usage preferences. Specifically Arch-Router guides model selection by matching queries to user-defined domains (e.g., finance and healthcare) and action types (e.g., code generation, image editing, etc.).
|
||||
Our preference-aligned approach matches practical definitions of performance in the real world and makes routing decisions more transparent and adaptable.
|
||||
Arch provides two distinct routing approaches to meet different use cases:
|
||||
|
||||
1. **Static Model Selection**: Direct routing to specific models based on provider configuration and model aliases
|
||||
2. **Preference-Aligned Dynamic Routing**: Intelligent routing using the Arch-Router model based on context and user-defined preferences
|
||||
|
||||
This enables optimal performance, cost efficiency, and response quality by matching requests with the most suitable model from your available LLM fleet.
|
||||
|
||||
|
||||
Routing Workflow
|
||||
-------------------------
|
||||
Routing Methods
|
||||
---------------
|
||||
|
||||
**Static Model Selection**
|
||||
|
||||
Static routing allows you to directly specify which model to use, either through:
|
||||
|
||||
- **Direct Model Names**: Use provider-specific names like ``openai/gpt-4o-mini``
|
||||
- **Model Aliases**: Use semantic names like ``fast-model`` or ``arch.summarize.v1`` (see :ref:`model_aliases`)
|
||||
|
||||
This approach is ideal when you know exactly which model you want to use for specific tasks or when implementing your own routing logic at the application level.
|
||||
|
||||
**Preference-Aligned Dynamic Routing (Arch-Router)**
|
||||
|
||||
Dynamic routing uses the Arch-Router model to automatically select the most appropriate LLM for each request based on:
|
||||
|
||||
- **Domain Analysis**: Identifies the subject matter (e.g., legal, healthcare, programming)
|
||||
- **Action Classification**: Determines the type of operation (e.g., summarization, code generation, translation)
|
||||
- **User-Defined Preferences**: Maps domains and actions to preferred models
|
||||
|
||||
This approach is ideal when you want intelligent, context-aware routing that adapts to the content and intent of each request.
|
||||
|
||||
|
||||
Static Model Selection Workflow
|
||||
--------------------------------
|
||||
|
||||
For static routing, the process is straightforward:
|
||||
|
||||
#. **Client Request**
|
||||
|
||||
The client specifies the exact model to use, either by provider name (``openai/gpt-4o``) or alias (``fast-model``).
|
||||
|
||||
#. **Model Resolution**
|
||||
|
||||
If using an alias, Arch resolves it to the actual provider model name.
|
||||
|
||||
#. **Direct Routing**
|
||||
|
||||
The request is sent directly to the specified model without analysis or decision-making.
|
||||
|
||||
#. **Response Handling**
|
||||
|
||||
The response is returned to the client with optional metadata about the routing decision.
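
The model resolution step above can be sketched as a dictionary lookup. The alias table mirrors the ``model_aliases`` shape used in the configuration examples (an alias name with a ``target`` key). The helper ``resolve_model`` is hypothetical and only approximates what the gateway does.

.. code-block:: python

    def resolve_model(requested: str, aliases: dict[str, dict]) -> str:
        """Return the alias target if 'requested' is an alias, else the name itself."""
        entry = aliases.get(requested)
        return entry["target"] if entry is not None else requested


    MODEL_ALIASES = {
        "fast-model": {"target": "gpt-4o-mini"},
        "smart-model": {"target": "gpt-4o"},
    }

    print(resolve_model("fast-model", MODEL_ALIASES))
    print(resolve_model("openai/gpt-4o", MODEL_ALIASES))
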
|
||||
|
||||
|
||||
Preference-Aligned Dynamic Routing Workflow (Arch-Router)
|
||||
---------------------------------------------------------
|
||||
|
||||
For preference-aligned dynamic routing, the process involves intelligent analysis:
|
||||
|
||||
#. **Prompt Analysis**
|
||||
|
||||
When a user submits a prompt, the Router analyzes it to determine the domain (subject matter) or action (type of operation requested).
|
||||
When a user submits a prompt without specifying a model, the Arch-Router analyzes it to determine the domain (subject matter) and action (type of operation requested).
|
||||
|
||||
#. **Model Selection**
|
||||
|
||||
|
|
@ -53,51 +102,146 @@ In summary, Arch-Router demonstrates:
|
|||
- **Production-Ready Performance**: Optimized for low-latency, high-throughput applications in multi-model environments.
|
||||
|
||||
|
||||
Implementing LLM Routing
|
||||
-----------------------------
|
||||
Implementing Routing
|
||||
--------------------
|
||||
|
||||
To configure LLM routing in our gateway, you need to define a prompt target configuration that specifies the routing model and the LLM providers. This configuration will allow Arch Gateway to route incoming prompts to the appropriate model based on the defined routes.
|
||||
|
||||
Below is an example to show how to set up a prompt target for the Arch Router:
|
||||
|
||||
- **Step 1: Define the routing model in the `routing` section**. You can use the `archgw-v1-router-model` as the katanemo routing model or any other routing model you prefer.
|
||||
|
||||
- **Step 2: Define the listeners in the `listeners` section**. This is where you specify the address and port for incoming traffic, as well as the message format (e.g., OpenAI).
|
||||
|
||||
- **Step 3: Define the LLM providers in the `llm_providers` section**. This is where you specify the routing model, and any other models you want to use for specific tasks and their route usage descriptions (e.g., code generation, code understanding).
|
||||
|
||||
.. Note::
|
||||
Make sure you define a model for default usage, such as `gpt-4o`, which will be used when no specific route is matched for an user prompt.
|
||||
**Static Model Selection**
|
||||
|
||||
For static routing, simply configure your LLM providers and optionally define model aliases:
|
||||
|
||||
.. code-block:: yaml
|
||||
:caption: Route Config Example
|
||||
|
||||
:caption: Static Routing Configuration
|
||||
|
||||
listeners:
|
||||
egress_traffic:
|
||||
egress_traffic:
|
||||
address: 0.0.0.0
|
||||
port: 12000
|
||||
message_format: openai
|
||||
timeout: 30s
|
||||
|
||||
llm_providers:
|
||||
- model: openai/gpt-4o-mini
|
||||
access_key: $OPENAI_API_KEY
|
||||
default: true
|
||||
|
||||
- model: openai/gpt-4o-mini
|
||||
access_key: $OPENAI_API_KEY
|
||||
default: true
|
||||
- model: openai/gpt-4o
|
||||
access_key: $OPENAI_API_KEY
|
||||
|
||||
- model: openai/gpt-4o
|
||||
access_key: $OPENAI_API_KEY
|
||||
routing_preferences:
|
||||
- name: code understanding
|
||||
description: understand and explain existing code snippets, functions, or libraries
|
||||
- model: anthropic/claude-3-5-sonnet-20241022
|
||||
access_key: $ANTHROPIC_API_KEY
|
||||
|
||||
- model: openai/gpt-4.1
|
||||
access_key: $OPENAI_API_KEY
|
||||
routing_preferences:
|
||||
- name: code generation
|
||||
description: generating new code snippets, functions, or boilerplate based on user prompts or requirements
|
||||
# Optional: Define aliases for easier client usage
|
||||
model_aliases:
|
||||
fast-model:
|
||||
target: gpt-4o-mini
|
||||
smart-model:
|
||||
target: gpt-4o
|
||||
creative-model:
|
||||
target: claude-3-5-sonnet-20241022
|
||||
|
||||
Clients can then specify models directly:
|
||||
|
||||
.. code-block:: python

    from openai import OpenAI

    # Assumptions: the gateway's egress listener is reachable at 127.0.0.1:12000,
    # and provider keys live in the gateway config, so api_key is a placeholder.
    client = OpenAI(base_url="http://127.0.0.1:12000/v1", api_key="--")

    # Using provider model names
    response = client.chat.completions.create(
        model="openai/gpt-4o-mini",
        messages=[{"role": "user", "content": "Hello!"}],
    )

    # Using aliases
    response = client.chat.completions.create(
        model="fast-model",
        messages=[{"role": "user", "content": "Hello!"}],
    )
|
||||
|
||||
**Preference-Aligned Dynamic Routing (Arch-Router)**
|
||||
|
||||
To configure preference-aligned dynamic routing, you need to define routing preferences that map domains and actions to specific models:
|
||||
|
||||
.. code-block:: yaml
    :caption: Preference-Aligned Dynamic Routing Configuration

    listeners:
      egress_traffic:
        address: 0.0.0.0
        port: 12000
        message_format: openai
        timeout: 30s

    llm_providers:
      - model: openai/gpt-4o-mini
        access_key: $OPENAI_API_KEY
        default: true

      - model: openai/gpt-4o
        access_key: $OPENAI_API_KEY
        routing_preferences:
          - name: code understanding
            description: understand and explain existing code snippets, functions, or libraries
          - name: complex reasoning
            description: deep analysis, mathematical problem solving, and logical reasoning

      - model: anthropic/claude-3-5-sonnet-20241022
        access_key: $ANTHROPIC_API_KEY
        routing_preferences:
          - name: creative writing
            description: creative content generation, storytelling, and writing assistance
          - name: code generation
            description: generating new code snippets, functions, or boilerplate based on user prompts
|
||||
|
||||
Clients can let the router decide or use aliases:
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
# Let Arch-Router choose based on content
|
||||
response = client.chat.completions.create(
|
||||
messages=[{"role": "user", "content": "Write a creative story about space exploration"}]
|
||||
# No model specified - router will analyze and choose claude-3-5-sonnet-20241022
|
||||
)
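
Recent versions of the OpenAI Python SDK appear to treat ``model`` as a required argument to ``chat.completions.create``, so a request with no model may need to be built by hand. The sketch below constructs, but does not send, such a request with the standard library. The gateway address is an assumption, and the helper ``routed_request`` is hypothetical.

.. code-block:: python

    import json
    import urllib.request


    def routed_request(prompt: str, gateway: str = "http://127.0.0.1:12000") -> urllib.request.Request:
        """Build a chat completion request that leaves model selection to the router."""
        # No "model" field: the router is expected to pick one from the content.
        body = json.dumps({"messages": [{"role": "user", "content": prompt}]})
        return urllib.request.Request(
            f"{gateway}/v1/chat/completions",
            data=body.encode("utf-8"),
            headers={"Content-Type": "application/json"},
            method="POST",
        )


    request = routed_request("Write a creative story about space exploration")
    print(request.full_url)
    # To send it against a running gateway: urllib.request.urlopen(request)
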
|
||||
|
||||
|
||||
Combining Routing Methods
|
||||
-------------------------
|
||||
|
||||
You can combine static model selection with dynamic routing preferences for maximum flexibility:
|
||||
|
||||
.. code-block:: yaml
    :caption: Hybrid Routing Configuration

    llm_providers:
      - model: openai/gpt-4o-mini
        access_key: $OPENAI_API_KEY
        default: true

      - model: openai/gpt-4o
        access_key: $OPENAI_API_KEY
        routing_preferences:
          - name: complex_reasoning
            description: deep analysis and complex problem solving

      - model: anthropic/claude-3-5-sonnet-20241022
        access_key: $ANTHROPIC_API_KEY
        routing_preferences:
          - name: creative_tasks
            description: creative writing and content generation

    model_aliases:
      # Static aliases for direct routing
      fast-model:
        target: gpt-4o-mini

      reasoning-model:
        target: gpt-4o

      # Aliases that can also participate in dynamic routing
      creative-model:
        target: claude-3-5-sonnet-20241022
|
||||
|
||||
This configuration allows clients to:
|
||||
|
||||
1. **Use direct model selection**: ``model="fast-model"``
|
||||
2. **Let the router decide**: No model specified, router analyzes content
|
||||
|
||||
Example Use Cases
|
||||
-------------------------
|
||||
|
|
@ -112,7 +256,7 @@ Here are common scenarios where Arch-Router excels:
|
|||
- **Conversational Routing**: Track conversation context to identify when topics shift between domains or when the type of assistance needed changes mid-conversation.
|
||||
|
||||
|
||||
Best practice
|
||||
Best practices
|
||||
-------------------------
|
||||
- **💡Consistent Naming:** Route names should align with their descriptions.
|
||||
|
||||
|
|
|
|||
|
|
@ -39,7 +39,7 @@ Built by contributors to the widely adopted `Envoy Proxy <https://www.envoyproxy
|
|||
:maxdepth: 2
|
||||
|
||||
concepts/tech_overview/tech_overview
|
||||
concepts/llm_provider
|
||||
concepts/llm_providers/llm_providers
|
||||
concepts/prompt_target
|
||||
|
||||
.. tab-item:: Guides
|
||||
|
|
|
|||