nomyo-router/doc/semantic-cache.md
alpha-nerd-nomyo fbdc73eebb fix: improvements, fixes and opt-in cache
doc: semantic-cache.md added with detailed write-up
2026-03-10 15:19:37 +01:00

# Semantic Cache

NOMYO Router includes a built-in LLM semantic cache that can dramatically reduce inference costs and latency by reusing previous responses — either via exact-match or by finding semantically equivalent questions.

## How It Works

Every incoming request is checked against the cache before being forwarded to an LLM endpoint. On a cache hit the stored response is returned immediately, bypassing inference entirely.

Cache isolation is strict: each combination of route + model + system prompt gets its own namespace (a 16-character SHA-256 prefix). A question asked under one system prompt can never leak into a different user's or route's namespace.
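The namespace derivation can be sketched as follows. This is a minimal illustration: the field order and separator are assumptions, not the router's exact key layout — only the "SHA-256, first 16 hex characters" part comes from the description above.

```python
import hashlib

def cache_namespace(route: str, model: str, system_prompt: str) -> str:
    """Derive a 16-character SHA-256 prefix for a route/model/system-prompt
    combination. Field order and separator are illustrative assumptions."""
    key = "\x1f".join([route, model, system_prompt])
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]
```

Because the system prompt is part of the hashed key, the same question asked under two different system prompts lands in two different namespaces.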

Two lookup strategies are available:

| Mode | How it matches | When to use |
|------|----------------|-------------|
| Exact match | Normalized text hash | When you need 100% identical answers and zero false positives |
| Semantic match | Cosine similarity of sentence embeddings | When questions are paraphrased or phrased differently |

In semantic mode the embedding is a weighted average of:

- The last user message (70% by default) — captures what is being asked
- A BM25-weighted summary of the chat history (30% by default) — provides conversation context

This means "What is the capital of France?" and "Can you tell me the capital city of France?" will hit the same cache entry, but a question in a different conversation context will not.
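The weighted combination itself can be sketched with plain numpy. The embeddings here would come from sentence-transformers in the real router; the renormalization to unit length is an assumption made so that cosine comparisons stay well-behaved.

```python
import numpy as np

def combined_query_vector(last_msg_emb: np.ndarray,
                          history_emb: np.ndarray,
                          history_weight: float = 0.3) -> np.ndarray:
    """Weighted mean of the last-user-message embedding (weight 0.7 by
    default) and the BM25-weighted history embedding (weight 0.3),
    renormalized to unit length (an illustrative assumption)."""
    v = (1.0 - history_weight) * last_msg_emb + history_weight * history_emb
    return v / np.linalg.norm(v)
```

With `history_weight: 0.0` only the last user message matters; raising it makes the same question embed differently in different conversations.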

## MoE Model Bypass

Models with names starting with `moe-` always bypass the cache entirely. Mixture-of-Experts models produce non-deterministic outputs by design and are not suitable for response caching.
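The bypass rule amounts to a simple prefix check (the function name here is hypothetical):

```python
def bypasses_cache(model: str) -> bool:
    # Mixture-of-Experts models are non-deterministic by design,
    # so any model whose name starts with "moe-" skips the cache.
    return model.startswith("moe-")
```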

## Privacy Protection

Responses that contain personally identifiable information (emails, UUIDs, numeric IDs ≥ 8 digits, or tokens extracted from `[Tags: identity]` lines in the system prompt) are stored without a semantic embedding. They remain reachable via exact match within the same user-specific namespace but are invisible to cross-user semantic search.

If a personalized response is generated with a generic (shared) system prompt, it is skipped entirely — never stored.
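The PII screen can be approximated with a few regular expressions. These patterns are illustrative only; they are not the router's exact rules.

```python
import re

# Illustrative patterns for the three PII classes named above:
# emails, UUIDs, and numeric IDs of 8 or more digits.
PII_PATTERNS = [
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),                    # email address
    re.compile(r"\b[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-"
               r"[0-9a-f]{4}-[0-9a-f]{12}\b", re.IGNORECASE),  # UUID
    re.compile(r"\b\d{8,}\b"),                                 # numeric ID >= 8 digits
]

def contains_pii(text: str) -> bool:
    """True if a response should be stored without a semantic embedding."""
    return any(p.search(text) for p in PII_PATTERNS)
```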


## Docker Image Variants

| Tag | Semantic cache | Approx. image size | Includes |
|-----|----------------|--------------------|----------|
| `latest` | No (exact match only) | ~300 MB | Base router only |
| `latest-semantic` | Yes | ~800 MB | sentence-transformers, torch, all-MiniLM-L6-v2 model baked in |

Use the `latest` image when you only need exact-match deduplication. It has no heavy ML dependencies.

Use `latest-semantic` when you want to catch paraphrased or semantically equivalent questions. The all-MiniLM-L6-v2 model is baked into the image at build time, so no internet access is needed at runtime.

Note: If you set `cache_similarity < 1.0` but use the `latest` image, the router falls back to exact-match caching and logs a warning. It will not error.
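The fallback decision can be sketched as a capability check. This is a minimal illustration; the router's actual logic and log message are assumptions.

```python
import importlib.util
import logging

def effective_similarity(configured: float) -> float:
    """Fall back to exact match (1.0) when semantic matching is requested
    but sentence-transformers is not installed. Sketch only."""
    if configured < 1.0 and importlib.util.find_spec("sentence_transformers") is None:
        logging.warning(
            "cache_similarity=%s requested but sentence-transformers is not "
            "installed; falling back to exact-match caching.", configured)
        return 1.0
    return configured
```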

Build locally:

```shell
# Lean image (exact match only)
docker build -t nomyo-router .

# Semantic image (~500 MB larger)
docker build --build-arg SEMANTIC_CACHE=true -t nomyo-router:semantic .
```

## Dependencies

### Lean image / bare-metal (exact match)

No extra dependencies beyond the base `requirements.txt`. The semantic-llm-cache library provides exact-match storage with a no-op embedding provider.

### Semantic image / bare-metal (semantic match)

Additional heavy packages are required:

| Package | Purpose | Approx. size |
|---------|---------|--------------|
| `sentence-transformers` | Sentence embedding model wrapper | ~100 MB |
| `torch` (CPU) | PyTorch inference backend | ~700 MB |
| `numpy` | Vector math for weighted mean embeddings | ~20 MB |

Install for bare-metal semantic mode:

```shell
# Install the CPU-only torch wheel first, then sentence-transformers from PyPI.
# (Pointing both packages at the PyTorch index would fail: that index does not
# host sentence-transformers.)
pip install torch --index-url https://download.pytorch.org/whl/cpu
pip install sentence-transformers
```

The all-MiniLM-L6-v2 model (~90 MB) is downloaded on first use (or baked in when using the `:semantic` Docker image).


## Configuration

All cache settings live in `config.yaml` and can be overridden with environment variables prefixed `NOMYO_ROUTER_`.

### Minimal — enable exact-match with in-memory backend

```yaml
# config.yaml
cache_enabled: true
```

### Full reference

```yaml
# config.yaml

# Master switch — set to true to enable the cache
cache_enabled: false

# Storage backend: "memory" | "sqlite" | "redis"
# memory  — fast, in-process, lost on restart
# sqlite  — persistent single-file, good for single-instance deployments
# redis   — persistent, shared across multiple router instances
cache_backend: memory

# Cosine similarity threshold for semantic matching
# 1.0  = exact match only (no sentence-transformers required)
# 0.95 = very close paraphrases only (recommended starting point)
# 0.85 = broader matching, higher false-positive risk
# Requires :semantic Docker image (or sentence-transformers installed)
cache_similarity: 1.0

# Time-to-live in seconds; null or omit to cache forever
cache_ttl: 3600

# SQLite backend: path to cache database file
cache_db_path: llm_cache.db

# Redis backend: connection URL
cache_redis_url: redis://localhost:6379/0

# Weight of chat-history embedding in the combined query vector (0.0–1.0)
# 0.3 = 30% history context, 70% last user message (default)
# Increase if your use case is highly conversation-dependent
cache_history_weight: 0.3
```

### Environment variable overrides

```shell
NOMYO_ROUTER_CACHE_ENABLED=true
NOMYO_ROUTER_CACHE_BACKEND=sqlite
NOMYO_ROUTER_CACHE_SIMILARITY=0.95
NOMYO_ROUTER_CACHE_TTL=7200
NOMYO_ROUTER_CACHE_DB_PATH=/data/llm_cache.db
NOMYO_ROUTER_CACHE_REDIS_URL=redis://redis:6379/0
NOMYO_ROUTER_CACHE_HISTORY_WEIGHT=0.3
```
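The resolution order is: an environment variable with the `NOMYO_ROUTER_` prefix wins over the value in `config.yaml`. A sketch of that lookup (the type-coercion rules shown are assumptions for illustration):

```python
import os

def setting(name: str, file_value, cast=str):
    """Return the NOMYO_ROUTER_-prefixed env var if set, else the
    config-file value. Coercion behaviour is an illustrative assumption."""
    raw = os.environ.get("NOMYO_ROUTER_" + name.upper())
    if raw is None:
        return file_value
    if cast is bool:
        return raw.lower() in ("1", "true", "yes")
    return cast(raw)
```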

### Per-request opt-in

The cache is opted in per request via the `nomyo.cache` field in the request body. The global `cache_enabled` setting must also be true.

```json
{
  "model": "llama3.2",
  "messages": [{"role": "user", "content": "What is the capital of France?"}],
  "nomyo": {
    "cache": true
  }
}
```
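The resulting gate is a logical AND of the global switch and the per-request flag (the function name is hypothetical):

```python
def cache_allowed(global_enabled: bool, body: dict) -> bool:
    # Both the global cache_enabled setting and the per-request
    # nomyo.cache flag must be true for a request to use the cache.
    return global_enabled and bool(body.get("nomyo", {}).get("cache", False))
```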

## Backend Comparison

| | `memory` | `sqlite` | `redis` |
|---|---------|----------|---------|
| Persistence | No (lost on restart) | Yes | Yes |
| Multi-instance | No | No | Yes |
| Setup required | None | None | Redis server |
| Recommended for | Development, testing | Single-instance production | Multi-instance / HA production |

### SQLite — persistent single-instance

```yaml
cache_enabled: true
cache_backend: sqlite
cache_db_path: /data/llm_cache.db
cache_ttl: 86400   # 24 hours
```

Mount the path in Docker:

```shell
docker run -d \
  -v /path/to/data:/data \
  -e NOMYO_ROUTER_CACHE_ENABLED=true \
  -e NOMYO_ROUTER_CACHE_BACKEND=sqlite \
  -e NOMYO_ROUTER_CACHE_DB_PATH=/data/llm_cache.db \
  ghcr.io/nomyo-ai/nomyo-router:latest
```

### Redis — distributed / multi-instance

```yaml
cache_enabled: true
cache_backend: redis
cache_redis_url: redis://redis:6379/0
cache_similarity: 0.95
cache_ttl: 3600
```

Docker Compose example:

```yaml
services:
  nomyo-router:
    image: ghcr.io/nomyo-ai/nomyo-router:latest-semantic
    ports:
      - "12434:12434"
    environment:
      NOMYO_ROUTER_CACHE_ENABLED: "true"
      NOMYO_ROUTER_CACHE_BACKEND: redis
      NOMYO_ROUTER_CACHE_REDIS_URL: redis://redis:6379/0
      NOMYO_ROUTER_CACHE_SIMILARITY: "0.95"
    depends_on:
      - redis

  redis:
    image: redis:7-alpine
    volumes:
      - redis_data:/data

volumes:
  redis_data:
```

## Code Examples

### Python (OpenAI SDK)

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:12434/v1",
    api_key="unused",
)

# First call — cache miss, hits the LLM
response = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
    extra_body={"nomyo": {"cache": True}},
)
print(response.choices[0].message.content)

# Second call (exact same question) — cache hit, returned instantly
response = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
    extra_body={"nomyo": {"cache": True}},
)
print(response.choices[0].message.content)

# With semantic mode enabled, this also hits the cache:
response = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "Tell me the capital city of France"}],
    extra_body={"nomyo": {"cache": True}},
)
print(response.choices[0].message.content)
```

### Python (raw HTTP / httpx)

```python
import httpx

payload = {
    "model": "llama3.2",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain quantum entanglement in one sentence."},
    ],
    "nomyo": {"cache": True},
}

# A generous timeout: cache misses wait on real LLM inference,
# which can easily exceed httpx's 5-second default.
resp = httpx.post("http://localhost:12434/v1/chat/completions",
                  json=payload, timeout=60.0)
print(resp.json()["choices"][0]["message"]["content"])
```

### curl

```shell
curl -s http://localhost:12434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2",
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
    "nomyo": {"cache": true}
  }' | jq '.choices[0].message.content'
```

### Check cache statistics

```shell
curl -s http://localhost:12434/api/cache/stats | jq .
```

Example response:

```json
{
  "hits": 42,
  "misses": 18,
  "hit_rate": 0.7,
  "semantic": true,
  "backend": "sqlite",
  "similarity_threshold": 0.95,
  "history_weight": 0.3
}
```
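The reported `hit_rate` is simply hits / (hits + misses): here 42 / (42 + 18) = 0.7. A sketch:

```python
def hit_rate(hits: int, misses: int) -> float:
    """Cache hit rate; 0.0 when no lookups have happened yet."""
    total = hits + misses
    return hits / total if total else 0.0
```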

### Clear the cache

```shell
curl -X POST http://localhost:12434/api/cache/clear
```

## Use Cases Where Semantic Caching Helps Most

### FAQ and support bots

Users ask the same questions in slightly different ways. "How do I reset my password?", "I forgot my password", "password reset instructions" — all semantically equivalent. Semantic caching collapses these into a single LLM call.

Recommended settings: `cache_similarity: 0.90`–`0.95`, `cache_backend: sqlite` or `redis`, long TTL.

### Document Q&A / RAG pipelines

When the same document set is queried repeatedly by many users, common factual questions ("What is the contract start date?", "When does this contract begin?") hit the cache across users — as long as the system prompt is not user-personalized.

Recommended settings: `cache_similarity: 0.92`, `cache_backend: redis` for multi-user deployments.

### Code generation with repeated patterns

Development tools that generate boilerplate, explain errors, or convert code often receive nearly-identical prompts. Caching prevents re-running expensive large-model inference for the same error message phrased differently.

Recommended settings: `cache_similarity: 0.93`, `cache_backend: sqlite`.

### High-traffic chatbots with shared context

Public-facing bots where many users share the same system prompt (same product, same persona). Generic factual answers can be safely reused across users. The privacy guard ensures personalized responses are never shared.

Recommended settings: `cache_similarity: 0.90`–`0.95`, `cache_backend: redis`.

### Batch processing / data pipelines

When running LLM inference over large datasets with many near-duplicate or repeated inputs, caching can reduce inference calls dramatically and make pipelines idempotent.

Recommended settings: `cache_similarity: 0.95`, `cache_backend: sqlite`, `cache_ttl: null` (cache forever for deterministic batch runs).


## Tuning the Similarity Threshold

| `cache_similarity` | Behaviour | Risk |
|--------------------|-----------|------|
| 1.0 | Exact match only | No false positives |
| 0.97 | Catches minor typos and punctuation differences | Very low |
| 0.95 | Catches clear paraphrases (recommended default) | Low |
| 0.90 | Catches broader rewordings | Moderate — may match different intents |
| < 0.85 | Very aggressive matching | High false-positive rate, not recommended |

Start at 0.95 and lower gradually while monitoring the cache hit rate via `/api/cache/stats`.
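For intuition, the comparison behind these thresholds is plain cosine similarity between the query embedding and each stored embedding:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity in [-1, 1]; a semantic cache hit requires
    cosine_similarity(query, stored) >= cache_similarity."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```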


## Limitations

- Streaming responses cached from a non-streaming request are served as a single chunk (not token-by-token). This is indistinguishable to most clients.
- MoE models (`moe-*` prefix) are never cached.
- Token counts are not recorded for cache hits (the LLM was not called).
- The `memory` backend does not survive a router restart.
- Semantic search requires the `:semantic` image (or a local sentence-transformers install); without it, `cache_similarity < 1.0` silently falls back to exact match.