# semantic-llm-cache

**Async semantic caching for LLM API calls: reduce costs with one decorator.**

[![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](LICENSE)
[![Python](https://img.shields.io/pypi/pyversions/semantic-llm-cache)](https://pypi.org/project/semantic-llm-cache/)

> **Fork of [karthyick/prompt-cache](https://github.com/karthyick/prompt-cache)**, fully converted to async for use with async frameworks (FastAPI, aiohttp, Starlette, etc.).

## Overview

LLM API calls are expensive and slow. In production applications, **20-40% of prompts are semantically identical** but get charged as separate API calls.

`semantic-llm-cache` solves this with a simple decorator that:

- ✅ **Caches semantically similar prompts** (not just exact matches)
- ✅ **Reduces API costs by 20-40%**
- ✅ **Returns cached responses in <10ms**
- ✅ **Works with any LLM provider** (OpenAI, Anthropic, Ollama, local models)
- ✅ **Fully async**: native `async/await` throughout, no event loop blocking
- ✅ **Auto-detects** sync vs async decorated functions, so one decorator covers both

## What changed from the original

| Area                   | Original                    | This fork                                                             |
| ---------------------- | --------------------------- | --------------------------------------------------------------------- |
| Backends               | sync (`sqlite3`, `redis`)   | async (`aiosqlite`, `redis.asyncio`)                                  |
| `@cache` decorator     | sync only                   | auto-detects async/sync                                               |
| `EmbeddingCache`       | sync `encode()`             | adds `async aencode()` via `asyncio.to_thread`                        |
| `CacheContext`         | sync only                   | supports both `with` and `async with`                                 |
| `CachedLLM`            | `chat()`                    | adds `achat()`                                                        |
| Utility functions      | sync                        | `clear_cache`, `invalidate`, `warm_cache`, `export_cache` all async   |
| `StorageBackend` ABC   | sync abstract methods       | all abstract methods are `async def`                                  |
| Min Python             | 3.9                         | 3.10 (uses `X \| Y` union syntax)                                     |

## Installation

Not yet published to PyPI.
Install directly from the repository:

```bash
# Clone
git clone https://github.com/YOUR_ORG/prompt-cache.git
cd prompt-cache

# Core (exact match only, SQLite backend)
pip install .

# With semantic similarity (sentence-transformers)
pip install ".[semantic]"

# With Redis backend
pip install ".[redis]"

# Everything
pip install ".[all]"
```

Or install directly via pip from git:

```bash
pip install "git+https://github.com/nomyo-ai/.git"

# With extras (PEP 508 direct reference; assumes the distribution is named semantic-llm-cache)
pip install "semantic-llm-cache[semantic] @ git+https://github.com/nomyo-ai/async-semantic-llm-cache.git"
```

## Quick Start

### Async function (FastAPI, aiohttp, etc.)

```python
from semantic_llm_cache import cache

# call_ollama is a placeholder for your own async LLM client call
@cache(similarity=0.95, ttl=3600)
async def ask_llm(prompt: str) -> str:
    return await call_ollama(prompt)

# First call: LLM hit
await ask_llm("What is Python?")

# Second call: cache hit (<10ms, free)
await ask_llm("What's Python?")  # 95% similar → cache hit
```

### Sync function (backwards compatible)

```python
from semantic_llm_cache import cache

@cache()
def ask_llm_sync(prompt: str) -> str:
    return call_openai(prompt)  # works, but don't use inside a running event loop
```

### Semantic Matching

```python
from semantic_llm_cache import cache

@cache(similarity=0.90)
async def ask_llm(prompt: str) -> str:
    return await call_ollama(prompt)

await ask_llm("What is Python?")  # LLM call
await ask_llm("What's Python?")   # cache hit (95% similar)
await ask_llm("Explain Python")   # cache hit (91% similar)
await ask_llm("What is Rust?")    # LLM call (different topic)
```

### SQLite backend (persistent)

```python
from semantic_llm_cache import cache
from semantic_llm_cache.backends import SQLiteBackend

backend = SQLiteBackend(db_path="my_cache.db")

@cache(backend=backend, similarity=0.95)
async def ask_llm(prompt: str) -> str:
    return await call_ollama(prompt)
```

### Redis backend (distributed)

```python
from semantic_llm_cache import cache
from semantic_llm_cache.backends import RedisBackend

backend = RedisBackend(url="redis://localhost:6379/0")
await backend.ping()  # verify the connection explicitly (replaces the __init__ connection test)

@cache(backend=backend, similarity=0.95)
async def ask_llm(prompt: str) -> str:
    return await call_ollama(prompt)
```

### Cache Statistics

```python
from semantic_llm_cache import get_stats

stats = get_stats()
# {
#     "hits": 1547,
#     "misses": 892,
#     "hit_rate": 0.634,
#     "estimated_savings_usd": 3.09,
#     "total_saved_ms": 773500
# }
```

### Cache Management

```python
from semantic_llm_cache.stats import clear_cache, invalidate

# Clear all cached entries
await clear_cache()

# Invalidate entries matching a pattern
await invalidate(pattern="Python")
```

### Async context manager

```python
from semantic_llm_cache import CacheContext

async with CacheContext(similarity=0.9) as ctx:
    result1 = await any_cached_llm_call("prompt 1")
    result2 = await any_cached_llm_call("prompt 2")
    print(ctx.stats)  # {"hits": 1, "misses": 1}
```

### CachedLLM wrapper

```python
from semantic_llm_cache import CachedLLM

llm = CachedLLM(similarity=0.9, ttl=3600)
response = await llm.achat("What is Python?", llm_func=my_async_llm)
```

## API Reference

### `@cache()` Decorator

```python
@cache(
    similarity: float = 1.0,      # 1.0 = exact match, 0.9 = semantic
    ttl: int | None = 3600,       # seconds, None = forever
    backend: Backend = None,      # None = in-memory
    namespace: str = "default",   # isolate different use cases
    enabled: bool = True,         # toggle for debugging
    key_func: Callable = None,    # custom cache key
)
async def my_llm_function(prompt: str) -> str:
    ...
```

### Parameters

| Parameter      | Type           | Default       | Description                                                 |
| -------------- | -------------- | ------------- | ----------------------------------------------------------- |
| `similarity`   | `float`        | `1.0`         | Cosine similarity threshold (1.0 = exact, 0.9 = semantic)   |
| `ttl`          | `int \| None`  | `3600`        | Time-to-live in seconds (None = never expires)              |
| `backend`      | `Backend`      | `None`        | Storage backend (None = in-memory)                          |
| `namespace`    | `str`          | `"default"`   | Isolate different use cases                                 |
| `enabled`      | `bool`         | `True`        | Enable/disable caching                                      |
| `key_func`     | `Callable`     | `None`        | Custom cache key function                                   |

### Utility Functions

```python
from semantic_llm_cache import get_stats  # sync, safe to call anywhere
from semantic_llm_cache.stats import (
    clear_cache,   # async
    invalidate,    # async
    warm_cache,    # async
    export_cache,  # async
)
```

## Backends

| Backend           | Description                            | I/O                            |
| ----------------- | -------------------------------------- | ------------------------------ |
| `MemoryBackend`   | In-memory LRU (default)                | none (runs in the event loop)  |
| `SQLiteBackend`   | Persistent, file-based (`aiosqlite`)   | async non-blocking             |
| `RedisBackend`    | Distributed (`redis.asyncio`)          | async non-blocking             |

## Embedding Providers

| Provider                        | Quality                        | Notes                         |
| ------------------------------- | ------------------------------ | ----------------------------- |
| `DummyEmbeddingProvider`        | hash-only, no semantic match   | zero deps, default            |
| `SentenceTransformerProvider`   | high (local model)             | requires `[semantic]` extra   |
| `OpenAIEmbeddingProvider`       | high (API)                     | requires `[openai]` extra     |

Embedding inference is offloaded via `asyncio.to_thread`, so it does not block the event loop. Model loading, however, is blocking and should be done at application startup, not on the first request.
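To illustrate the offloading pattern in isolation, here is a minimal standard-library sketch. It is not the library's implementation: `blocking_encode` is a hypothetical stand-in for a blocking model call and returns a fake hash-based vector, and the local `aencode` and `heartbeat` helpers exist only for this demonstration. The heartbeat task keeps ticking while the encode runs, which suggests the event loop stays free.

```python
import asyncio
import hashlib
import time


def blocking_encode(text: str) -> list[float]:
    """Hypothetical stand-in for a blocking model.encode() call."""
    time.sleep(0.05)  # simulate roughly 50 ms of blocking inference
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    # Map the first 8 bytes to floats in [0, 1]; this is not a real embedding.
    return [b / 255 for b in digest[:8]]


async def aencode(text: str) -> list[float]:
    """Run the blocking call in a worker thread so the loop stays responsive."""
    return await asyncio.to_thread(blocking_encode, text)


async def main() -> tuple[list[float], int]:
    ticks = 0

    async def heartbeat() -> None:
        # Counts how often the event loop gets control while encoding runs.
        nonlocal ticks
        while True:
            await asyncio.sleep(0.005)
            ticks += 1

    beat = asyncio.create_task(heartbeat())
    vector = await aencode("my prompt")
    beat.cancel()
    return vector, ticks


if __name__ == "__main__":
    vector, ticks = asyncio.run(main())
    print(len(vector), "dimensions;", ticks, "heartbeat ticks during encode")
```

Calling `blocking_encode` directly inside a coroutine would stall every other task for the duration of the call. In the library, the equivalent non-blocking entry point is `EmbeddingCache.aencode`: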
```python
from semantic_llm_cache.similarity import create_embedding_provider, EmbeddingCache

# Pre-load at startup (blocking: do this in lifespan, not in a request handler)
provider = create_embedding_provider("sentence-transformer")
embedding_cache = EmbeddingCache(provider=provider)

# Use in request handlers (non-blocking)
embedding = await embedding_cache.aencode("my prompt")
```

## Performance

| Metric                       | Value                                      |
| ---------------------------- | ------------------------------------------ |
| Cache hit latency            | <10ms                                      |
| Embedding overhead on miss   | ~50ms (sentence-transformers, offloaded)   |
| Typical hit rate             | 25-40%                                     |
| Cost reduction               | 20-40%                                     |

## Requirements

- Python >= 3.10
- numpy >= 1.24.0
- aiosqlite >= 0.19.0

### Optional

- `sentence-transformers >= 2.2.0`: semantic matching
- `redis >= 4.2.0`: Redis backend (includes `redis.asyncio`)
- `openai >= 1.0.0`: OpenAI embeddings

## License

MIT. See [LICENSE](LICENSE).

## Credits

Original library by **Karthick Raja M** ([@karthyick](https://github.com/karthyick)). Async conversion by this fork.