
# semantic-llm-cache

**Async semantic caching for LLM API calls — reduce costs with one decorator.**

[![PyPI](https://img.shields.io/pypi/v/semantic-llm-cache)](https://pypi.org/project/semantic-llm-cache/)
[![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](LICENSE)
[![Python](https://img.shields.io/pypi/pyversions/semantic-llm-cache)](https://pypi.org/project/semantic-llm-cache/)

> **Fork of [karthyick/prompt-cache](https://github.com/karthyick/prompt-cache)** — fully converted to async for use with async frameworks (FastAPI, aiohttp, Starlette, etc.).

## Overview

LLM API calls are expensive and slow. In production applications, **20-40% of prompts are semantically equivalent to earlier ones** but are still billed as separate API calls. `semantic-llm-cache` addresses this with a single decorator that:

- ✅ **Caches semantically similar prompts** (not just exact matches)
- ✅ **Reduces API costs by 20-40%**
- ✅ **Returns cached responses in <10ms**
- ✅ **Works with any LLM provider** (OpenAI, Anthropic, Ollama, local models)
- ✅ **Fully async** — native `async/await` throughout, no event loop blocking
- ✅ **Auto-detects** sync vs async decorated functions — one decorator for both
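
To make "semantically similar" concrete, here is a minimal sketch of the lookup idea: embed each prompt as a vector, compare vectors with cosine similarity, and return the stored response when the best score meets the threshold. The bag-of-words `toy_embed` function and the `ToySemanticCache` class are illustrative assumptions, not this library's internals; a real setup would use a sentence-embedding model instead of word counts.

```python
import math
import re
from collections import Counter
from typing import Optional


def toy_embed(text: str) -> Counter:
    """Hypothetical stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class ToySemanticCache:
    """Linear-scan cache: fine for a sketch, too slow for large caches."""

    def __init__(self, threshold: float) -> None:
        self.threshold = threshold
        self.entries = []  # list of (vector, response) pairs

    def get(self, prompt: str) -> Optional[str]:
        query = toy_embed(prompt)
        best_score, best_response = 0.0, None
        for vector, response in self.entries:
            score = cosine(query, vector)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None

    def put(self, prompt: str, response: str) -> None:
        self.entries.append((toy_embed(prompt), response))


toy = ToySemanticCache(threshold=0.8)
toy.put("What is the capital of France?", "Paris")
hit = toy.get("what is the capital of france")   # same words, different casing and punctuation
miss = toy.get("How do I bake sourdough bread?")  # unrelated prompt, no shared words
```

Here `hit` is the stored response and `miss` is `None`. Raising the threshold toward 1.0 trades hit rate for a lower risk of returning an answer to a different question.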

## What changed from the original

| Area                 | Original                  | This fork                                                           |
| -------------------- | ------------------------- | ------------------------------------------------------------------- |
| Backends             | sync (`sqlite3`, `redis`) | async (`aiosqlite`, `redis.asyncio`)                                |
| `@cache` decorator   | sync only                 | auto-detects async/sync                                             |
| `EmbeddingCache`     | sync `encode()`           | adds `async aencode()` via `asyncio.to_thread`                      |
| `CacheContext`       | sync only                 | supports both `with` and `async with`                               |
| `CachedLLM`          | `chat()`                  | adds `achat()`                                                      |
| Utility functions    | sync                      | `clear_cache`, `invalidate`, `warm_cache`, `export_cache` all async |
| `StorageBackend` ABC | sync abstract methods     | all abstract methods are `async def`                                |
| Min Python           | 3.9                       | 3.10 (uses `X \| Y` union syntax)                                   |
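
As an illustration of the "auto-detects async/sync" row, a single decorator can inspect the wrapped function and return a matching wrapper. The sketch below uses the standard library's `inspect.iscoroutinefunction`; `toy_cache` is a hypothetical exact-match cache written for this example, and whether the fork detects coroutines in exactly this way is an assumption.

```python
import asyncio
import functools
import inspect


def toy_cache(func):
    """Exact-match cache that wraps either a sync or an async function."""
    store = {}

    if inspect.iscoroutinefunction(func):
        @functools.wraps(func)
        async def async_wrapper(prompt: str) -> str:
            if prompt not in store:
                store[prompt] = await func(prompt)
            return store[prompt]
        return async_wrapper

    @functools.wraps(func)
    def sync_wrapper(prompt: str) -> str:
        if prompt not in store:
            store[prompt] = func(prompt)
        return store[prompt]
    return sync_wrapper


calls = {"async": 0, "sync": 0}


@toy_cache
async def ask_async(prompt: str) -> str:
    calls["async"] += 1
    return prompt.upper()


@toy_cache
def ask_sync(prompt: str) -> str:
    calls["sync"] += 1
    return prompt.upper()


async def main() -> list:
    # The second await is served from the cache, so the function body runs once.
    return [await ask_async("hi"), await ask_async("hi")]


async_results = asyncio.run(main())
sync_results = [ask_sync("hi"), ask_sync("hi")]
```

After this runs, each underlying function has been called once even though each wrapper was called twice.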

## Installation

Not yet published to PyPI. Install directly from the repository:

```bash
# Clone
git clone https://github.com/YOUR_ORG/prompt-cache.git
cd prompt-cache

# Core (exact match only, SQLite backend)
pip install .

# With semantic similarity (sentence-transformers)
pip install ".[semantic]"

# With Redis backend
pip install ".[redis]"

# Everything
pip install ".[all]"
```

Or install directly via pip from git:

```bash
pip install "git+https://github.com/nomyo-ai/.git"
pip install "semantic-llm-cache[semantic] @ git+https://github.com/nomyo-ai/async-semantic-llm-cache.git"
```

## Quick Start

### Async function (FastAPI, aiohttp, etc.)

```python
from semantic_llm_cache import cache

@cache(similarity=0.95, ttl=3600)
async def ask_llm(prompt: str) -> str:
    return await call_ollama(prompt)

# First call — LLM hit
await ask_llm("What is Python?")

# Second call — cache hit (<10ms, free)
await ask_llm("What's Python?")  # 95% similar → cache hit
```

### Sync function (backwards compatible)

```python
from semantic_llm_cache import cache

@cache()
def ask_llm_sync(prompt: str) -> str:
    return call_openai(prompt)  # works, but don't use inside a running event loop
```

### Semantic Matching

```python
from semantic_llm_cache import cache

@cache(similarity=0.90)
async def ask_llm(prompt: str) -> str:
    return await call_ollama(prompt)

await ask_llm("What is Python?")   # LLM call
await ask_llm("What's Python?")    # cache hit (95% similar)
await ask_llm("Explain Python")    # cache hit (91% similar)
await ask_llm("What is Rust?")     # LLM call (different topic)
```

### SQLite backend (persistent)

```python
from semantic_llm_cache import cache
from semantic_llm_cache.backends import SQLiteBackend

backend = SQLiteBackend(db_path="my_cache.db")

@cache(backend=backend, similarity=0.95)
async def ask_llm(prompt: str) -> str:
    return await call_ollama(prompt)
```

### Redis backend (distributed)

```python
from semantic_llm_cache import cache
from semantic_llm_cache.backends import RedisBackend

backend = RedisBackend(url="redis://localhost:6379/0")
await backend.ping()  # verify connection (replaces __init__ connection test)

@cache(backend=backend, similarity=0.95)
async def ask_llm(prompt: str) -> str:
    return await call_ollama(prompt)
```

### Cache Statistics

```python
from semantic_llm_cache import get_stats

stats = get_stats()
# {
#     "hits": 1547,
#     "misses": 892,
#     "hit_rate": 0.634,
#     "estimated_savings_usd": 3.09,
#     "total_saved_ms": 773500
# }
```
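
The sample figures above are consistent with a simple model in which every hit saves a fixed cost and a fixed latency. The sketch below reproduces them; the per-call cost and latency constants are assumptions chosen to match the sample output, and whether `get_stats()` derives its fields this way is not confirmed here.

```python
hits, misses = 1547, 892

# Assumed averages per avoided LLM call (hypothetical, for illustration only).
ASSUMED_COST_PER_CALL_USD = 0.002
ASSUMED_LATENCY_PER_CALL_MS = 500

example_stats = {
    "hits": hits,
    "misses": misses,
    # Fraction of all lookups that were served from the cache.
    "hit_rate": round(hits / (hits + misses), 3),
    "estimated_savings_usd": round(hits * ASSUMED_COST_PER_CALL_USD, 2),
    "total_saved_ms": hits * ASSUMED_LATENCY_PER_CALL_MS,
}
```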

### Cache Management

```python
from semantic_llm_cache.stats import clear_cache, invalidate

# Clear all cached entries
await clear_cache()

# Invalidate entries matching a pattern
await invalidate(pattern="Python")
```

### Async context manager

```python
from semantic_llm_cache import CacheContext

async with CacheContext(similarity=0.9) as ctx:
    result1 = await any_cached_llm_call("prompt 1")
    result2 = await any_cached_llm_call("prompt 2")

print(ctx.stats)  # {"hits": 1, "misses": 1}
```
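
For readers curious how one object can support both `with` and `async with`, the sketch below implements the synchronous and asynchronous context-manager protocols side by side. `ToyCacheContext` is a hypothetical class for illustration and does not reflect how `CacheContext` is implemented.

```python
import asyncio


class ToyCacheContext:
    """Hypothetical context object usable with both `with` and `async with`."""

    def __init__(self) -> None:
        self.stats = {"hits": 0, "misses": 0}
        self.active = False

    # Synchronous protocol
    def __enter__(self) -> "ToyCacheContext":
        self.active = True
        return self

    def __exit__(self, exc_type, exc, tb) -> bool:
        self.active = False
        return False  # do not swallow exceptions

    # Asynchronous protocol delegates to the synchronous one
    async def __aenter__(self) -> "ToyCacheContext":
        return self.__enter__()

    async def __aexit__(self, exc_type, exc, tb) -> bool:
        return self.__exit__(exc_type, exc, tb)


def use_sync() -> bool:
    with ToyCacheContext() as ctx:
        return ctx.active


async def use_async() -> bool:
    async with ToyCacheContext() as ctx:
        return ctx.active


sync_active = use_sync()
async_active = asyncio.run(use_async())
```

Delegating the async methods to the sync ones is reasonable only when setup and teardown do no I/O; a context that opens connections would need real `await` calls in `__aenter__` and `__aexit__`.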

### CachedLLM wrapper

```python
from semantic_llm_cache import CachedLLM

llm = CachedLLM(similarity=0.9, ttl=3600)
response = await llm.achat("What is Python?", llm_func=my_async_llm)
```

## API Reference

### `@cache()` Decorator

```python
@cache(
    similarity: float = 1.0,      # 1.0 = exact match, 0.9 = semantic
    ttl: int | None = 3600,       # seconds, None = forever
    backend: Backend = None,      # None = in-memory
    namespace: str = "default",   # isolate different use cases
    enabled: bool = True,         # toggle for debugging
    key_func: Callable = None,    # custom cache key
)
async def my_llm_function(prompt: str) -> str:
    ...
```

### Parameters

| Parameter    | Type          | Default     | Description                                               |
| ------------ | ------------- | ----------- | --------------------------------------------------------- |
| `similarity` | `float`       | `1.0`       | Cosine similarity threshold (1.0 = exact, 0.9 = semantic) |
| `ttl`        | `int \| None` | `3600`      | Time-to-live in seconds (None = never expires)            |
| `backend`    | `Backend`     | `None`      | Storage backend (None = in-memory)                        |
| `namespace`  | `str`         | `"default"` | Isolate different use cases                               |
| `enabled`    | `bool`        | `True`      | Enable/disable caching                                    |
| `key_func`   | `Callable`    | `None`      | Custom cache key function                                 |

### Utility Functions

```python
from semantic_llm_cache import get_stats          # sync — safe anywhere
from semantic_llm_cache.stats import (
    clear_cache,   # async
    invalidate,    # async
    warm_cache,    # async
    export_cache,  # async
)
```

## Backends

| Backend         | Description                          | I/O                       |
| --------------- | ------------------------------------ | ------------------------- |
| `MemoryBackend` | In-memory LRU (default)              | none — runs in event loop |
| `SQLiteBackend` | Persistent, file-based (`aiosqlite`) | async non-blocking        |
| `RedisBackend`  | Distributed (`redis.asyncio`)        | async non-blocking        |
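
To show what an async storage interface with an in-memory LRU implementation might look like, here is a minimal sketch. The class names, the `get`/`set` method names, and the `max_size` parameter are assumptions made for this example; the actual `StorageBackend` ABC may define different methods.

```python
import asyncio
from abc import ABC, abstractmethod
from collections import OrderedDict
from typing import Optional


class ToyStorageBackend(ABC):
    """Hypothetical async storage interface; method names are illustrative."""

    @abstractmethod
    async def get(self, key: str) -> Optional[str]: ...

    @abstractmethod
    async def set(self, key: str, value: str) -> None: ...


class ToyMemoryBackend(ToyStorageBackend):
    """In-memory LRU store: no I/O, so each call completes without yielding."""

    def __init__(self, max_size: int = 2) -> None:
        self.max_size = max_size
        self._data = OrderedDict()

    async def get(self, key: str) -> Optional[str]:
        if key not in self._data:
            return None
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    async def set(self, key: str, value: str) -> None:
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self.max_size:
            self._data.popitem(last=False)  # evict least recently used


async def demo() -> list:
    backend = ToyMemoryBackend(max_size=2)
    await backend.set("a", "1")
    await backend.set("b", "2")
    await backend.get("a")       # "a" becomes most recently used
    await backend.set("c", "3")  # capacity exceeded, so "b" is evicted
    return [await backend.get("a"), await backend.get("b"), await backend.get("c")]


lru_result = asyncio.run(demo())
```

Keeping every method `async def`, even for the in-memory case, lets calling code `await` any backend uniformly.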

## Embedding Providers

| Provider                      | Quality                      | Notes                       |
| ----------------------------- | ---------------------------- | --------------------------- |
| `DummyEmbeddingProvider`      | hash-only, no semantic match | zero deps, default          |
| `SentenceTransformerProvider` | high (local model)           | requires `[semantic]` extra |
| `OpenAIEmbeddingProvider`     | high (API)                   | requires `[openai]` extra   |

Embedding inference is offloaded via `asyncio.to_thread` — model loading is blocking and should be done at application startup, not on first request.

```python
from semantic_llm_cache.similarity import create_embedding_provider, EmbeddingCache

# Pre-load at startup (blocking — do this in lifespan, not a request handler)
provider = create_embedding_provider("sentence-transformer")
embedding_cache = EmbeddingCache(provider=provider)

# Use in request handlers (non-blocking)
embedding = await embedding_cache.aencode("my prompt")
```
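
The effect of offloading can be seen in a standalone sketch that uses only the standard library. `blocking_encode` is a hypothetical stand-in for model inference, and `toy_aencode` mirrors the idea behind `aencode` without claiming to match its implementation. While the blocking call runs in a worker thread, a second coroutine keeps making progress on the event loop.

```python
import asyncio
import time


def blocking_encode(text: str) -> list:
    """Stand-in for blocking model inference (hypothetical)."""
    time.sleep(0.05)  # simulates roughly 50 ms of blocking work
    return [float(len(text))]


async def toy_aencode(text: str) -> list:
    # Run the blocking call in a worker thread so the event loop stays free.
    return await asyncio.to_thread(blocking_encode, text)


async def heartbeat(ticks: list) -> None:
    # Each tick needs the event loop; a blocked loop would delay all of them.
    for _ in range(3):
        await asyncio.sleep(0.01)
        ticks.append(time.monotonic())


async def main():
    ticks = []
    embedding, _ = await asyncio.gather(toy_aencode("my prompt"), heartbeat(ticks))
    return embedding, len(ticks)


embedding, tick_count = asyncio.run(main())
```

Calling `blocking_encode` directly inside a coroutine would instead stall every other task for the duration of the call.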

## Performance

| Metric                     | Value                                    |
| -------------------------- | ---------------------------------------- |
| Cache hit latency          | <10ms                                    |
| Embedding overhead on miss | ~50ms (sentence-transformers, offloaded) |
| Typical hit rate           | 25-40%                                   |
| Cost reduction             | 20-40%                                   |

## Requirements

- Python >= 3.10
- numpy >= 1.24.0
- aiosqlite >= 0.19.0

### Optional

- `sentence-transformers >= 2.2.0` — semantic matching
- `redis >= 4.2.0` — Redis backend (includes `redis.asyncio`)
- `openai >= 1.0.0` — OpenAI embeddings

## License

MIT — see [LICENSE](LICENSE).

## Credits

Original library by **Karthick Raja M** ([@karthyick](https://github.com/karthyick)).  
Async conversion by this fork.