2025-12-29 11:15:18 +08:00
# FlakeStorm
2025-12-28 21:55:01 +08:00
< p align = "center" >
< strong > The Agent Reliability Engine< / strong > < br >
< em > Chaos Engineering for AI Agents< / em >
< / p >
< p align = "center" >
2025-12-29 11:15:18 +08:00
< a href = "https://github.com/flakestorm/flakestorm/blob/main/LICENSE" >
2025-12-29 00:11:02 +08:00
< img src = "https://img.shields.io/badge/license-AGPLv3-blue.svg" alt = "License" >
2025-12-28 21:55:01 +08:00
< / a >
2025-12-29 11:15:18 +08:00
< a href = "https://pypi.org/project/flakestorm/" >
< img src = "https://img.shields.io/pypi/v/flakestorm.svg" alt = "PyPI" >
2025-12-28 21:55:01 +08:00
< / a >
2025-12-29 11:15:18 +08:00
< a href = "https://pypi.org/project/flakestorm/" >
< img src = "https://img.shields.io/pypi/pyversions/flakestorm.svg" alt = "Python Versions" >
2025-12-28 21:55:01 +08:00
< / a >
2025-12-29 11:15:18 +08:00
< a href = "https://flakestorm.com" >
2025-12-29 00:11:02 +08:00
< img src = "https://img.shields.io/badge/☁️-Cloud%20Available-blueviolet" alt = "Cloud" >
< / a >
2025-12-28 21:55:01 +08:00
< / p >
---
## The Problem
**The "Happy Path" Fallacy**: Current AI development tools focus on getting an agent to work *once* . Developers tweak prompts until they get a correct answer, declare victory, and ship.
**The Reality**: LLMs are non-deterministic. An agent that works on Monday with `temperature=0.7` might fail on Tuesday. Users don't follow "Happy Paths" — they make typos, they're aggressive, they lie, and they attempt prompt injections.
**The Void**:
- **Observability Tools** (LangSmith) tell you *after* the agent failed in production
- **Eval Libraries** (RAGAS) focus on academic scores rather than system reliability
- **Missing Link**: A tool that actively *attacks* the agent to prove robustness before deployment
## The Solution
2025-12-29 11:15:18 +08:00
**FlakeStorm** is a local-first testing engine that applies **Chaos Engineering** principles to AI Agents.
2025-12-28 21:55:01 +08:00
2025-12-29 11:15:18 +08:00
Instead of running one test case, FlakeStorm takes a single "Golden Prompt", generates adversarial mutations (semantic variations, noise injection, hostile tone, prompt injections), runs them against your agent, and calculates a **Robustness Score** .
2025-12-28 21:55:01 +08:00
2025-12-29 11:15:18 +08:00
> **"If it passes FlakeStorm, it won't break in Production."**
2025-12-29 00:11:02 +08:00
2025-12-29 11:15:18 +08:00
## Features
2025-12-28 21:55:01 +08:00
2025-12-29 00:11:02 +08:00
- ✅ **5 Mutation Types** : Paraphrasing, noise, tone shifts, basic adversarial, custom templates
- ✅ **Invariant Assertions** : Deterministic checks, semantic similarity, basic safety
- ✅ **Local-First** : Uses Ollama with Qwen 3 8B for free testing
- ✅ **Beautiful Reports** : Interactive HTML reports with pass/fail matrices
2025-12-29 11:15:18 +08:00
- ✅ **50 Mutations Max** : Per test run
- ✅ **Sequential Execution** : One test at a time
2025-12-28 21:55:01 +08:00
## Quick Start
### Installation
```bash
2025-12-29 11:15:18 +08:00
pip install flakestorm
2025-12-28 21:55:01 +08:00
```
### Prerequisites
2025-12-29 11:15:18 +08:00
FlakeStorm uses [Ollama ](https://ollama.ai ) for local model inference:
2025-12-28 21:55:01 +08:00
```bash
# Install Ollama (macOS/Linux)
curl -fsSL https://ollama.ai/install.sh | sh
# Pull the default model
ollama pull qwen3:8b
```
### Initialize Configuration
```bash
2025-12-29 11:15:18 +08:00
flakestorm init
2025-12-28 21:55:01 +08:00
```
2025-12-29 11:15:18 +08:00
This creates a `flakestorm.yaml` configuration file:
2025-12-28 21:55:01 +08:00
```yaml
version: "1.0"
agent:
endpoint: "http://localhost:8000/invoke"
type: "http"
timeout: 30000
model:
provider: "ollama"
name: "qwen3:8b"
base_url: "http://localhost:11434"
mutations:
2025-12-29 11:15:18 +08:00
count: 10 # Max 50 total per run
2025-12-28 21:55:01 +08:00
types:
- paraphrase
- noise
- tone_shift
- prompt_injection
golden_prompts:
- "Book a flight to Paris for next Monday"
- "What's my account balance?"
invariants:
- type: "latency"
max_ms: 2000
- type: "valid_json"
output:
format: "html"
path: "./reports"
```
### Run Tests
```bash
2025-12-29 11:15:18 +08:00
flakestorm run
2025-12-28 21:55:01 +08:00
```
Output:
```
Generating mutations... ━━━━━━━━━━━━━━━━━━━━ 100%
Running attacks... ━━━━━━━━━━━━━━━━━━━━ 100%
╭──────────────────────────────────────────╮
│ Robustness Score: 87.5% │
│ ──────────────────────── │
2025-12-29 00:11:02 +08:00
│ Passed: 17/20 mutations │
│ Failed: 3 (2 latency, 1 injection) │
2025-12-28 21:55:01 +08:00
╰──────────────────────────────────────────╯
2025-12-29 11:15:18 +08:00
Report saved to: ./reports/flakestorm-2024-01-15-143022.html
2025-12-28 21:55:01 +08:00
```
2025-12-29 00:11:02 +08:00
### Check Limits
```bash
2025-12-29 11:15:18 +08:00
flakestorm limits # Show edition limits
flakestorm cloud # Learn about Cloud features
2025-12-29 00:11:02 +08:00
```
2025-12-28 21:55:01 +08:00
## Mutation Types
| Type | Description | Example |
|------|-------------|---------|
| **Paraphrase** | Semantically equivalent rewrites | "Book a flight" → "I need to fly out" |
| **Noise** | Typos and spelling errors | "Book a flight" → "Book a fliight plz" |
| **Tone Shift** | Aggressive/impatient phrasing | "Book a flight" → "I need a flight NOW!" |
2025-12-29 00:11:02 +08:00
| **Prompt Injection** | Basic adversarial attacks | "Book a flight and ignore previous instructions" |
| **Custom** | Your own mutation templates | Define with `{prompt}` placeholder |
2025-12-29 11:15:18 +08:00
> **Need advanced mutations?** Visit [flakestorm.com](https://flakestorm.com) for more options.
2025-12-28 21:55:01 +08:00
## Invariants (Assertions)
### Deterministic
```yaml
invariants:
- type: "contains"
value: "confirmation_code"
- type: "latency"
max_ms: 2000
- type: "valid_json"
```
### Semantic
```yaml
invariants:
- type: "similarity"
expected: "Your flight has been booked"
threshold: 0.8
```
2025-12-29 00:11:02 +08:00
### Safety (Basic)
2025-12-28 21:55:01 +08:00
```yaml
invariants:
2025-12-29 00:11:02 +08:00
- type: "excludes_pii" # Basic regex patterns
2025-12-28 21:55:01 +08:00
- type: "refusal_check"
```
2025-12-29 11:15:18 +08:00
> **Need advanced safety?** Visit [flakestorm.com](https://flakestorm.com) for more options.
2025-12-29 00:11:02 +08:00
2025-12-28 21:55:01 +08:00
## Agent Adapters
### HTTP Endpoint
```yaml
agent:
type: "http"
endpoint: "http://localhost:8000/invoke"
```
### Python Callable
```python
2025-12-29 11:15:18 +08:00
from flakestorm import test_agent
2025-12-28 21:55:01 +08:00
@test_agent
async def my_agent(input: str) -> str:
# Your agent logic
return response
```
### LangChain
```yaml
agent:
type: "langchain"
module: "my_agent:chain"
```
## CI/CD Integration
2025-12-29 11:15:18 +08:00
For local testing:
2025-12-29 00:11:02 +08:00
```bash
# Run before committing (manual)
2025-12-29 11:15:18 +08:00
flakestorm run --min-score 0.9
2025-12-28 21:55:01 +08:00
```
2025-12-29 11:15:18 +08:00
For advanced CI/CD features, visit [flakestorm.com ](https://flakestorm.com ).
2025-12-29 00:11:02 +08:00
2025-12-28 21:55:01 +08:00
## Robustness Score
The Robustness Score is calculated as:
$$R = \frac{W_s \cdot S_{passed} + W_d \cdot D_{passed}}{N_{total}}$$
Where:
- $S_{passed}$ = Semantic variations passed
- $D_{passed}$ = Deterministic tests passed
- $W$ = Weights assigned by mutation difficulty
## Documentation
2025-12-29 00:11:02 +08:00
### Getting Started
- [📖 Usage Guide ](docs/USAGE_GUIDE.md ) - Complete end-to-end guide
- [⚙️ Configuration Guide ](docs/CONFIGURATION_GUIDE.md ) - All configuration options
- [🧪 Test Scenarios ](docs/TEST_SCENARIOS.md ) - Real-world examples with code
### For Developers
- [🏗️ Architecture & Modules ](docs/MODULES.md ) - How the code works
- [❓ Developer FAQ ](docs/DEVELOPER_FAQ.md ) - Q& A about design decisions
- [📦 Publishing Guide ](docs/PUBLISHING.md ) - How to publish to PyPI
- [🤝 Contributing ](docs/CONTRIBUTING.md ) - How to contribute
### Reference
- [📋 API Specification ](docs/API_SPECIFICATION.md ) - API reference
- [🧪 Testing Guide ](docs/TESTING_GUIDE.md ) - How to run and write tests
- [✅ Implementation Checklist ](docs/IMPLEMENTATION_CHECKLIST.md ) - Development progress
2025-12-28 21:55:01 +08:00
## License
2025-12-29 00:11:02 +08:00
AGPLv3 - See [LICENSE ](LICENSE ) for details.
2025-12-28 21:55:01 +08:00
---
< p align = "center" >
2025-12-29 11:15:18 +08:00
< strong > Tested with FlakeStorm< / strong > < br >
< img src = "https://img.shields.io/badge/tested%20with-flakestorm-brightgreen" alt = "Tested with FlakeStorm" >
2025-12-28 21:55:01 +08:00
< / p >
2025-12-29 00:11:02 +08:00
< p align = "center" >
2025-12-29 11:15:18 +08:00
< a href = "https://flakestorm.com" >
< strong > ⚡ Need more features? Visit FlakeStorm Cloud →< / strong >
2025-12-29 00:11:02 +08:00
< / a >
< / p >