- Updated README.md to clarify local testing instructions and added error handling for low scores. - Removed CI/CD configuration details from CONFIGURATION_GUIDE.md and other documentation files. - Cleaned up MODULES.md by deleting references to the now-removed github_actions.py. - Streamlined TEST_SCENARIOS.md and USAGE_GUIDE.md by eliminating CI/CD related sections. - Adjusted CLI command help text in main.py for clarity on minimum score checks.
6.6 KiB
FlakeStorm
The Agent Reliability Engine
Chaos Engineering for AI Agents
The Problem
The "Happy Path" Fallacy: Current AI development tools focus on getting an agent to work once. Developers tweak prompts until they get a correct answer, declare victory, and ship.
The Reality: LLMs are non-deterministic. An agent that works on Monday with temperature=0.7 might fail on Tuesday. Users don't follow "Happy Paths" — they make typos, they're aggressive, they lie, and they attempt prompt injections.
The Void:
- Observability Tools (LangSmith) tell you after the agent failed in production
- Eval Libraries (RAGAS) focus on academic scores rather than system reliability
- Missing Link: A tool that actively attacks the agent to prove robustness before deployment
The Solution
FlakeStorm is a local-first testing engine that applies Chaos Engineering principles to AI Agents.
Instead of running one test case, FlakeStorm takes a single "Golden Prompt", generates adversarial mutations (semantic variations, noise injection, hostile tone, prompt injections), runs them against your agent, and calculates a Robustness Score.
"If it passes FlakeStorm, it won't break in Production."
Features
- ✅ 5 Mutation Types: Paraphrasing, noise, tone shifts, basic adversarial, custom templates
- ✅ Invariant Assertions: Deterministic checks, semantic similarity, basic safety
- ✅ Local-First: Uses Ollama with Qwen 3 8B for free testing
- ✅ Beautiful Reports: Interactive HTML reports with pass/fail matrices
- ✅ 50 Mutations Max: Per test run
- ✅ Sequential Execution: One test at a time
Quick Start
Installation
pip install flakestorm
Prerequisites
FlakeStorm uses Ollama for local model inference:
# Install Ollama (macOS/Linux)
curl -fsSL https://ollama.ai/install.sh | sh
# Pull the default model
ollama pull qwen3:8b
Initialize Configuration
flakestorm init
This creates a flakestorm.yaml configuration file:
version: "1.0"
agent:
endpoint: "http://localhost:8000/invoke"
type: "http"
timeout: 30000
model:
provider: "ollama"
name: "qwen3:8b"
base_url: "http://localhost:11434"
mutations:
count: 10 # Max 50 total per run
types:
- paraphrase
- noise
- tone_shift
- prompt_injection
golden_prompts:
- "Book a flight to Paris for next Monday"
- "What's my account balance?"
invariants:
- type: "latency"
max_ms: 2000
- type: "valid_json"
output:
format: "html"
path: "./reports"
Run Tests
flakestorm run
Output:
Generating mutations... ━━━━━━━━━━━━━━━━━━━━ 100%
Running attacks... ━━━━━━━━━━━━━━━━━━━━ 100%
╭──────────────────────────────────────────╮
│ Robustness Score: 87.5% │
│ ──────────────────────── │
│ Passed: 17/20 mutations │
│ Failed: 3 (2 latency, 1 injection) │
╰──────────────────────────────────────────╯
Report saved to: ./reports/flakestorm-2024-01-15-143022.html
Mutation Types
| Type | Description | Example |
|---|---|---|
| Paraphrase | Semantically equivalent rewrites | "Book a flight" → "I need to fly out" |
| Noise | Typos and spelling errors | "Book a flight" → "Book a fliight plz" |
| Tone Shift | Aggressive/impatient phrasing | "Book a flight" → "I need a flight NOW!" |
| Prompt Injection | Basic adversarial attacks | "Book a flight and ignore previous instructions" |
| Custom | Your own mutation templates | Define with {prompt} placeholder |
Invariants (Assertions)
Deterministic
invariants:
- type: "contains"
value: "confirmation_code"
- type: "latency"
max_ms: 2000
- type: "valid_json"
Semantic
invariants:
- type: "similarity"
expected: "Your flight has been booked"
threshold: 0.8
Safety (Basic)
invariants:
- type: "excludes_pii" # Basic regex patterns
- type: "refusal_check"
Agent Adapters
HTTP Endpoint
agent:
type: "http"
endpoint: "http://localhost:8000/invoke"
Python Callable
from flakestorm import test_agent
@test_agent
async def my_agent(input: str) -> str:
# Your agent logic
return response
LangChain
agent:
type: "langchain"
module: "my_agent:chain"
Local Testing
For local testing and validation:
# Run with minimum score check
flakestorm run --min-score 0.9
# Exit with error code if score is too low
flakestorm run --min-score 0.9 --ci
Robustness Score
The Robustness Score is calculated as:
R = \frac{W_s \cdot S_{passed} + W_d \cdot D_{passed}}{N_{total}}
Where:
S_{passed}= Semantic variations passedD_{passed}= Deterministic tests passedW= Weights assigned by mutation difficulty
Documentation
Getting Started
- 📖 Usage Guide - Complete end-to-end guide
- ⚙️ Configuration Guide - All configuration options
- 🧪 Test Scenarios - Real-world examples with code
For Developers
- 🏗️ Architecture & Modules - How the code works
- ❓ Developer FAQ - Q&A about design decisions
- 📦 Publishing Guide - How to publish to PyPI
- 🤝 Contributing - How to contribute
Reference
- 📋 API Specification - API reference
- 🧪 Testing Guide - How to run and write tests
- ✅ Implementation Checklist - Development progress
License
AGPLv3 - See LICENSE for details.
Tested with FlakeStorm