2025-12-30 16:36:24 +08:00
# Flakestorm
2025-12-28 21:55:01 +08:00
< p align = "center" >
< strong > The Agent Reliability Engine< / strong > < br >
2026-01-05 16:53:23 +08:00
< em > Chaos Engineering for Production AI Agents< / em >
2025-12-28 21:55:01 +08:00
< / p >
< p align = "center" >
2025-12-29 11:15:18 +08:00
< a href = "https://github.com/flakestorm/flakestorm/blob/main/LICENSE" >
2025-12-29 13:10:55 +08:00
< img src = "https://img.shields.io/badge/license-Apache--2.0-blue.svg" alt = "License" >
2025-12-28 21:55:01 +08:00
< / a >
2025-12-29 13:10:55 +08:00
< a href = "https://github.com/flakestorm/flakestorm" >
< img src = "https://img.shields.io/github/stars/flakestorm/flakestorm?style=social" alt = "GitHub Stars" >
2025-12-28 21:55:01 +08:00
< / a >
< / p >
---
## The Problem
**The "Happy Path" Fallacy**: Current AI development tools focus on getting an agent to work *once* . Developers tweak prompts until they get a correct answer, declare victory, and ship.
2026-01-05 16:53:23 +08:00
**The Reality**: LLMs are non-deterministic. An agent that works on Monday with `temperature=0.7` might fail on Tuesday. Production agents face real users who make typos, get aggressive, and attempt prompt injections. Real traffic exposes failures that happy-path testing misses.
2025-12-28 21:55:01 +08:00
**The Void**:
- **Observability Tools** (LangSmith) tell you *after* the agent failed in production
- **Eval Libraries** (RAGAS) focus on academic scores rather than system reliability
2026-01-05 16:53:23 +08:00
- **CI Pipelines** lack chaos testing — agents ship untested against adversarial inputs
2025-12-28 21:55:01 +08:00
- **Missing Link**: A tool that actively *attacks* the agent to prove robustness before deployment
## The Solution
2026-01-05 16:53:23 +08:00
**Flakestorm** is a chaos testing layer for production AI agents. It applies **Chaos Engineering** principles to systematically test how your agents behave under adversarial inputs before real users encounter them.
2025-12-28 21:55:01 +08:00
2026-01-05 16:53:23 +08:00
Instead of running one test case, Flakestorm takes a single "Golden Prompt", generates adversarial mutations (semantic variations, noise injection, hostile tone, prompt injections), runs them against your agent, and calculates a **Robustness Score** . Run it before deploy, in CI, or against production-like environments.
2025-12-28 21:55:01 +08:00
2025-12-30 16:36:24 +08:00
> **"If it passes Flakestorm, it won't break in Production."**
2025-12-29 00:11:02 +08:00
2026-01-05 16:53:23 +08:00
## Who Flakestorm Is For
- **Teams shipping AI agents to production** — Catch failures before users do
- **Engineers running agents behind APIs** — Test against real-world abuse patterns
- **Teams already paying for LLM APIs** — Reduce regressions and production incidents
- **CI/CD pipelines** — Automated reliability gates before deployment
Flakestorm is built for production-grade agents handling real traffic. While it works great for exploration and hobby projects, it's designed to catch the failures that matter when agents are deployed at scale.
2025-12-28 21:55:01 +08:00
2026-01-05 17:05:54 +08:00
#
2026-01-02 20:01:12 +08:00
## Demo
### flakestorm in Action

*Watch flakestorm generate mutations and test your agent in real-time*
### Test Report
2026-01-02 21:52:56 +08:00





2026-01-02 20:01:12 +08:00
*Interactive HTML reports with detailed failure analysis and recommendations*
2026-01-05 17:05:54 +08:00
## How Flakestorm Works
2025-12-30 22:33:47 +08:00
2026-01-05 17:05:54 +08:00
Flakestorm follows a simple but powerful workflow:
2025-12-30 22:33:47 +08:00
2026-01-05 17:05:54 +08:00
1. **You provide "Golden Prompts"** — example inputs that should always work correctly
2. **Flakestorm generates mutations** — using a local LLM, it creates adversarial variations:
- Paraphrases (same meaning, different words)
- Typos and noise (realistic user errors)
- Tone shifts (frustrated, urgent, aggressive users)
- Prompt injections (security attacks)
- Encoding attacks (Base64, URL encoding)
- Context manipulation (noisy, verbose inputs)
- Length extremes (empty, very long inputs)
3. **Your agent processes each mutation** — Flakestorm sends them to your agent endpoint
4. **Invariants are checked** — responses are validated against rules you define (latency, content, safety)
5. **Robustness Score is calculated** — weighted by mutation difficulty and importance
6. **Report is generated** — interactive HTML showing what passed, what failed, and why
2025-12-30 22:33:47 +08:00
2026-01-05 17:05:54 +08:00
The result: You know exactly how your agent will behave under stress before users ever see it.
2025-12-30 22:33:47 +08:00
2026-01-05 17:05:54 +08:00
## Features
2025-12-30 22:33:47 +08:00
2026-01-05 17:05:54 +08:00
- ✅ **8 Core Mutation Types** : Comprehensive robustness testing covering semantic, input, security, and edge cases
- ✅ **Invariant Assertions** : Deterministic checks, semantic similarity, basic safety
- ✅ **Local-First** : Uses Ollama with Qwen 3 8B for free testing
- ✅ **Beautiful Reports** : Interactive HTML reports with pass/fail matrices
2025-12-30 22:33:47 +08:00
2026-01-05 17:05:54 +08:00
## Toward a Zero-Setup Path
2026-01-02 22:32:18 +08:00
2026-01-05 18:28:28 +08:00
Future improvements include:
2025-12-28 21:55:01 +08:00
2026-01-05 17:05:54 +08:00
- **Cloud-hosted mutation generation**: No need to install Ollama locally
- **One-command setup**: Automated installation and configuration
- **Docker containers**: Pre-configured environments for instant testing
- **CI/CD integrations**: Native GitHub Actions, GitLab CI, and more
- **Comprehensive Reporting**: Dashboard and reports with team collaboration.
2026-01-02 22:32:18 +08:00
2026-01-05 17:05:54 +08:00
The goal: Test your agent's robustness with a single command, no local dependencies required.
2025-12-30 22:33:47 +08:00
2026-01-05 17:05:54 +08:00
For now, the local execution path gives you full control and privacy. As we build toward zero-setup, you'll always have the option to run everything locally.
2025-12-30 18:02:36 +08:00
2026-01-05 17:05:54 +08:00
# Try Flakestorm in ~60 Seconds
2025-12-30 18:02:36 +08:00
2026-01-05 17:05:54 +08:00
Want to see Flakestorm in action immediately? Here's the fastest path:
2025-12-28 21:55:01 +08:00
2026-01-05 17:05:54 +08:00
1. **Install flakestorm** (if you have Python 3.10+):
```bash
pip install flakestorm
```
2025-12-28 21:55:01 +08:00
2026-01-05 17:05:54 +08:00
2. **Initialize a test configuration** :
```bash
flakestorm init
```
2025-12-28 21:55:01 +08:00
2026-01-05 16:58:45 +08:00
3. **Point it at your agent** (edit `flakestorm.yaml` ):
```yaml
agent:
endpoint: "http://localhost:8000/invoke" # Your agent's endpoint
type: "http"
```
2025-12-28 21:55:01 +08:00
2026-01-05 16:58:45 +08:00
4. **Run your first test** :
```bash
flakestorm run
```
2025-12-28 21:55:01 +08:00
2026-01-05 17:05:54 +08:00
That's it! You'll get a robustness score and detailed report showing how your agent handles adversarial inputs.
> **Note**: For full local execution (including mutation generation), you'll need Ollama installed. See the [Usage Guide](docs/USAGE_GUIDE.md) for complete setup instructions.
2026-01-05 16:53:23 +08:00
2025-12-28 21:55:01 +08:00
## Documentation
2025-12-29 00:11:02 +08:00
### Getting Started
2026-01-05 16:53:23 +08:00
- [📖 Usage Guide ](docs/USAGE_GUIDE.md ) - Complete end-to-end guide (includes local setup)
2025-12-29 00:11:02 +08:00
- [⚙️ Configuration Guide ](docs/CONFIGURATION_GUIDE.md ) - All configuration options
2025-12-31 23:04:47 +08:00
- [🔌 Connection Guide ](docs/CONNECTION_GUIDE.md ) - How to connect FlakeStorm to your agent
2025-12-29 00:11:02 +08:00
- [🧪 Test Scenarios ](docs/TEST_SCENARIOS.md ) - Real-world examples with code
2026-01-01 17:29:41 +08:00
- [🔗 Integrations Guide ](docs/INTEGRATIONS_GUIDE.md ) - HuggingFace models & semantic similarity
2025-12-29 00:11:02 +08:00
### For Developers
- [🏗️ Architecture & Modules ](docs/MODULES.md ) - How the code works
- [❓ Developer FAQ ](docs/DEVELOPER_FAQ.md ) - Q& A about design decisions
- [🤝 Contributing ](docs/CONTRIBUTING.md ) - How to contribute
2026-01-04 23:56:13 +08:00
### Troubleshooting
- [🔧 Fix Installation Issues ](FIX_INSTALL.md ) - Resolve `ModuleNotFoundError: No module named 'flakestorm.reports'`
- [🔨 Fix Build Issues ](BUILD_FIX.md ) - Resolve `pip install .` vs `pip install -e .` problems
2025-12-29 00:11:02 +08:00
### Reference
- [📋 API Specification ](docs/API_SPECIFICATION.md ) - API reference
- [🧪 Testing Guide ](docs/TESTING_GUIDE.md ) - How to run and write tests
- [✅ Implementation Checklist ](docs/IMPLEMENTATION_CHECKLIST.md ) - Development progress
2025-12-28 21:55:01 +08:00
## License
2025-12-30 22:33:47 +08:00
Apache 2.0 - See [LICENSE ](LICENSE ) for details.
2025-12-28 21:55:01 +08:00
---
< p align = "center" >
2025-12-30 16:36:24 +08:00
< strong > Tested with Flakestorm< / strong > < br >
< img src = "https://img.shields.io/badge/tested%20with-flakestorm-brightgreen" alt = "Tested with Flakestorm" >
2025-12-28 21:55:01 +08:00
< / p >