2025-12-30 16:36:24 +08:00
# Flakestorm
2025-12-28 21:55:01 +08:00
< p align = "center" >
< strong > The Agent Reliability Engine< / strong > < br >
< em > Chaos Engineering for AI Agents< / em >
< / p >
< p align = "center" >
2025-12-29 11:15:18 +08:00
< a href = "https://github.com/flakestorm/flakestorm/blob/main/LICENSE" >
2025-12-29 13:10:55 +08:00
< img src = "https://img.shields.io/badge/license-Apache--2.0-blue.svg" alt = "License" >
2025-12-28 21:55:01 +08:00
< / a >
2025-12-29 13:10:55 +08:00
< a href = "https://github.com/flakestorm/flakestorm" >
< img src = "https://img.shields.io/github/stars/flakestorm/flakestorm?style=social" alt = "GitHub Stars" >
2025-12-28 21:55:01 +08:00
< / a >
< / p >
---
## The Problem
**The "Happy Path" Fallacy**: Current AI development tools focus on getting an agent to work *once* . Developers tweak prompts until they get a correct answer, declare victory, and ship.
**The Reality**: LLMs are non-deterministic. An agent that works on Monday with `temperature=0.7` might fail on Tuesday. Users don't follow "Happy Paths" — they make typos, they're aggressive, they lie, and they attempt prompt injections.
**The Void**:
- **Observability Tools** (LangSmith) tell you *after* the agent failed in production
- **Eval Libraries** (RAGAS) focus on academic scores rather than system reliability
- **Missing Link**: A tool that actively *attacks* the agent to prove robustness before deployment
## The Solution
2025-12-30 16:36:24 +08:00
**Flakestorm** is a local-first testing engine that applies **Chaos Engineering** principles to AI Agents.
2025-12-28 21:55:01 +08:00
2025-12-30 16:36:24 +08:00
Instead of running one test case, Flakestorm takes a single "Golden Prompt", generates adversarial mutations (semantic variations, noise injection, hostile tone, prompt injections), runs them against your agent, and calculates a **Robustness Score** .
2025-12-28 21:55:01 +08:00
2025-12-30 16:36:24 +08:00
> **"If it passes Flakestorm, it won't break in Production."**
2025-12-29 00:11:02 +08:00
2026-01-04 23:28:43 +08:00
## What You Get in Minutes
2025-12-28 21:55:01 +08:00
2026-01-04 23:28:43 +08:00
Within minutes of setup, Flakestorm gives you:
- **Robustness Score**: A single number (0.0-1.0) that quantifies your agent's reliability
- **Failure Analysis**: Detailed reports showing exactly which mutations broke your agent and why
- **Security Insights**: Discover prompt injection vulnerabilities before attackers do
- **Edge Case Discovery**: Find boundary conditions that would cause production failures
- **Actionable Reports**: Interactive HTML reports with specific recommendations for improvement
No more guessing if your agent is production-ready. Flakestorm tells you exactly what will break and how to fix it.
2025-12-28 21:55:01 +08:00
2026-01-02 20:01:12 +08:00
## Demo
### flakestorm in Action

*Watch flakestorm generate mutations and test your agent in real-time*
### Test Report
2026-01-02 21:52:56 +08:00





2026-01-02 20:01:12 +08:00
*Interactive HTML reports with detailed failure analysis and recommendations*
2026-01-04 23:28:43 +08:00
## Try Flakestorm in ~60 Seconds
2025-12-30 18:36:42 +08:00
2026-01-04 23:28:43 +08:00
Want to see Flakestorm in action immediately? Here's the fastest path:
2025-12-30 18:36:42 +08:00
2026-01-04 23:28:43 +08:00
1. **Install flakestorm** (if you have Python 3.10+):
```bash
pip install flakestorm
```
2025-12-30 18:36:42 +08:00
2026-01-04 23:28:43 +08:00
2. **Initialize a test configuration** :
```bash
flakestorm init
```
2025-12-30 18:36:42 +08:00
2026-01-04 23:28:43 +08:00
3. **Point it at your agent** (edit `flakestorm.yaml` ):
```yaml
agent:
endpoint: "http://localhost:8000/invoke" # Your agent's endpoint
type: "http"
```
2025-12-30 18:36:42 +08:00
2026-01-04 23:28:43 +08:00
4. **Run your first test** :
```bash
flakestorm run
```
2025-12-30 18:36:42 +08:00
2026-01-04 23:28:43 +08:00
That's it! You'll get a robustness score and detailed report showing how your agent handles adversarial inputs.
2025-12-30 18:36:42 +08:00
2026-01-04 23:28:43 +08:00
> **Note**: For full local execution (including mutation generation), you'll need Ollama installed. See the [Local Execution](#local-execution-advanced--power-users) section below or the [Usage Guide](docs/USAGE_GUIDE.md) for complete setup instructions.
2025-12-30 22:33:47 +08:00
2026-01-04 23:28:43 +08:00
## How Flakestorm Works
2025-12-30 22:33:47 +08:00
2026-01-04 23:28:43 +08:00
Flakestorm follows a simple but powerful workflow:
2025-12-30 22:33:47 +08:00
2026-01-04 23:28:43 +08:00
1. **You provide "Golden Prompts"** — example inputs that should always work correctly
2. **Flakestorm generates mutations** — using a local LLM, it creates adversarial variations:
- Paraphrases (same meaning, different words)
- Typos and noise (realistic user errors)
- Tone shifts (frustrated, urgent, aggressive users)
- Prompt injections (security attacks)
- Encoding attacks (Base64, URL encoding)
- Context manipulation (noisy, verbose inputs)
- Length extremes (empty, very long inputs)
3. **Your agent processes each mutation** — Flakestorm sends them to your agent endpoint
4. **Invariants are checked** — responses are validated against rules you define (latency, content, safety)
5. **Robustness Score is calculated** — weighted by mutation difficulty and importance
6. **Report is generated** — interactive HTML showing what passed, what failed, and why
2025-12-30 22:33:47 +08:00
2026-01-04 23:28:43 +08:00
The result: You know exactly how your agent will behave under stress before users ever see it.
2025-12-30 22:33:47 +08:00
2026-01-04 23:28:43 +08:00
## Features
2025-12-30 22:33:47 +08:00
2026-01-04 23:28:43 +08:00
- ✅ **8 Core Mutation Types** : Comprehensive robustness testing covering semantic, input, security, and edge cases
- ✅ **Invariant Assertions** : Deterministic checks, semantic similarity, basic safety
- ✅ **Local-First** : Uses Ollama with Qwen 3 8B for free testing
- ✅ **Beautiful Reports** : Interactive HTML reports with pass/fail matrices
2025-12-30 22:33:47 +08:00
2026-01-04 23:28:43 +08:00
## Local Execution (Advanced / Power Users)
2026-01-02 22:32:18 +08:00
2026-01-04 23:28:43 +08:00
For full local execution with mutation generation, you'll need to set up Ollama and configure your Python environment. This section covers the complete setup process for users who want to run everything locally without external dependencies.
2025-12-28 21:55:01 +08:00
2026-01-04 23:28:43 +08:00
> **Quick Setup**: For detailed installation instructions, troubleshooting, and configuration options, see the [Usage Guide](docs/USAGE_GUIDE.md). The guide includes step-by-step instructions for Ollama installation, Python environment setup, model selection, and advanced configuration.
2026-01-02 22:32:18 +08:00
2026-01-04 23:28:43 +08:00
## Toward a Zero-Setup Path
We're working on making Flakestorm even easier to use. Future improvements include:
- **Cloud-hosted mutation generation**: No need to install Ollama locally
- **One-command setup**: Automated installation and configuration
- **Docker containers**: Pre-configured environments for instant testing
- **CI/CD integrations**: Native GitHub Actions, GitLab CI, and more
2026-01-04 23:39:24 +08:00
- **Comprehensive Reporting**: Dashboard and reports with team collaboration.
2026-01-04 23:28:43 +08:00
The goal: Test your agent's robustness with a single command, no local dependencies required.
For now, the local execution path gives you full control and privacy. As we build toward zero-setup, you'll always have the option to run everything locally.
2025-12-29 00:11:02 +08:00
2025-12-28 21:55:01 +08:00
## Documentation
2025-12-29 00:11:02 +08:00
### Getting Started
- [📖 Usage Guide ](docs/USAGE_GUIDE.md ) - Complete end-to-end guide
- [⚙️ Configuration Guide ](docs/CONFIGURATION_GUIDE.md ) - All configuration options
2025-12-31 23:04:47 +08:00
- [🔌 Connection Guide ](docs/CONNECTION_GUIDE.md ) - How to connect FlakeStorm to your agent
2025-12-29 00:11:02 +08:00
- [🧪 Test Scenarios ](docs/TEST_SCENARIOS.md ) - Real-world examples with code
2026-01-01 17:29:41 +08:00
- [🔗 Integrations Guide ](docs/INTEGRATIONS_GUIDE.md ) - HuggingFace models & semantic similarity
2025-12-29 00:11:02 +08:00
### For Developers
- [🏗️ Architecture & Modules ](docs/MODULES.md ) - How the code works
- [❓ Developer FAQ ](docs/DEVELOPER_FAQ.md ) - Q& A about design decisions
- [📦 Publishing Guide ](docs/PUBLISHING.md ) - How to publish to PyPI
- [🤝 Contributing ](docs/CONTRIBUTING.md ) - How to contribute
### Reference
- [📋 API Specification ](docs/API_SPECIFICATION.md ) - API reference
- [🧪 Testing Guide ](docs/TESTING_GUIDE.md ) - How to run and write tests
- [✅ Implementation Checklist ](docs/IMPLEMENTATION_CHECKLIST.md ) - Development progress
2025-12-28 21:55:01 +08:00
## License
2025-12-30 22:33:47 +08:00
Apache 2.0 - See [LICENSE ](LICENSE ) for details.
2025-12-28 21:55:01 +08:00
---
< p align = "center" >
2025-12-30 16:36:24 +08:00
< strong > Tested with Flakestorm< / strong > < br >
< img src = "https://img.shields.io/badge/tested%20with-flakestorm-brightgreen" alt = "Tested with Flakestorm" >
2025-12-28 21:55:01 +08:00
< / p >