mirror of
https://github.com/flakestorm/flakestorm.git
synced 2026-04-25 00:36:54 +02:00
Enhance README.md to clarify the purpose and functionality of Flakestorm for production AI agents. Update descriptions to emphasize chaos testing, adversarial input handling, and CI/CD integration. Add sections on target users and production deployment patterns, ensuring comprehensive guidance for teams shipping AI agents.
This commit is contained in:
parent
efde15e9cb
commit
9e1204a9fe
1 changed files with 36 additions and 11 deletions
47
README.md
47
README.md
|
|
@ -2,7 +2,7 @@
|
|||
|
||||
<p align="center">
|
||||
<strong>The Agent Reliability Engine</strong><br>
|
||||
<em>Chaos Engineering for AI Agents</em>
|
||||
<em>Chaos Engineering for Production AI Agents</em>
|
||||
</p>
|
||||
|
||||
<p align="center">
|
||||
|
|
@ -20,26 +20,36 @@
|
|||
|
||||
**The "Happy Path" Fallacy**: Current AI development tools focus on getting an agent to work *once*. Developers tweak prompts until they get a correct answer, declare victory, and ship.
|
||||
|
||||
**The Reality**: LLMs are non-deterministic. An agent that works on Monday with `temperature=0.7` might fail on Tuesday. Users don't follow "Happy Paths" — they make typos, they're aggressive, they lie, and they attempt prompt injections.
|
||||
**The Reality**: LLMs are non-deterministic. An agent that works on Monday with `temperature=0.7` might fail on Tuesday. Production agents face real users who make typos, get aggressive, and attempt prompt injections. Real traffic exposes failures that happy-path testing misses.
|
||||
|
||||
**The Void**:
|
||||
- **Observability Tools** (LangSmith) tell you *after* the agent failed in production
|
||||
- **Eval Libraries** (RAGAS) focus on academic scores rather than system reliability
|
||||
- **CI Pipelines** lack chaos testing — agents ship untested against adversarial inputs
|
||||
- **Missing Link**: A tool that actively *attacks* the agent to prove robustness before deployment
|
||||
|
||||
## The Solution
|
||||
|
||||
**Flakestorm** is a local-first testing engine that applies **Chaos Engineering** principles to AI Agents.
|
||||
**Flakestorm** is a chaos testing layer for production AI agents. It applies **Chaos Engineering** principles to systematically test how your agents behave under adversarial inputs before real users encounter them.
|
||||
|
||||
Instead of running one test case, Flakestorm takes a single "Golden Prompt", generates adversarial mutations (semantic variations, noise injection, hostile tone, prompt injections), runs them against your agent, and calculates a **Robustness Score**.
|
||||
Instead of running one test case, Flakestorm takes a single "Golden Prompt", generates adversarial mutations (semantic variations, noise injection, hostile tone, prompt injections), runs them against your agent, and calculates a **Robustness Score**. Run it before deploy, in CI, or against production-like environments.
|
||||
|
||||
> **"If it passes Flakestorm, it won't break in Production."**
|
||||
|
||||
## Who Flakestorm Is For
|
||||
|
||||
- **Teams shipping AI agents to production** — Catch failures before users do
|
||||
- **Engineers running agents behind APIs** — Test against real-world abuse patterns
|
||||
- **Teams already paying for LLM APIs** — Reduce regressions and production incidents
|
||||
- **CI/CD pipelines** — Automated reliability gates before deployment
|
||||
|
||||
Flakestorm is built for production-grade agents handling real traffic. While it works great for exploration and hobby projects, it's designed to catch the failures that matter when agents are deployed at scale.
|
||||
|
||||
## Features
|
||||
|
||||
- ✅ **8 Core Mutation Types**: Comprehensive robustness testing covering semantic, input, security, and edge cases
|
||||
- ✅ **Invariant Assertions**: Deterministic checks, semantic similarity, basic safety
|
||||
- ✅ **Local-First**: Uses Ollama with Qwen 3 8B for free testing
|
||||
- ✅ **CI/CD Ready**: Run in pipelines with exit codes and score thresholds
|
||||
- ✅ **Beautiful Reports**: Interactive HTML reports with pass/fail matrices
|
||||
|
||||
## Demo
|
||||
|
|
@ -66,7 +76,9 @@ Instead of running one test case, Flakestorm takes a single "Golden Prompt", gen
|
|||
|
||||
## Quick Start
|
||||
|
||||
### Installation Order
|
||||
> **Note**: This local path is great for quick exploration. Production teams typically run Flakestorm in CI or cloud-based setups. See the [Usage Guide](docs/USAGE_GUIDE.md) for production deployment patterns.
|
||||
|
||||
### Local Installation (OSS)
|
||||
|
||||
1. **Install Ollama first** (system-level service)
|
||||
2. **Create virtual environment** (for Python packages)
|
||||
|
|
@ -75,7 +87,7 @@ Instead of running one test case, Flakestorm takes a single "Golden Prompt", gen
|
|||
|
||||
### Step 1: Install Ollama (System-Level)
|
||||
|
||||
FlakeStorm uses [Ollama](https://ollama.ai) for local model inference. Install this first:
|
||||
For local execution, FlakeStorm uses [Ollama](https://ollama.ai) for mutation generation. This is an implementation detail for the OSS path — production setups typically use cloud-based mutation services. Install this first:
|
||||
|
||||
**macOS Installation:**
|
||||
|
||||
|
|
@ -361,17 +373,20 @@ agent:
|
|||
module: "my_agent:chain"
|
||||
```
|
||||
|
||||
## Local Testing
|
||||
## CI/CD Integration
|
||||
|
||||
Flakestorm is designed to run in CI pipelines with configurable score thresholds:
|
||||
|
||||
For local testing and validation:
|
||||
```bash
|
||||
# Run with minimum score check
|
||||
flakestorm run --min-score 0.9
|
||||
|
||||
# Exit with error code if score is too low
|
||||
# Exit with error code if score is too low (for CI gates)
|
||||
flakestorm run --min-score 0.9 --ci
|
||||
```
|
||||
|
||||
For local testing and development, the same commands work without the `--ci` flag.
|
||||
|
||||
## Robustness Score
|
||||
|
||||
The Robustness Score is calculated as:
|
||||
|
|
@ -383,10 +398,20 @@ Where:
|
|||
- $D_{passed}$ = Deterministic tests passed
|
||||
- $W$ = Weights assigned by mutation difficulty
|
||||
|
||||
## Production Deployment
|
||||
|
||||
Local execution is ideal for exploration and development. For production agents, Flakestorm is evolving toward a zero-setup, cloud-based workflow that mirrors real deployments. The OSS local path will always remain available for teams who prefer self-hosted solutions.
|
||||
|
||||
See the [Usage Guide](docs/USAGE_GUIDE.md) for:
|
||||
- Local setup and Ollama configuration
|
||||
- Python environment details
|
||||
- Production deployment patterns
|
||||
- CI/CD integration examples
|
||||
|
||||
## Documentation
|
||||
|
||||
### Getting Started
|
||||
- [📖 Usage Guide](docs/USAGE_GUIDE.md) - Complete end-to-end guide
|
||||
- [📖 Usage Guide](docs/USAGE_GUIDE.md) - Complete end-to-end guide (includes local setup)
|
||||
- [⚙️ Configuration Guide](docs/CONFIGURATION_GUIDE.md) - All configuration options
|
||||
- [🔌 Connection Guide](docs/CONNECTION_GUIDE.md) - How to connect FlakeStorm to your agent
|
||||
- [🧪 Test Scenarios](docs/TEST_SCENARIOS.md) - Real-world examples with code
|
||||
|
|
|
|||
Loading…
Add table
Add a link
Reference in a new issue