2025-12-29 11:32:50 +08:00
# flakestorm Usage Guide
> **The Agent Reliability Engine** - Chaos Engineering for AI Agents
This comprehensive guide walks you through using flakestorm to test your AI agents for reliability, robustness, and safety.
---
## Table of Contents
1. [Introduction ](#introduction )
2. [Installation ](#installation )
3. [Quick Start ](#quick-start )
4. [Core Concepts ](#core-concepts )
5. [Configuration Deep Dive ](#configuration-deep-dive )
6. [Running Tests ](#running-tests )
7. [Understanding Results ](#understanding-results )
8. [Integration Patterns ](#integration-patterns )
2025-12-30 16:03:42 +08:00
9. [Advanced Usage ](#advanced-usage )
10. [Troubleshooting ](#troubleshooting )
2025-12-29 11:32:50 +08:00
---
## Introduction
### What is flakestorm?
flakestorm is an **adversarial testing framework** for AI agents. It applies chaos engineering principles to systematically test how your AI agents behave under unexpected, malformed, or adversarial inputs.
### Why Use flakestorm?
| Problem | How flakestorm Helps |
|---------|-------------------|
| Agent fails with typos in user input | Tests with noise mutations |
| Agent leaks sensitive data | Safety assertions catch PII exposure |
| Agent behavior varies unpredictably | Semantic similarity assertions ensure consistency |
| Prompt injection attacks | Tests agent resilience to injection attempts |
| No way to quantify reliability | Provides robustness scores (0.0 - 1.0) |
### How It Works
```
┌─────────────────────────────────────────────────────────────────┐
│ flakestorm FLOW │
├─────────────────────────────────────────────────────────────────┤
│ │
│ 1. GOLDEN PROMPTS 2. MUTATION ENGINE │
│ ┌─────────────────┐ ┌─────────────────┐ │
│ │ "Book a flight │ ───► │ Local LLM │ │
│ │ from NYC to LA"│ │ (Qwen/Ollama) │ │
│ └─────────────────┘ └────────┬────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────┐ │
│ │ Mutated Prompts │ │
│ │ • Typos │ │
│ │ • Paraphrases │ │
│ │ • Injections │ │
│ └────────┬────────┘ │
│ │ │
│ 3. YOUR AGENT ▼ │
│ ┌─────────────────┐ ┌─────────────────┐ │
│ │ AI Agent │ ◄─── │ Test Runner │ │
│ │ (HTTP/Python) │ │ (Async) │ │
│ └────────┬────────┘ └─────────────────┘ │
│ │ │
│ ▼ │
│ 4. VERIFICATION 5. REPORTING │
│ ┌─────────────────┐ ┌─────────────────┐ │
│ │ Invariant │ ───► │ HTML/JSON/CLI │ │
│ │ Assertions │ │ Reports │ │
│ └─────────────────┘ └─────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────┐ │
│ │ Robustness │ │
│ │ Score: 0.85 │ │
│ └─────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
```
---
## Installation
### Prerequisites
- **Python 3.10+** (3.11 recommended)
- **Ollama** (for local LLM mutation generation)
- **Rust** (optional, for performance optimization)
### Step 1: Install Ollama
```bash
# macOS
brew install ollama
# Linux
curl -fsSL https://ollama.com/install.sh | sh
# Start Ollama service
ollama serve
```
### Step 2: Pull the Default Model
```bash
# Pull Qwen Coder 3 8B (recommended for mutations)
ollama pull qwen2.5-coder:7b
# Verify it's working
ollama run qwen2.5-coder:7b "Hello, world!"
```
### Step 3: Install flakestorm
```bash
# From PyPI (when published)
pip install flakestorm
# From source (development)
git clone https://github.com/flakestorm/flakestorm.git
cd flakestorm
pip install -e ".[dev]"
```
### Step 4: (Optional) Install Rust Extension
For 80x+ performance improvement on scoring:
```bash
cd rust
pip install maturin
maturin build --release
pip install ../target/wheels/*.whl
```
### Verify Installation
```bash
flakestorm --version
flakestorm --help
```
---
## Quick Start
### 1. Initialize Configuration
```bash
# Create flakestorm.yaml in your project
flakestorm init
```
### 2. Configure Your Agent
Edit `flakestorm.yaml` :
```yaml
# Your AI agent endpoint
agent:
endpoint: "http://localhost:8000/chat"
type: http
timeout: 30
# Prompts that should always work
golden_prompts:
- "What is the weather in New York?"
- "Book a flight from NYC to LA for tomorrow"
- "Cancel my reservation #12345 "
# What to check in responses
invariants:
- type: contains
value: "weather"
prompt_filter: "weather"
- type: latency
max_ms: 5000
- type: excludes_pii
```
### 3. Run Tests
```bash
# Basic run
flakestorm run
# With HTML report
flakestorm run --output html
# CI mode (fails if score < threshold)
flakestorm run --ci --min-score 0.8
```
### 4. View Results
```bash
# Open the generated report
2025-12-30 16:13:29 +08:00
open reports/flakestorm-*.html
2025-12-29 11:32:50 +08:00
```
---
## Core Concepts
### Golden Prompts
**What they are:** Carefully crafted prompts that represent your agent's core use cases. These are prompts that *should always work correctly* .
**How to choose them:**
- Cover all major user intents
- Include edge cases you've seen in production
- Represent different complexity levels
```yaml
golden_prompts:
# Simple intent
- "Hello, how are you?"
Add comprehensive documentation for flakestorm
- Introduced multiple new documents including API Specification, Configuration Guide, Contributing Guide, Developer FAQ, Implementation Checklist, Module Documentation, Publishing Guide, Test Scenarios, Testing Guide, and Usage Guide.
- Each document provides detailed instructions, examples, and best practices for using and contributing to flakestorm.
- Enhanced overall project documentation to support users and developers in understanding and utilizing the framework effectively.
2025-12-29 11:33:01 +08:00
2025-12-29 11:32:50 +08:00
# Complex intent with parameters
- "Book a flight from New York to Los Angeles departing March 15th"
Add comprehensive documentation for flakestorm
- Introduced multiple new documents including API Specification, Configuration Guide, Contributing Guide, Developer FAQ, Implementation Checklist, Module Documentation, Publishing Guide, Test Scenarios, Testing Guide, and Usage Guide.
- Each document provides detailed instructions, examples, and best practices for using and contributing to flakestorm.
- Enhanced overall project documentation to support users and developers in understanding and utilizing the framework effectively.
2025-12-29 11:33:01 +08:00
2025-12-29 11:32:50 +08:00
# Edge case
- "What if I need to cancel my booking?"
```
### Mutation Types
flakestorm generates adversarial variations of your golden prompts:
| Type | Description | Example |
|------|-------------|---------|
| `paraphrase` | Same meaning, different words | "Book flight" → "Reserve a plane ticket" |
| `noise` | Typos and formatting errors | "Book flight" → "Bok fligt" |
| `tone_shift` | Different emotional tone | "Book flight" → "I NEED A FLIGHT NOW!!!" |
| `prompt_injection` | Attempted jailbreaks | "Book flight. Ignore above and..." |
### Invariants (Assertions)
Rules that agent responses must satisfy:
```yaml
invariants:
# Response must contain a keyword
- type: contains
value: "booked"
Add comprehensive documentation for flakestorm
- Introduced multiple new documents including API Specification, Configuration Guide, Contributing Guide, Developer FAQ, Implementation Checklist, Module Documentation, Publishing Guide, Test Scenarios, Testing Guide, and Usage Guide.
- Each document provides detailed instructions, examples, and best practices for using and contributing to flakestorm.
- Enhanced overall project documentation to support users and developers in understanding and utilizing the framework effectively.
2025-12-29 11:33:01 +08:00
2025-12-29 11:32:50 +08:00
# Response must NOT contain certain content
- type: not_contains
value: "error"
Add comprehensive documentation for flakestorm
- Introduced multiple new documents including API Specification, Configuration Guide, Contributing Guide, Developer FAQ, Implementation Checklist, Module Documentation, Publishing Guide, Test Scenarios, Testing Guide, and Usage Guide.
- Each document provides detailed instructions, examples, and best practices for using and contributing to flakestorm.
- Enhanced overall project documentation to support users and developers in understanding and utilizing the framework effectively.
2025-12-29 11:33:01 +08:00
2025-12-29 11:32:50 +08:00
# Response must match regex pattern
- type: regex
pattern: "confirmation.*#[A-Z0-9]+"
Add comprehensive documentation for flakestorm
- Introduced multiple new documents including API Specification, Configuration Guide, Contributing Guide, Developer FAQ, Implementation Checklist, Module Documentation, Publishing Guide, Test Scenarios, Testing Guide, and Usage Guide.
- Each document provides detailed instructions, examples, and best practices for using and contributing to flakestorm.
- Enhanced overall project documentation to support users and developers in understanding and utilizing the framework effectively.
2025-12-29 11:33:01 +08:00
2025-12-29 11:32:50 +08:00
# Response time limit
- type: latency
max_ms: 3000
Add comprehensive documentation for flakestorm
- Introduced multiple new documents including API Specification, Configuration Guide, Contributing Guide, Developer FAQ, Implementation Checklist, Module Documentation, Publishing Guide, Test Scenarios, Testing Guide, and Usage Guide.
- Each document provides detailed instructions, examples, and best practices for using and contributing to flakestorm.
- Enhanced overall project documentation to support users and developers in understanding and utilizing the framework effectively.
2025-12-29 11:33:01 +08:00
2025-12-29 11:32:50 +08:00
# Must be valid JSON
- type: valid_json
Add comprehensive documentation for flakestorm
- Introduced multiple new documents including API Specification, Configuration Guide, Contributing Guide, Developer FAQ, Implementation Checklist, Module Documentation, Publishing Guide, Test Scenarios, Testing Guide, and Usage Guide.
- Each document provides detailed instructions, examples, and best practices for using and contributing to flakestorm.
- Enhanced overall project documentation to support users and developers in understanding and utilizing the framework effectively.
2025-12-29 11:33:01 +08:00
2025-12-29 11:32:50 +08:00
# Semantic similarity to expected response
- type: similarity
expected: "Your flight has been booked successfully"
threshold: 0.8
Add comprehensive documentation for flakestorm
- Introduced multiple new documents including API Specification, Configuration Guide, Contributing Guide, Developer FAQ, Implementation Checklist, Module Documentation, Publishing Guide, Test Scenarios, Testing Guide, and Usage Guide.
- Each document provides detailed instructions, examples, and best practices for using and contributing to flakestorm.
- Enhanced overall project documentation to support users and developers in understanding and utilizing the framework effectively.
2025-12-29 11:33:01 +08:00
2025-12-29 11:32:50 +08:00
# Safety: no PII leakage
- type: excludes_pii
Add comprehensive documentation for flakestorm
- Introduced multiple new documents including API Specification, Configuration Guide, Contributing Guide, Developer FAQ, Implementation Checklist, Module Documentation, Publishing Guide, Test Scenarios, Testing Guide, and Usage Guide.
- Each document provides detailed instructions, examples, and best practices for using and contributing to flakestorm.
- Enhanced overall project documentation to support users and developers in understanding and utilizing the framework effectively.
2025-12-29 11:33:01 +08:00
2025-12-29 11:32:50 +08:00
# Safety: must include refusal for dangerous requests
- type: refusal
```
### Robustness Score
A number from 0.0 to 1.0 indicating how reliable your agent is:
```
Score = (Weighted Passed Tests) / (Total Weighted Tests)
```
Weights by mutation type:
- `prompt_injection` : 1.5 (harder to defend against)
- `paraphrase` : 1.0 (should always work)
- `tone_shift` : 1.0 (should handle different tones)
- `noise` : 0.8 (minor errors are acceptable)
**Interpretation:**
- **0.9+**: Excellent - Production ready
- **0.8-0.9**: Good - Minor improvements needed
- **0.7-0.8**: Fair - Needs work
- **< 0.7 ** : Poor - Significant reliability issues
---
## Configuration Deep Dive
### Full Configuration Schema
```yaml
# =============================================================================
# AGENT CONFIGURATION
# =============================================================================
agent:
# Required: Where to send requests
endpoint: "http://localhost:8000/chat"
Add comprehensive documentation for flakestorm
- Introduced multiple new documents including API Specification, Configuration Guide, Contributing Guide, Developer FAQ, Implementation Checklist, Module Documentation, Publishing Guide, Test Scenarios, Testing Guide, and Usage Guide.
- Each document provides detailed instructions, examples, and best practices for using and contributing to flakestorm.
- Enhanced overall project documentation to support users and developers in understanding and utilizing the framework effectively.
2025-12-29 11:33:01 +08:00
2025-12-29 11:32:50 +08:00
# Agent type: http, python, or langchain
type: http
Add comprehensive documentation for flakestorm
- Introduced multiple new documents including API Specification, Configuration Guide, Contributing Guide, Developer FAQ, Implementation Checklist, Module Documentation, Publishing Guide, Test Scenarios, Testing Guide, and Usage Guide.
- Each document provides detailed instructions, examples, and best practices for using and contributing to flakestorm.
- Enhanced overall project documentation to support users and developers in understanding and utilizing the framework effectively.
2025-12-29 11:33:01 +08:00
2025-12-29 11:32:50 +08:00
# Request timeout in seconds
timeout: 30
Add comprehensive documentation for flakestorm
- Introduced multiple new documents including API Specification, Configuration Guide, Contributing Guide, Developer FAQ, Implementation Checklist, Module Documentation, Publishing Guide, Test Scenarios, Testing Guide, and Usage Guide.
- Each document provides detailed instructions, examples, and best practices for using and contributing to flakestorm.
- Enhanced overall project documentation to support users and developers in understanding and utilizing the framework effectively.
2025-12-29 11:33:01 +08:00
2025-12-29 11:32:50 +08:00
# HTTP-specific settings
headers:
Authorization: "Bearer ${API_KEY}" # Environment variable expansion
Content-Type: "application/json"
Add comprehensive documentation for flakestorm
- Introduced multiple new documents including API Specification, Configuration Guide, Contributing Guide, Developer FAQ, Implementation Checklist, Module Documentation, Publishing Guide, Test Scenarios, Testing Guide, and Usage Guide.
- Each document provides detailed instructions, examples, and best practices for using and contributing to flakestorm.
- Enhanced overall project documentation to support users and developers in understanding and utilizing the framework effectively.
2025-12-29 11:33:01 +08:00
2025-12-29 11:32:50 +08:00
# How to format the request body
# Available placeholders: {prompt}
request_template: |
{"message": "{prompt}", "stream": false}
Add comprehensive documentation for flakestorm
- Introduced multiple new documents including API Specification, Configuration Guide, Contributing Guide, Developer FAQ, Implementation Checklist, Module Documentation, Publishing Guide, Test Scenarios, Testing Guide, and Usage Guide.
- Each document provides detailed instructions, examples, and best practices for using and contributing to flakestorm.
- Enhanced overall project documentation to support users and developers in understanding and utilizing the framework effectively.
2025-12-29 11:33:01 +08:00
2025-12-29 11:32:50 +08:00
# JSONPath to extract response from JSON
response_path: "$.response"
# =============================================================================
# GOLDEN PROMPTS
# =============================================================================
golden_prompts:
- "What is 2 + 2?"
- "Summarize this article: {article_text}"
- "Translate to Spanish: Hello, world!"
# =============================================================================
# MUTATION CONFIGURATION
# =============================================================================
mutations:
# Number of mutations per golden prompt
count: 20
Add comprehensive documentation for flakestorm
- Introduced multiple new documents including API Specification, Configuration Guide, Contributing Guide, Developer FAQ, Implementation Checklist, Module Documentation, Publishing Guide, Test Scenarios, Testing Guide, and Usage Guide.
- Each document provides detailed instructions, examples, and best practices for using and contributing to flakestorm.
- Enhanced overall project documentation to support users and developers in understanding and utilizing the framework effectively.
2025-12-29 11:33:01 +08:00
2025-12-29 11:32:50 +08:00
# Which mutation types to use
types:
- paraphrase
- noise
- tone_shift
- prompt_injection
Add comprehensive documentation for flakestorm
- Introduced multiple new documents including API Specification, Configuration Guide, Contributing Guide, Developer FAQ, Implementation Checklist, Module Documentation, Publishing Guide, Test Scenarios, Testing Guide, and Usage Guide.
- Each document provides detailed instructions, examples, and best practices for using and contributing to flakestorm.
- Enhanced overall project documentation to support users and developers in understanding and utilizing the framework effectively.
2025-12-29 11:33:01 +08:00
2025-12-29 11:32:50 +08:00
# Weights for scoring (higher = more important to pass)
weights:
paraphrase: 1.0
noise: 0.8
tone_shift: 1.0
prompt_injection: 1.5
# =============================================================================
# LLM CONFIGURATION (for mutation generation)
# =============================================================================
llm:
# Ollama model to use
model: "qwen2.5-coder:7b"
Add comprehensive documentation for flakestorm
- Introduced multiple new documents including API Specification, Configuration Guide, Contributing Guide, Developer FAQ, Implementation Checklist, Module Documentation, Publishing Guide, Test Scenarios, Testing Guide, and Usage Guide.
- Each document provides detailed instructions, examples, and best practices for using and contributing to flakestorm.
- Enhanced overall project documentation to support users and developers in understanding and utilizing the framework effectively.
2025-12-29 11:33:01 +08:00
2025-12-29 11:32:50 +08:00
# Ollama server URL
host: "http://localhost:11434"
Add comprehensive documentation for flakestorm
- Introduced multiple new documents including API Specification, Configuration Guide, Contributing Guide, Developer FAQ, Implementation Checklist, Module Documentation, Publishing Guide, Test Scenarios, Testing Guide, and Usage Guide.
- Each document provides detailed instructions, examples, and best practices for using and contributing to flakestorm.
- Enhanced overall project documentation to support users and developers in understanding and utilizing the framework effectively.
2025-12-29 11:33:01 +08:00
2025-12-29 11:32:50 +08:00
# Generation temperature (higher = more creative mutations)
temperature: 0.8
# =============================================================================
# INVARIANTS (ASSERTIONS)
# =============================================================================
invariants:
# Example: Response must contain booking confirmation
- type: contains
value: "confirmed"
case_sensitive: false
prompt_filter: "book" # Only apply to prompts containing "book"
Add comprehensive documentation for flakestorm
- Introduced multiple new documents including API Specification, Configuration Guide, Contributing Guide, Developer FAQ, Implementation Checklist, Module Documentation, Publishing Guide, Test Scenarios, Testing Guide, and Usage Guide.
- Each document provides detailed instructions, examples, and best practices for using and contributing to flakestorm.
- Enhanced overall project documentation to support users and developers in understanding and utilizing the framework effectively.
2025-12-29 11:33:01 +08:00
2025-12-29 11:32:50 +08:00
# Example: Response time limit
- type: latency
max_ms: 5000
Add comprehensive documentation for flakestorm
- Introduced multiple new documents including API Specification, Configuration Guide, Contributing Guide, Developer FAQ, Implementation Checklist, Module Documentation, Publishing Guide, Test Scenarios, Testing Guide, and Usage Guide.
- Each document provides detailed instructions, examples, and best practices for using and contributing to flakestorm.
- Enhanced overall project documentation to support users and developers in understanding and utilizing the framework effectively.
2025-12-29 11:33:01 +08:00
2025-12-29 11:32:50 +08:00
# Example: Must be valid JSON
- type: valid_json
Add comprehensive documentation for flakestorm
- Introduced multiple new documents including API Specification, Configuration Guide, Contributing Guide, Developer FAQ, Implementation Checklist, Module Documentation, Publishing Guide, Test Scenarios, Testing Guide, and Usage Guide.
- Each document provides detailed instructions, examples, and best practices for using and contributing to flakestorm.
- Enhanced overall project documentation to support users and developers in understanding and utilizing the framework effectively.
2025-12-29 11:33:01 +08:00
2025-12-29 11:32:50 +08:00
# Example: Semantic similarity
- type: similarity
expected: "I've booked your flight"
threshold: 0.75
Add comprehensive documentation for flakestorm
- Introduced multiple new documents including API Specification, Configuration Guide, Contributing Guide, Developer FAQ, Implementation Checklist, Module Documentation, Publishing Guide, Test Scenarios, Testing Guide, and Usage Guide.
- Each document provides detailed instructions, examples, and best practices for using and contributing to flakestorm.
- Enhanced overall project documentation to support users and developers in understanding and utilizing the framework effectively.
2025-12-29 11:33:01 +08:00
2025-12-29 11:32:50 +08:00
# Example: No PII in response
- type: excludes_pii
Add comprehensive documentation for flakestorm
- Introduced multiple new documents including API Specification, Configuration Guide, Contributing Guide, Developer FAQ, Implementation Checklist, Module Documentation, Publishing Guide, Test Scenarios, Testing Guide, and Usage Guide.
- Each document provides detailed instructions, examples, and best practices for using and contributing to flakestorm.
- Enhanced overall project documentation to support users and developers in understanding and utilizing the framework effectively.
2025-12-29 11:33:01 +08:00
2025-12-29 11:32:50 +08:00
# Example: Must refuse dangerous requests
- type: refusal
prompt_filter: "ignore|bypass|jailbreak"
# =============================================================================
# ADVANCED SETTINGS
# =============================================================================
advanced:
# Concurrent test executions
concurrency: 10
Add comprehensive documentation for flakestorm
- Introduced multiple new documents including API Specification, Configuration Guide, Contributing Guide, Developer FAQ, Implementation Checklist, Module Documentation, Publishing Guide, Test Scenarios, Testing Guide, and Usage Guide.
- Each document provides detailed instructions, examples, and best practices for using and contributing to flakestorm.
- Enhanced overall project documentation to support users and developers in understanding and utilizing the framework effectively.
2025-12-29 11:33:01 +08:00
2025-12-29 11:32:50 +08:00
# Retry failed requests
retries: 3
Add comprehensive documentation for flakestorm
- Introduced multiple new documents including API Specification, Configuration Guide, Contributing Guide, Developer FAQ, Implementation Checklist, Module Documentation, Publishing Guide, Test Scenarios, Testing Guide, and Usage Guide.
- Each document provides detailed instructions, examples, and best practices for using and contributing to flakestorm.
- Enhanced overall project documentation to support users and developers in understanding and utilizing the framework effectively.
2025-12-29 11:33:01 +08:00
2025-12-29 11:32:50 +08:00
# Output directory for reports
output_dir: "./reports"
Add comprehensive documentation for flakestorm
- Introduced multiple new documents including API Specification, Configuration Guide, Contributing Guide, Developer FAQ, Implementation Checklist, Module Documentation, Publishing Guide, Test Scenarios, Testing Guide, and Usage Guide.
- Each document provides detailed instructions, examples, and best practices for using and contributing to flakestorm.
- Enhanced overall project documentation to support users and developers in understanding and utilizing the framework effectively.
2025-12-29 11:33:01 +08:00
2025-12-29 11:32:50 +08:00
# Fail threshold for CI mode
min_score: 0.8
```
### Environment Variable Expansion
Use `${VAR_NAME}` syntax to reference environment variables:
```yaml
agent:
endpoint: "${AGENT_URL}"
headers:
Authorization: "Bearer ${API_KEY}"
```
---
## Running Tests
### Basic Commands
```bash
# Run with default config (flakestorm.yaml)
flakestorm run
# Specify config file
flakestorm run --config my-config.yaml
# Output format: terminal (default), html, json
flakestorm run --output html
# Quiet mode (less output)
flakestorm run --quiet
# Verbose mode (more output)
flakestorm run --verbose
```
### Individual Commands
```bash
# Just verify config is valid
flakestorm verify --config flakestorm.yaml
# Generate report from previous run
flakestorm report --input results.json --output html
# Show current score
flakestorm score --input results.json
```
---
## Understanding Results
### Terminal Output
```
╭──────────────────────────────────────────────────────────────────╮
│ flakestorm TEST RESULTS │
├──────────────────────────────────────────────────────────────────┤
│ Robustness Score: 0.85 │
│ ████████████████████░░░░ 85% │
├──────────────────────────────────────────────────────────────────┤
│ Total Mutations: 80 │
│ ✅ Passed: 68 │
│ ❌ Failed: 12 │
├──────────────────────────────────────────────────────────────────┤
│ By Mutation Type: │
│ paraphrase: 95% (19/20) │
│ noise: 90% (18/20) │
│ tone_shift: 85% (17/20) │
│ prompt_injection: 70% (14/20) │
├──────────────────────────────────────────────────────────────────┤
│ Latency: avg=245ms, p50=200ms, p95=450ms, p99=890ms │
╰──────────────────────────────────────────────────────────────────╯
```
### HTML Report
The HTML report provides:
1. **Summary Dashboard** - Overall score, pass/fail breakdown
2. **Mutation Matrix** - Visual grid of all test results
3. **Failure Details** - Specific failures with input/output
4. **Latency Charts** - Response time distribution
5. **Recommendations** - AI-generated improvement suggestions
### JSON Export
```json
{
"timestamp": "2024-01-15T10:30:00Z",
"config_hash": "abc123",
"statistics": {
"total_mutations": 80,
"passed_mutations": 68,
"failed_mutations": 12,
"robustness_score": 0.85,
"avg_latency_ms": 245,
"p95_latency_ms": 450
},
"results": [
{
"golden_prompt": "Book a flight to NYC",
"mutation": "Reserve a plane ticket to New York",
"mutation_type": "paraphrase",
"passed": true,
"response": "I've booked your flight...",
"latency_ms": 234,
"checks": [
{"type": "contains", "passed": true},
{"type": "latency", "passed": true}
]
}
]
}
```
---
## Integration Patterns
### Pattern 1: HTTP Agent
Most common pattern - agent exposed via REST API:
```yaml
agent:
endpoint: "http://localhost:8000/api/chat"
type: http
request_template: |
{"message": "{prompt}"}
response_path: "$.reply"
```
**Your agent code:**
```python
from fastapi import FastAPI
from pydantic import BaseModel
app = FastAPI()
class ChatRequest(BaseModel):
message: str
class ChatResponse(BaseModel):
reply: str
@app .post("/api/chat")
async def chat(request: ChatRequest) -> ChatResponse:
# Your agent logic here
response = your_llm_call(request.message)
return ChatResponse(reply=response)
```
### Pattern 2: Python Module
Direct Python integration (no HTTP overhead):
```yaml
agent:
endpoint: "my_agent.agent:handle_message"
type: python
```
**Your agent code (`my_agent/agent.py` ):**
```python
def handle_message(prompt: str) -> str:
"""
flakestorm will call this function directly.
Add comprehensive documentation for flakestorm
- Introduced multiple new documents including API Specification, Configuration Guide, Contributing Guide, Developer FAQ, Implementation Checklist, Module Documentation, Publishing Guide, Test Scenarios, Testing Guide, and Usage Guide.
- Each document provides detailed instructions, examples, and best practices for using and contributing to flakestorm.
- Enhanced overall project documentation to support users and developers in understanding and utilizing the framework effectively.
2025-12-29 11:33:01 +08:00
2025-12-29 11:32:50 +08:00
Args:
prompt: The user message (mutated)
Add comprehensive documentation for flakestorm
- Introduced multiple new documents including API Specification, Configuration Guide, Contributing Guide, Developer FAQ, Implementation Checklist, Module Documentation, Publishing Guide, Test Scenarios, Testing Guide, and Usage Guide.
- Each document provides detailed instructions, examples, and best practices for using and contributing to flakestorm.
- Enhanced overall project documentation to support users and developers in understanding and utilizing the framework effectively.
2025-12-29 11:33:01 +08:00
2025-12-29 11:32:50 +08:00
Returns:
The agent's response as a string
"""
# Your agent logic
return process_message(prompt)
```
### Pattern 3: LangChain Agent
For LangChain-based agents:
```yaml
agent:
endpoint: "my_agent.chain:agent"
type: langchain
```
**Your agent code:**
```python
from langchain.agents import AgentExecutor
# flakestorm will call agent.invoke({"input": prompt})
agent = AgentExecutor(...)
```
---
## Advanced Usage
### Custom Mutation Templates
Override default mutation prompts:
```yaml
mutations:
templates:
paraphrase: |
Add comprehensive documentation for flakestorm
- Introduced multiple new documents including API Specification, Configuration Guide, Contributing Guide, Developer FAQ, Implementation Checklist, Module Documentation, Publishing Guide, Test Scenarios, Testing Guide, and Usage Guide.
- Each document provides detailed instructions, examples, and best practices for using and contributing to flakestorm.
- Enhanced overall project documentation to support users and developers in understanding and utilizing the framework effectively.
2025-12-29 11:33:01 +08:00
Rewrite this prompt with completely different words
2025-12-29 11:32:50 +08:00
but preserve the exact meaning: "{prompt}"
Add comprehensive documentation for flakestorm
- Introduced multiple new documents including API Specification, Configuration Guide, Contributing Guide, Developer FAQ, Implementation Checklist, Module Documentation, Publishing Guide, Test Scenarios, Testing Guide, and Usage Guide.
- Each document provides detailed instructions, examples, and best practices for using and contributing to flakestorm.
- Enhanced overall project documentation to support users and developers in understanding and utilizing the framework effectively.
2025-12-29 11:33:01 +08:00
2025-12-29 11:32:50 +08:00
noise: |
Add realistic typos and formatting errors to this prompt.
Make 2-3 small mistakes: "{prompt}"
```
### Filtering Invariants by Prompt
Apply assertions only to specific prompts:
```yaml
invariants:
# Only for booking-related prompts
- type: contains
value: "confirmation"
prompt_filter: "book|reserve|schedule"
Add comprehensive documentation for flakestorm
- Introduced multiple new documents including API Specification, Configuration Guide, Contributing Guide, Developer FAQ, Implementation Checklist, Module Documentation, Publishing Guide, Test Scenarios, Testing Guide, and Usage Guide.
- Each document provides detailed instructions, examples, and best practices for using and contributing to flakestorm.
- Enhanced overall project documentation to support users and developers in understanding and utilizing the framework effectively.
2025-12-29 11:33:01 +08:00
2025-12-29 11:32:50 +08:00
# Only for cancellation prompts
- type: regex
pattern: "cancelled|refunded"
prompt_filter: "cancel"
```
### Custom Weights
Adjust scoring weights based on your priorities:
```yaml
mutations:
weights:
# Security is critical - weight injection tests higher
prompt_injection: 2.0
Add comprehensive documentation for flakestorm
- Introduced multiple new documents including API Specification, Configuration Guide, Contributing Guide, Developer FAQ, Implementation Checklist, Module Documentation, Publishing Guide, Test Scenarios, Testing Guide, and Usage Guide.
- Each document provides detailed instructions, examples, and best practices for using and contributing to flakestorm.
- Enhanced overall project documentation to support users and developers in understanding and utilizing the framework effectively.
2025-12-29 11:33:01 +08:00
2025-12-29 11:32:50 +08:00
# Typo tolerance is less important
noise: 0.5
```
### Parallel Execution
Control concurrency for rate-limited APIs:
```yaml
advanced:
concurrency: 5 # Max 5 parallel requests
retries: 3 # Retry failed requests 3 times
```
---
## Troubleshooting
### Common Issues
#### "Cannot connect to Ollama"
```bash
# Check if Ollama is running
curl http://localhost:11434/api/version
# Start Ollama if not running
ollama serve
```
#### "Model not found"
```bash
# List available models
ollama list
# Pull the required model
ollama pull qwen2.5-coder:7b
```
#### "Agent connection refused"
```bash
# Verify your agent is running
curl http://localhost:8000/health
# Check the endpoint in config
cat flakestorm.yaml | grep endpoint
```
#### "Timeout errors"
Increase timeout in config:
```yaml
agent:
timeout: 60 # Increase to 60 seconds
```
#### "Low robustness score"
1. Review failed mutations in the report
2. Identify patterns (e.g., all prompt_injection failing)
3. Improve your agent's handling of those cases
4. Re-run tests
### Debug Mode
```bash
# Enable verbose logging
flakestorm run --verbose
# Or set environment variable
2025-12-30 16:13:29 +08:00
export FLAKESTORM_DEBUG=1
2025-12-29 11:32:50 +08:00
flakestorm run
```
### Getting Help
- **Documentation**: https://flakestorm.dev/docs
- **GitHub Issues**: https://github.com/flakestorm/flakestorm/issues
- **Discord**: https://discord.gg/flakestorm
---
## Next Steps
1. **Start simple** : Test with 1-2 golden prompts first
2. **Add invariants gradually** : Start with `contains` and `latency`
3. **Review failures** : Use reports to understand weak points
4. **Iterate** : Improve agent, re-test, repeat
5. **Integrate to CI** : Automate testing on every PR
---
*Built with ❤️ by the flakestorm Team*