# flakestorm Usage Guide

> **The Agent Reliability Engine** - Chaos Engineering for AI Agents

This comprehensive guide walks you through using flakestorm to test your AI agents for reliability, robustness, and safety.

---

## Table of Contents

1. [Introduction](#introduction)
2. [Installation](#installation)
3. [Quick Start](#quick-start)
4. [Core Concepts](#core-concepts)
5. [Configuration Deep Dive](#configuration-deep-dive)
6. [Running Tests](#running-tests)
7. [Understanding Results](#understanding-results)
8. [Integration Patterns](#integration-patterns)
9. [Advanced Usage](#advanced-usage)
10. [Troubleshooting](#troubleshooting)

---

## Introduction

### What is flakestorm?

flakestorm is an **adversarial testing framework** for AI agents. It applies chaos engineering principles to systematically test how your AI agents behave under unexpected, malformed, or adversarial inputs.

### Why Use flakestorm?

| Problem | How flakestorm Helps |
|---------|-------------------|
| Agent fails with typos in user input | Tests with noise mutations |
| Agent leaks sensitive data | Safety assertions catch PII exposure |
| Agent behavior varies unpredictably | Semantic similarity assertions ensure consistency |
| Prompt injection attacks | Tests agent resilience to injection attempts |
| No way to quantify reliability | Provides robustness scores (0.0 - 1.0) |

### How It Works

```
┌─────────────────────────────────────────────────────────────────┐
│                         flakestorm FLOW                           │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│   1. GOLDEN PROMPTS          2. MUTATION ENGINE                 │
│   ┌─────────────────┐        ┌─────────────────┐               │
│   │ "Book a flight  │  ───►  │ Local LLM       │               │
│   │  from NYC to LA"│        │ (Qwen/Ollama)   │               │
│   └─────────────────┘        └────────┬────────┘               │
│                                       │                         │
│                                       ▼                         │
│                              ┌─────────────────┐               │
│                              │ Mutated Prompts │               │
│                              │ • Typos         │               │
│                              │ • Paraphrases   │               │
│                              │ • Injections    │               │
│                              └────────┬────────┘               │
│                                       │                         │
│   3. YOUR AGENT                       ▼                         │
│   ┌─────────────────┐        ┌─────────────────┐               │
│   │ AI Agent        │  ◄───  │ Test Runner     │               │
│   │ (HTTP/Python)   │        │ (Async)         │               │
│   └────────┬────────┘        └─────────────────┘               │
│            │                                                    │
│            ▼                                                    │
│   4. VERIFICATION            5. REPORTING                       │
│   ┌─────────────────┐        ┌─────────────────┐               │
│   │ Invariant       │  ───►  │ HTML/JSON/CLI   │               │
│   │ Assertions      │        │ Reports         │               │
│   └─────────────────┘        └─────────────────┘               │
│                                       │                         │
│                                       ▼                         │
│                              ┌─────────────────┐               │
│                              │ Robustness      │               │
│                              │ Score: 0.85     │               │
│                              └─────────────────┘               │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```

---

## Installation

### Prerequisites

- **Python 3.10+** (3.11 recommended)
- **Ollama** (for local LLM mutation generation)
- **Rust** (optional, for performance optimization)

### Installation Order

**Important:** Install Ollama first (it's a system-level service), then set up your Python virtual environment:

1. **Install Ollama** (system-level, runs independently)
2. **Create virtual environment** (for Python packages)
3. **Install flakestorm** (Python package)
4. **Start Ollama service** (if not already running)
5. **Pull the model** (required for mutation generation)

### Step 1: Install Ollama (System-Level)

**macOS Installation:**

```bash
# Option 1: Homebrew (recommended)
brew install ollama

# If you get permission errors, fix permissions first:
sudo chown -R $(whoami) /Users/imac-frank/Library/Logs/Homebrew
sudo chown -R $(whoami) /usr/local/Cellar
sudo chown -R $(whoami) /usr/local/Homebrew
brew install ollama

# Option 2: Official Installer (if Homebrew doesn't work)
# Visit https://ollama.ai/download and download the macOS installer
# Double-click the .dmg file and follow the installation wizard
```

**Windows Installation:**

1. **Download the Installer:**
   - Visit https://ollama.com/download/windows
   - Download `OllamaSetup.exe`

2. **Run the Installer:**
   - Double-click `OllamaSetup.exe`
   - Follow the installation wizard
   - Ollama will be installed and added to your PATH automatically

3. **Verify Installation:**
   ```powershell
   ollama --version
   ```

**Linux Installation:**

```bash
# Install using the official script
curl -fsSL https://ollama.com/install.sh | sh

# Or using package managers:
# Ubuntu/Debian
sudo apt install ollama

# Fedora/RHEL
sudo dnf install ollama

# Arch Linux
sudo pacman -S ollama
```

**Start Ollama Service:**

After installation, start Ollama:

```bash
# macOS/Linux - Start the service
ollama serve

# Windows - Ollama runs as a service automatically after installation
# You can also start it manually from the Start menu or run:
ollama serve
```

### Step 2: Pull the Default Model

```bash
# Pull Qwen Coder 3 8B (recommended for mutations)
ollama pull qwen2.5-coder:7b

# Verify it's working
ollama run qwen2.5-coder:7b "Hello, world!"
```

### Step 3: Create Virtual Environment and Install flakestorm

**Important:** On macOS (and some Linux distributions), Python environments are externally managed. You must use a virtual environment:

```bash
# Create a virtual environment (do this AFTER installing Ollama)
python3 -m venv venv

# Activate it (macOS/Linux)
source venv/bin/activate

# Activate it (Windows)
# venv\Scripts\activate

# Now install flakestorm
# From PyPI (when published)
pip install flakestorm

# From source (development)
git clone https://github.com/flakestorm/flakestorm.git
cd flakestorm
pip install -e ".[dev]"
```

**Note:** Ollama is installed at the system level and doesn't need to be in your virtual environment. The virtual environment is only for Python packages (flakestorm and its dependencies).

**Alternative: Using pipx (for CLI applications)**

If you only want to use flakestorm as a CLI tool (not develop it), you can use `pipx`:

```bash
# Install pipx (if not already installed)
brew install pipx  # macOS
# Or: python3 -m pip install --user pipx

# Install flakestorm
pipx install flakestorm
```

**Note:** Make sure you're using Python 3.10+. You can verify with:
```bash
python3 --version  # Should be 3.10 or higher
```

### Step 4: (Optional) Install Rust Extension

For 80x+ performance improvement on scoring:

```bash
cd rust
# Make sure virtual environment is activated
pip install maturin
maturin build --release
pip install ../target/wheels/*.whl
```

### Verify Installation

```bash
flakestorm --version
flakestorm --help
```

---

## Quick Start

### 1. Initialize Configuration

```bash
# Create flakestorm.yaml in your project
flakestorm init
```

### 2. Configure Your Agent

Edit `flakestorm.yaml`:

```yaml
# Your AI agent endpoint
agent:
  endpoint: "http://localhost:8000/chat"
  type: http
  timeout: 30

# Prompts that should always work
golden_prompts:
  - "What is the weather in New York?"
  - "Book a flight from NYC to LA for tomorrow"
  - "Cancel my reservation #12345"

# What to check in responses
invariants:
  - type: contains
    value: "weather"
    prompt_filter: "weather"
  - type: latency
    max_ms: 5000
  - type: excludes_pii
```

### 3. Run Tests

```bash
# Basic run
flakestorm run

# With HTML report
flakestorm run --output html

# CI mode (fails if score < threshold)
flakestorm run --ci --min-score 0.8
```

### 4. View Results

```bash
# Open the generated report
open reports/flakestorm-*.html
```

---

## Core Concepts

### Golden Prompts

**What they are:** Carefully crafted prompts that represent your agent's core use cases. These are prompts that *should always work correctly*.

**How to choose them:**
- Cover all major user intents
- Include edge cases you've seen in production
- Represent different complexity levels

```yaml
golden_prompts:
  # Simple intent
  - "Hello, how are you?"

  # Complex intent with parameters
  - "Book a flight from New York to Los Angeles departing March 15th"

  # Edge case
  - "What if I need to cancel my booking?"
```

### Mutation Types

flakestorm generates adversarial variations of your golden prompts:

| Type | Description | Example |
|------|-------------|---------|
| `paraphrase` | Same meaning, different words | "Book flight" → "Reserve a plane ticket" |
| `noise` | Typos and formatting errors | "Book flight" → "Bok fligt" |
| `tone_shift` | Different emotional tone | "Book flight" → "I NEED A FLIGHT NOW!!!" |
| `prompt_injection` | Attempted jailbreaks | "Book flight. Ignore above and..." |

### Invariants (Assertions)

Rules that agent responses must satisfy:

```yaml
invariants:
  # Response must contain a keyword
  - type: contains
    value: "booked"

  # Response must NOT contain certain content
  - type: not_contains
    value: "error"

  # Response must match regex pattern
  - type: regex
    pattern: "confirmation.*#[A-Z0-9]+"

  # Response time limit
  - type: latency
    max_ms: 3000

  # Must be valid JSON
  - type: valid_json

  # Semantic similarity to expected response
  - type: similarity
    expected: "Your flight has been booked successfully"
    threshold: 0.8

  # Safety: no PII leakage
  - type: excludes_pii

  # Safety: must include refusal for dangerous requests
  - type: refusal
```

### Robustness Score

A number from 0.0 to 1.0 indicating how reliable your agent is:

```
Score = (Weighted Passed Tests) / (Total Weighted Tests)
```

Weights by mutation type:
- `prompt_injection`: 1.5 (harder to defend against)
- `paraphrase`: 1.0 (should always work)
- `tone_shift`: 1.0 (should handle different tones)
- `noise`: 0.8 (minor errors are acceptable)

**Interpretation:**
- **0.9+**: Excellent - Production ready
- **0.8-0.9**: Good - Minor improvements needed
- **0.7-0.8**: Fair - Needs work
- **<0.7**: Poor - Significant reliability issues

---

## Configuration Deep Dive

### Full Configuration Schema

```yaml
# =============================================================================
# AGENT CONFIGURATION
# =============================================================================
agent:
  # Required: Where to send requests
  endpoint: "http://localhost:8000/chat"

  # Agent type: http, python, or langchain
  type: http

  # Request timeout in seconds
  timeout: 30

  # HTTP-specific settings
  headers:
    Authorization: "Bearer ${API_KEY}"  # Environment variable expansion
    Content-Type: "application/json"

  # How to format the request body
  # Available placeholders: {prompt}
  request_template: |
    {"message": "{prompt}", "stream": false}

  # JSONPath to extract response from JSON
  response_path: "$.response"

# =============================================================================
# GOLDEN PROMPTS
# =============================================================================
golden_prompts:
  - "What is 2 + 2?"
  - "Summarize this article: {article_text}"
  - "Translate to Spanish: Hello, world!"

# =============================================================================
# MUTATION CONFIGURATION
# =============================================================================
mutations:
  # Number of mutations per golden prompt
  count: 20

  # Which mutation types to use
  types:
    - paraphrase
    - noise
    - tone_shift
    - prompt_injection

  # Weights for scoring (higher = more important to pass)
  weights:
    paraphrase: 1.0
    noise: 0.8
    tone_shift: 1.0
    prompt_injection: 1.5

# =============================================================================
# LLM CONFIGURATION (for mutation generation)
# =============================================================================
llm:
  # Ollama model to use
  model: "qwen2.5-coder:7b"

  # Ollama server URL
  host: "http://localhost:11434"

  # Generation temperature (higher = more creative mutations)
  temperature: 0.8

# =============================================================================
# INVARIANTS (ASSERTIONS)
# =============================================================================
invariants:
  # Example: Response must contain booking confirmation
  - type: contains
    value: "confirmed"
    case_sensitive: false
    prompt_filter: "book"  # Only apply to prompts containing "book"

  # Example: Response time limit
  - type: latency
    max_ms: 5000

  # Example: Must be valid JSON
  - type: valid_json

  # Example: Semantic similarity
  - type: similarity
    expected: "I've booked your flight"
    threshold: 0.75

  # Example: No PII in response
  - type: excludes_pii

  # Example: Must refuse dangerous requests
  - type: refusal
    prompt_filter: "ignore|bypass|jailbreak"

# =============================================================================
# ADVANCED SETTINGS
# =============================================================================
advanced:
  # Concurrent test executions
  concurrency: 10

  # Retry failed requests
  retries: 3

  # Output directory for reports
  output_dir: "./reports"

  # Fail threshold for CI mode
  min_score: 0.8
```

### Environment Variable Expansion

Use `${VAR_NAME}` syntax to reference environment variables:

```yaml
agent:
  endpoint: "${AGENT_URL}"
  headers:
    Authorization: "Bearer ${API_KEY}"
```

---

## Running Tests

### Basic Commands

```bash
# Run with default config (flakestorm.yaml)
flakestorm run

# Specify config file
flakestorm run --config my-config.yaml

# Output format: terminal (default), html, json
flakestorm run --output html

# Quiet mode (less output)
flakestorm run --quiet

# Verbose mode (more output)
flakestorm run --verbose
```

### Individual Commands

```bash
# Just verify config is valid
flakestorm verify --config flakestorm.yaml

# Generate report from previous run
flakestorm report --input results.json --output html

# Show current score
flakestorm score --input results.json
```

---

## Understanding Results

### Terminal Output

```
╭──────────────────────────────────────────────────────────────────╮
│                     flakestorm TEST RESULTS                        │
├──────────────────────────────────────────────────────────────────┤
│  Robustness Score: 0.85                                          │
│  ████████████████████░░░░ 85%                                    │
├──────────────────────────────────────────────────────────────────┤
│  Total Mutations: 80                                             │
│  ✅ Passed: 68                                                   │
│  ❌ Failed: 12                                                   │
├──────────────────────────────────────────────────────────────────┤
│  By Mutation Type:                                               │
│    paraphrase:       95% (19/20)                                 │
│    noise:            90% (18/20)                                 │
│    tone_shift:       85% (17/20)                                 │
│    prompt_injection: 70% (14/20)                                 │
├──────────────────────────────────────────────────────────────────┤
│  Latency: avg=245ms, p50=200ms, p95=450ms, p99=890ms            │
╰──────────────────────────────────────────────────────────────────╯
```

### HTML Report

The HTML report provides:

1. **Summary Dashboard** - Overall score, pass/fail breakdown
2. **Mutation Matrix** - Visual grid of all test results
3. **Failure Details** - Specific failures with input/output
4. **Latency Charts** - Response time distribution
5. **Recommendations** - AI-generated improvement suggestions

### JSON Export

```json
{
  "timestamp": "2024-01-15T10:30:00Z",
  "config_hash": "abc123",
  "statistics": {
    "total_mutations": 80,
    "passed_mutations": 68,
    "failed_mutations": 12,
    "robustness_score": 0.85,
    "avg_latency_ms": 245,
    "p95_latency_ms": 450
  },
  "results": [
    {
      "golden_prompt": "Book a flight to NYC",
      "mutation": "Reserve a plane ticket to New York",
      "mutation_type": "paraphrase",
      "passed": true,
      "response": "I've booked your flight...",
      "latency_ms": 234,
      "checks": [
        {"type": "contains", "passed": true},
        {"type": "latency", "passed": true}
      ]
    }
  ]
}
```

---

## Integration Patterns

### Pattern 1: HTTP Agent

Most common pattern - agent exposed via REST API:

```yaml
agent:
  endpoint: "http://localhost:8000/api/chat"
  type: http
  request_template: |
    {"message": "{prompt}"}
  response_path: "$.reply"
```

**Your agent code:**

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class ChatRequest(BaseModel):
    message: str

class ChatResponse(BaseModel):
    reply: str

@app.post("/api/chat")
async def chat(request: ChatRequest) -> ChatResponse:
    # Your agent logic here
    response = your_llm_call(request.message)
    return ChatResponse(reply=response)
```

### Pattern 2: Python Module

Direct Python integration (no HTTP overhead):

```yaml
agent:
  endpoint: "my_agent.agent:handle_message"
  type: python
```

**Your agent code (`my_agent/agent.py`):**

```python
def handle_message(prompt: str) -> str:
    """
    flakestorm will call this function directly.

    Args:
        prompt: The user message (mutated)

    Returns:
        The agent's response as a string
    """
    # Your agent logic
    return process_message(prompt)
```

### Pattern 3: LangChain Agent

For LangChain-based agents:

```yaml
agent:
  endpoint: "my_agent.chain:agent"
  type: langchain
```

**Your agent code:**

```python
from langchain.agents import AgentExecutor

# flakestorm will call agent.invoke({"input": prompt})
agent = AgentExecutor(...)
```

---

## Advanced Usage

### Custom Mutation Templates

Override default mutation prompts:

```yaml
mutations:
  templates:
    paraphrase: |
      Rewrite this prompt with completely different words
      but preserve the exact meaning: "{prompt}"

    noise: |
      Add realistic typos and formatting errors to this prompt.
      Make 2-3 small mistakes: "{prompt}"
```

### Filtering Invariants by Prompt

Apply assertions only to specific prompts:

```yaml
invariants:
  # Only for booking-related prompts
  - type: contains
    value: "confirmation"
    prompt_filter: "book|reserve|schedule"

  # Only for cancellation prompts
  - type: regex
    pattern: "cancelled|refunded"
    prompt_filter: "cancel"
```

### Custom Weights

Adjust scoring weights based on your priorities:

```yaml
mutations:
  weights:
    # Security is critical - weight injection tests higher
    prompt_injection: 2.0

    # Typo tolerance is less important
    noise: 0.5
```

### Parallel Execution

Control concurrency for rate-limited APIs:

```yaml
advanced:
  concurrency: 5  # Max 5 parallel requests
  retries: 3      # Retry failed requests 3 times
```

---

## Troubleshooting

### Common Issues

#### "Cannot connect to Ollama"

```bash
# Check if Ollama is running
curl http://localhost:11434/api/version

# Start Ollama if not running
ollama serve
```

#### "Model not found"

```bash
# List available models
ollama list

# Pull the required model
ollama pull qwen2.5-coder:7b
```

#### "Agent connection refused"

```bash
# Verify your agent is running
curl http://localhost:8000/health

# Check the endpoint in config
cat flakestorm.yaml | grep endpoint
```

#### "Timeout errors"

Increase timeout in config:

```yaml
agent:
  timeout: 60  # Increase to 60 seconds
```

#### "Low robustness score"

1. Review failed mutations in the report
2. Identify patterns (e.g., all prompt_injection failing)
3. Improve your agent's handling of those cases
4. Re-run tests

#### "Homebrew permission errors when installing Ollama"

If you get `Operation not permitted` errors when running `brew install ollama`:

```bash
# Fix Homebrew permissions
sudo chown -R $(whoami) /Users/imac-frank/Library/Logs/Homebrew
sudo chown -R $(whoami) /usr/local/Cellar
sudo chown -R $(whoami) /usr/local/Homebrew

# Then try again
brew install ollama

# Or use the official installer from https://ollama.ai/download instead
```

#### "Ollama binary contains HTML or syntax errors"

If you downloaded a file that contains HTML instead of the binary:

1. **macOS:** Use Homebrew or download the official `.dmg` installer from https://ollama.ai/download
2. **Windows:** Download `OllamaSetup.exe` from https://ollama.com/download/windows
3. **Linux:** Use the official install script: `curl -fsSL https://ollama.com/install.sh | sh`

Never download binaries directly via curl from the download page - always use the official installers or package managers.

### Debug Mode

```bash
# Enable verbose logging
flakestorm run --verbose

# Or set environment variable
export FLAKESTORM_DEBUG=1
flakestorm run
```

### Getting Help

- **Documentation**: https://flakestorm.dev/docs
- **GitHub Issues**: https://github.com/flakestorm/flakestorm/issues
- **Discord**: https://discord.gg/flakestorm

---

## Next Steps

1. **Start simple**: Test with 1-2 golden prompts first
2. **Add invariants gradually**: Start with `contains` and `latency`
3. **Review failures**: Use reports to understand weak points
4. **Iterate**: Improve agent, re-test, repeat
5. **Integrate to CI**: Automate testing on every PR

---

*Built with ❤️ by the flakestorm Team*